Re: [REPORT] cfs-v4 vs sd-0.44
On Wed, Apr 25, 2007 at 04:58:40AM -0700, William Lee Irwin III wrote:
>>> Adjustments to the lag computation for arrivals and departures
>>> during execution are among the missing pieces. Some algorithmic devices
>>> are also needed to account for the varying growth rates of lags of tasks
>>> waiting to run, which arise from differing priorities/weights.

On Wed, 2007-04-25 at 22:13 +0200, Willy Tarreau wrote:
>> that was the principle of my proposal of sorting tasks by expected completion
>> time and using +/- credit to compensate for too large/too short a slice used.

On Thu, Apr 26, 2007 at 10:57:48AM -0700, Li, Tong N wrote:
> Yeah, it's a good algorithm. It's a variant of earliest deadline first
> (EDF). There are also similar ones in the literature, such as earliest
> eligible virtual deadline first (EEVDF) and biased virtual finishing
> time (BVFT). Based on wli's explanation, I think Ingo's approach would
> also fall into this category. With careful design, all such algorithms
> that order tasks based on some notion of time can achieve good fairness.
> There are some subtle differences. Some algorithms of this type can
> achieve a constant lag bound, but some have only a constant positive lag
> bound and an O(N) negative lag bound, meaning some tasks could receive
> much more CPU time than they would under ideal fairness when the number
> of tasks is high.

The algorithm is in a bit of flux, but the virtual deadline computation is
rather readable. You may be able to tell whether cfs is affected by the
negative lag issue better than I. For the most part, all I can smoke out is
that it's not apparent to me whether load balancing is done the way it needs
to be.

On Thu, Apr 26, 2007 at 10:57:48AM -0700, Li, Tong N wrote:
> On the other hand, the log(N) complexity of this type of algorithm has
> been a concern in the research community.
> This motivated O(1) round-robin-based algorithms such as deficit
> round-robin (DRR) and smoothed round-robin (SRR) in networking, and
> virtual-time round-robin (VTRR), group ratio round-robin (GR3) and
> grouped distributed queues (GDQ) in OS scheduling, as well as the
> distributed weighted round-robin (DWRR) one I posted earlier.

I'm going to make a bold statement: I don't think O(lg(n)) is bad at all. In
real systems there are constraints related to per-task memory footprints that
severely restrict the domain of the performance metric, rendering O(lg(n))
bounded by a rather reasonable constant.

A larger concern to me is whether this affair actually achieves its design
goals and, to a lesser extent, in what contexts those design goals are truly
crucial or dominant as opposed to others, such as, say, interactivity. It is
clear, regardless of general applicability, that the predictability of
behavior with regard to strict fairness is going to be useful in certain
contexts.

Another point in favor of the virtual deadline design is that virtual
deadlines can very effectively emulate a broad spectrum of algorithms. For
instance, the mainline "O(1) scheduler" can be emulated using such a queueing
mechanism. Even if the particular policy cfs now implements is dumped,
radically different policies can be expressed with its queueing mechanism.
This has maintenance implications which are quite beneficial.

That said, it's far from an unqualified endorsement. I'd still like to see
much done differently.

-- wli

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
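[Editor's sketch, not part of the original mail.] The claim that a virtual
deadline queue can emulate simpler policies can be illustrated with a toy
model (names like deadline_schedule are invented here, nothing to do with
cfs's actual code): give each task a deadline of last_deadline +
quantum/weight and always run the earliest deadline. With equal weights this
degenerates into plain round-robin.

```python
import heapq

def deadline_schedule(tasks, quantum, steps):
    """Toy virtual-deadline queue: always run the task with the earliest
    deadline, then push its deadline out by quantum/weight."""
    heap = []
    for i, (name, weight) in enumerate(tasks):
        # i breaks ties deterministically, standing in for FIFO order
        heapq.heappush(heap, (quantum / weight, i, name, weight))
    order = []
    for _ in range(steps):
        deadline, i, name, weight = heapq.heappop(heap)
        order.append(name)
        heapq.heappush(heap, (deadline + quantum / weight, i, name, weight))
    return order
```

With weights (1, 1, 1) the pick order is A B C A B C (round-robin); with
weights (2, 1), A runs twice for each run of B, matching the weight ratio.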
Re: [REPORT] cfs-v4 vs sd-0.44
On Thu, Apr 26, 2007 at 10:57:48AM -0700, Li, Tong N wrote:
> On Wed, 2007-04-25 at 22:13 +0200, Willy Tarreau wrote:
> > On Wed, Apr 25, 2007 at 04:58:40AM -0700, William Lee Irwin III wrote:
> > > Adjustments to the lag computation for arrivals and departures
> > > during execution are among the missing pieces. Some algorithmic devices
> > > are also needed to account for the varying growth rates of lags of tasks
> > > waiting to run, which arise from differing priorities/weights.
> >
> > that was the principle of my proposal of sorting tasks by expected
> > completion time and using +/- credit to compensate for too large/too
> > short a slice used.
> >
> > Willy
>
> Yeah, it's a good algorithm. It's a variant of earliest deadline first
> (EDF). There are also similar ones in the literature, such as earliest
> eligible virtual deadline first (EEVDF) and biased virtual finishing
> time (BVFT). Based on wli's explanation, I think Ingo's approach would
> also fall into this category. With careful design, all such algorithms
> that order tasks based on some notion of time can achieve good fairness.
> There are some subtle differences. Some algorithms of this type can
> achieve a constant lag bound, but some have only a constant positive lag
> bound and an O(N) negative lag bound,

Anyway, we're working in discrete time, not continuous time. Lag is
unavoidable. At best it can be bounded and compensated for.

The first time I thought about this algorithm, I was looking at a line drawn
using the Bresenham algorithm. That algorithm is all about compensating the
error against a perfect expectation: too high, too far, too high, too far...
I thought that the line could represent a task's progress as a function of
time, and the pixels the periods the task spends on the CPU. On short
intervals, you lose. On large ones, you're very close to the ideal case.

> meaning some tasks could receive
> much more CPU time than they would under ideal fairness when the number
> of tasks is high.
It's not a problem that a task receives much more CPU than it should,
provided that:

a) it may do so only for a short and bounded time, typically less than the
   maximum acceptable latency for other tasks

b) the excess CPU time it received is accounted for, so that it is deducted
   from subsequent passes.

> On the other hand, the log(N) complexity of this type of algorithm has
> been a concern in the research community. This motivated O(1)

I was in favor of O(1) algorithms for a long time, but seeing how dumb the
things you can do in O(1) are compared to O(logN), I definitely changed my
mind. Also, with O(logN), you often still have a lot of common operations in
O(1). E.g., you insert in O(logN) into a time-ordered tree, but you read
from it in O(1), and you still have the ability to change its content in
O(logN) if you need to.

Last but not least, spending one hundred cycles a few thousand times a second
is nothing compared to electing the wrong task for a full time-slice.

Willy
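[Editor's sketch, not part of the original mail.] Willy's Bresenham analogy
and his +/- credit proposal can be sketched as follows (the function name and
structure are invented for illustration): every slice, each task earns credit
proportional to its weight (the "ideal line"), the most under-served task
runs, and running costs one full slice. The error term oscillates around zero
exactly like Bresenham's, so over long intervals service converges on the
weight ratio.

```python
def credit_schedule(weights, nslices):
    """Run the task with the most accumulated credit; credits track the
    signed error between ideal and actual service, Bresenham-style."""
    total = sum(weights.values())
    credit = {t: 0.0 for t in weights}
    trace = []
    for _ in range(nslices):
        for t, w in weights.items():
            credit[t] += w / total          # ideal progress this slice
        # most under-served task runs; ties broken alphabetically
        chosen = max(sorted(credit), key=lambda t: credit[t])
        credit[chosen] -= 1.0               # it consumed one full slice
        trace.append(chosen)
    return trace
```

With weights A=2, B=1 the trace settles into the pattern A, B, A, so A gets
two thirds of the slices, while no task is ever more than one slice ahead of
its ideal share.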
Re: [REPORT] cfs-v4 vs sd-0.44
On Wed, 2007-04-25 at 22:13 +0200, Willy Tarreau wrote:
> On Wed, Apr 25, 2007 at 04:58:40AM -0700, William Lee Irwin III wrote:
> > Adjustments to the lag computation for arrivals and departures
> > during execution are among the missing pieces. Some algorithmic devices
> > are also needed to account for the varying growth rates of lags of tasks
> > waiting to run, which arise from differing priorities/weights.
>
> that was the principle of my proposal of sorting tasks by expected
> completion time and using +/- credit to compensate for too large/too
> short a slice used.
>
> Willy

Yeah, it's a good algorithm. It's a variant of earliest deadline first
(EDF). There are also similar ones in the literature, such as earliest
eligible virtual deadline first (EEVDF) and biased virtual finishing time
(BVFT). Based on wli's explanation, I think Ingo's approach would also fall
into this category. With careful design, all such algorithms that order
tasks based on some notion of time can achieve good fairness. There are some
subtle differences. Some algorithms of this type can achieve a constant lag
bound, but some have only a constant positive lag bound and an O(N) negative
lag bound, meaning some tasks could receive much more CPU time than they
would under ideal fairness when the number of tasks is high.

On the other hand, the log(N) complexity of this type of algorithm has been
a concern in the research community. This motivated O(1) round-robin-based
algorithms such as deficit round-robin (DRR) and smoothed round-robin (SRR)
in networking, and virtual-time round-robin (VTRR), group ratio round-robin
(GR3) and grouped distributed queues (GDQ) in OS scheduling, as well as the
distributed weighted round-robin (DWRR) one I posted earlier.

tong
SD renice recommendation was: Re: [REPORT] cfs-v4 vs sd-0.44
On Tuesday 24 April 2007 16:36, Ingo Molnar wrote:
> So, my point is, the nice level of X for desktop users should not be set
> lower than a low limit suggested by that particular scheduler's author.
> That limit is scheduler-specific. Con i think recommends a nice level of
> -1 for X when using SD [Con, can you confirm?], while my tests show that
> if you want you can go as low as -10 under CFS, without any bad
> side-effects. (-19 was a bit too much)

Nice 0 as a default for X, but if renicing, nice -10 as the lower limit for
X on SD. The reason for that on SD is that the priority of freshly woken-up
tasks (i.e. not fully cpu bound) for both nice 0 and nice -10 will still be
the same at PRIO 1 (see the prio_matrix). Therefore, there will _not_ be
preemption of the nice 0 task and a context switch _unless_ it is already
cpu bound, has consumed a certain number of cycles, and has been demoted.

Contrary to popular belief, it is not universal that a less-niced task will
preempt its more-niced counterpart; it depends entirely on the
implementation of nice. Yes, it is true that the context switch rate will go
up with a reniced X because the conditions that lead to preemption are more
likely to be met, but it is definitely not every single wakeup of the
reniced X.

Alas, again, I am forced to spend as little time as possible at the pc for
my health, so expect _very few_ responses via email from me. Luckily SD is
in pretty fine shape with version 0.46.

--
-ck
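[Editor's sketch, not part of the original mail.] Con's point, that wakeup
preemption depends on the nice implementation rather than on raw nice values,
can be caricatured like this. The numbers and the effective_prio formula are
entirely made up for illustration and are NOT SD's real prio_matrix; the only
property carried over from the mail is that fresh wakeups at nice 0 and nice
-10 land on the same priority, so no preemption happens until the running
task has been demoted by consuming CPU.

```python
def effective_prio(nice, slices_consumed):
    # Hypothetical model: fresh wakeups all start at PRIO 1 for this nice
    # range; niceness only affects how quickly a CPU-bound task is demoted.
    if slices_consumed == 0:
        return 1
    return 1 + slices_consumed + max(0, nice)

def wakeup_preempts(woken_nice, current_nice, current_slices):
    # Preempt only if the woken task's priority is strictly better (lower).
    return effective_prio(woken_nice, 0) < effective_prio(current_nice, current_slices)
```

So a nice -10 wakeup does not preempt a freshly scheduled nice 0 task, but
does preempt one that has already burned several slices and been demoted.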
Re: [REPORT] cfs-v4 vs sd-0.44
On Wed, Apr 25, 2007 at 04:58:40AM -0700, William Lee Irwin III wrote:
> Adjustments to the lag computation for arrivals and departures
> during execution are among the missing pieces. Some algorithmic devices
> are also needed to account for the varying growth rates of lags of tasks
> waiting to run, which arise from differing priorities/weights.

that was the principle of my proposal of sorting tasks by expected completion
time and using +/- credit to compensate for too large/too short a slice used.

Willy
Re: [REPORT] cfs-v4 vs sd-0.44
* Li, Tong N <[EMAIL PROTECTED]> wrote:
>> [...] A corollary of this is that if both threads i and j are
>> continuously runnable with fixed weights in the time interval, then
>> the ratio of their CPU time should be equal to the ratio of their
>> weights. This definition is pretty restrictive since it requires the
>> properties to hold for any thread in any interval, which is not
>> feasible. [...]

On Wed, Apr 25, 2007 at 11:44:03AM +0200, Ingo Molnar wrote:
> yes, it's a pretty strong definition, but also note that while it is
> definitely not easy to implement, the solution is nevertheless feasible
> in my opinion and there exists a scheduler that implements it: CFS.

The feasibility comment refers to the unimplementability of schedulers with
infinitesimal timeslices/quanta/sched_granularity_ns. It's no failing of cfs
(or any other scheduler) if, say, the ratios are not exact within a time
interval of one nanosecond or one picosecond.

One of the reasons you get the results you do is that what you use for
->fair_key is very close to the definition of lag, which is used as a metric
of fairness. It differs in a couple of ways, but how it's computed and used
for queueing can be altered to match more precisely.

The basic concept you appear to be trying to implement is a greedy
algorithm: run the task with the largest lag first. As far as I can tell,
this is sound enough, though I have no formal proof. So with the lag
computation and queueing adjusted appropriately, it should work out.

Adjustments to the lag computation for arrivals and departures during
execution are among the missing pieces. Some algorithmic devices are also
needed to account for the varying growth rates of lags of tasks waiting to
run, which arise from differing priorities/weights. There are no mysteries.
-- wli
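[Editor's sketch, not part of the original mail.] The greedy rule wli
describes, run the task with the largest lag first, can be written out in a
few lines (names are illustrative, this is not cfs's ->fair_key code): at
each decision point compute each task's lag (ideal GPS service minus actual
service) and run the task whose lag is largest.

```python
def largest_lag_first(weights, nslices, slice_ns=1):
    """Greedy largest-lag-first scheduling over fixed slices."""
    total_w = sum(weights.values())
    service = {t: 0 for t in weights}
    now = 0
    trace = []
    for _ in range(nslices):
        # lag = ideal GPS service so far minus actual service received
        lag = {t: now * w / total_w - service[t] for t, w in weights.items()}
        chosen = max(sorted(lag), key=lambda t: lag[t])  # ties: alphabetical
        trace.append(chosen)
        service[chosen] += slice_ns
        now += slice_ns
    return service, trace
```

With equal weights this alternates A, B, A, B; with weights 3:1 the service
split after four slices is 3:1, and no task's lag ever exceeds one slice.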
Re: [REPORT] cfs-v4 vs sd-0.44
> > it into some xorg.conf field. (It also makes sure that X isnt preempted
> > by other userspace stuff while it does timing-sensitive operations like
> > setting the video modes up or switching video modes, etc.)
>
> X is privileged. It can just cli around the critical section.

Not really. X can use iopl(3), but if it disables interrupts you get
priority inversions and hangs, so in practice it can't do that.

Alan
Re: [REPORT] cfs-v4 vs sd-0.44
* Li, Tong N <[EMAIL PROTECTED]> wrote:
> [...] A corollary of this is that if both threads i and j are
> continuously runnable with fixed weights in the time interval, then
> the ratio of their CPU time should be equal to the ratio of their
> weights. This definition is pretty restrictive since it requires the
> properties to hold for any thread in any interval, which is not
> feasible. [...]

yes, it's a pretty strong definition, but also note that while it is
definitely not easy to implement, the solution is nevertheless feasible in
my opinion and there exists a scheduler that implements it: CFS.

Ingo
Re: [REPORT] cfs-v4 vs sd-0.44
* Ray Lee <[EMAIL PROTECTED]> wrote:
> It would seem like there should be a penalty associated with sending
> those points as well, so that two processes communicating quickly with
> each other won't get into a mutual love-fest that'll capture the
> scheduler's attention.

it's not really "points", but "nanoseconds you are allowed to execute on the
CPU". And thus two processes communicating with each other quickly and
sending around this resource do get the attention of CFS: the resource is
gradually consumed because the two processes are running on the CPU while
they are communicating with each other. So it all works out fine.

Ingo
Re: [REPORT] cfs-v4 vs sd-0.44
* Rogan Dawes <[EMAIL PROTECTED]> wrote:
> My concern was that since Ingo said that this is a closed economy,
> with a fixed sum/total, if we lose a nanosecond here and there,
> eventually we'll lose them all.

it's not a closed economy - the CPU constantly produces a resource: "CPU
cycles to be spent", and tasks constantly consume that resource. So in that
sense small inaccuracies are not a huge issue. But you are correct that each
and every such inaccuracy has to be justified. For example, larger
inaccuracies on the order of SCHED_LOAD_SCALE are a problem because they can
indeed sum up, and i fixed up a couple of such inaccuracies in -v6-to-be.

Ingo
Re: [REPORT] cfs-v4 vs sd-0.44
* Pavel Machek <[EMAIL PROTECTED]> wrote:
> > it into some xorg.conf field. (It also makes sure that X isnt
> > preempted by other userspace stuff while it does timing-sensitive
> > operations like setting the video modes up or switching video modes,
> > etc.)
>
> X is privileged. It can just cli around the critical section.

yes, that is a tool that can be used too (and is used by most drivers) - my
point was rather that besides the disadvantages, not preempting X can be an
advantage too - not that there are no other (and often more suitable) tools
to do the same.

Ingo
Re: [REPORT] cfs-v4 vs sd-0.44
Hi!

> it into some xorg.conf field. (It also makes sure that X isnt preempted
> by other userspace stuff while it does timing-sensitive operations like
> setting the video modes up or switching video modes, etc.)

X is privileged. It can just cli around the critical section.

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Re: [REPORT] cfs-v4 vs sd-0.44
On Tue, Apr 24, 2007 at 06:22:53PM -0700, Li, Tong N wrote:
> The goal of a proportional-share scheduling algorithm is to minimize the
> above metrics. If the lag function is bounded by a constant for any
> thread in any time interval, then the algorithm is considered to be
> fair. You may notice that the second metric is actually weaker than the
> first. In fact, if an algorithm achieves a constant lag bound, it must
> also achieve a constant bound for the second metric, but the reverse is
> not necessarily true. But in some settings, people have focused on the
> second metric and still consider an algorithm to be fair as long as the
> second metric is bounded by a constant.

Using these metrics it is possible to write benchmarks quantifying fairness
as a performance metric, provided weights for nice numbers. Not so
coincidentally, this also entails a test of whether nice numbers are working
as intended.

-- wli

P.S. Divide by the length of the time interval to rephrase in terms of CPU
bandwidth.
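[Editor's sketch, not part of the original mail.] A benchmark along the lines
wli suggests could measure per-task CPU time over an interval, compare each
task against its ideal GPS share given the weights implied by nice levels,
and report the worst absolute lag; dividing by the interval length, as his
postscript notes, restates the result as a CPU-bandwidth error. Function
names here are invented for illustration.

```python
def max_abs_lag(cpu_time, weights):
    """Worst deviation of measured per-task CPU time from the ideal GPS
    share, assuming a work-conserving run (all interval time accounted)."""
    interval = sum(cpu_time.values())
    total_w = sum(weights.values())
    return max(abs(interval * weights[t] / total_w - got)
               for t, got in cpu_time.items())

def bandwidth_error(cpu_time, weights):
    # dividing lag by the interval length rephrases it as bandwidth error
    return max_abs_lag(cpu_time, weights) / sum(cpu_time.values())
```

E.g. two equal-weight tasks measured at 600 ms and 400 ms over one second
give a worst-case lag of 100 ms, i.e. a 10% bandwidth error.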
RE: [REPORT] cfs-v4 vs sd-0.44
> Could you explain for the audience the technical definition of fairness
> and what sorts of error metrics are commonly used? There seems to be
> some disagreement, and you're neutral enough of an observer that your
> statement would help.

The definition of proportional fairness assumes that each thread has a
weight, which, for example, can be specified by the user, or mapped from
thread priorities, nice values, etc. A scheduler achieves ideal proportional
fairness if (1) it is work-conserving, i.e., it never leaves a processor
idle if there are runnable threads, and (2) for any two threads, i and j, in
any time interval, the ratio of their CPU time is greater than or equal to
the ratio of their weights, assuming that thread i is continuously runnable
in the entire interval and both threads have fixed weights throughout the
interval. A corollary of this is that if both threads i and j are
continuously runnable with fixed weights in the time interval, then the
ratio of their CPU time should be equal to the ratio of their weights. This
definition is pretty restrictive since it requires the properties to hold
for any thread in any interval, which is not feasible. In practice, all
algorithms try to approximate this ideal scheduler (often referred to as
Generalized Processor Scheduling or GPS).

Two error metrics are often used:

(1) lag(t): for any interval [t1, t2], the lag of a thread at time
t \in [t1, t2] is S'(t1, t) - S(t1, t), where S' is the CPU time the thread
would receive in the interval [t1, t] under the ideal scheduler and S is the
actual CPU time it receives under the scheduler being evaluated.

(2) The second metric doesn't really have an agreed-upon name. Some call it
a fairness measure and some call it something else.
Anyway, unlike lag, which is a kind of absolute measure for one thread, this
metric (call it F) defines a relative measure between two threads over any
time interval: F(t1, t2) = S_i(t1, t2) / w_i - S_j(t1, t2) / w_j, where S_i
and S_j are the CPU time the two threads receive in the interval [t1, t2]
and w_i and w_j are their weights, assuming both weights don't change
throughout the interval.

The goal of a proportional-share scheduling algorithm is to minimize the
above metrics. If the lag function is bounded by a constant for any thread
in any time interval, then the algorithm is considered to be fair. You may
notice that the second metric is actually weaker than the first. In fact, if
an algorithm achieves a constant lag bound, it must also achieve a constant
bound for the second metric, but the reverse is not necessarily true. But in
some settings, people have focused on the second metric and still consider
an algorithm to be fair as long as the second metric is bounded by a
constant.

> On Mon, Apr 23, 2007 at 05:59:06PM -0700, Li, Tong N wrote:
> > I understand that via experiments we can show a design is reasonably
> > fair in the common case, but IMHO, to claim that a design is fair, there
> > needs to be some kind of formal analysis on the fairness bound, and this
> > bound should be proven to be constant. Even if the bound is not
> > constant, at least this analysis can help us better understand and
> > predict the degree of fairness that users would experience (e.g., would
> > the system be less fair if the number of threads increases? What happens
> > if a large number of threads dynamically join and leave the system?).
>
> Carrying out this sort of analysis on various policies would help, but
> I'd expect most of them to be difficult to analyze. cfs' current
> ->fair_key computation should be simple enough to analyze, at least
> ignoring nice numbers, though I've done nothing rigorous in this area.
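[Editor's worked example, not part of the original mail.] The two metrics can
be made concrete with a toy interval. Take two equal-weight threads over
[0, 10] ms, where thread i actually ran 6 ms and thread j ran 4 ms; under GPS
each would accrue service at rate 0.5.

```python
def lag(ideal_rate, actual_service, t1, t):
    # lag(t) = S'(t1, t) - S(t1, t), with S' linear at ideal_rate under GPS
    return ideal_rate * (t - t1) - actual_service

def fairness_F(s_i, w_i, s_j, w_j):
    # F(t1, t2) = S_i(t1, t2)/w_i - S_j(t1, t2)/w_j
    return s_i / w_i - s_j / w_j
```

Here thread i's lag at t = 10 is 0.5 * 10 - 6 = -1 (it ran 1 ms ahead of its
ideal share), and F over the interval is 6/1 - 4/1 = 2.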
If we can derive some invariants from the algorithm, it'd help the analysis.
An example is the deficit round-robin (DRR) algorithm in networking. Its
analysis utilizes the fact that the round each flow (in this case, it'd be a
thread) goes through in any time interval differs by at most one.

Hope you didn't get bored by all of this. :)

tong
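[Editor's sketch, not part of the original mail.] A minimal DRR model for
always-backlogged tasks shows the kind of invariant Tong mentions: per round
a task's deficit grows by weight * base_quantum, it serves fixed-cost work
units while the deficit covers them, and the leftover carries over but never
reaches a full unit cost. The interface here is invented for illustration.

```python
def drr_service(tasks, base_quantum, nrounds):
    """Deficit round-robin sketch. tasks maps name -> (weight, unit_cost);
    every task is assumed backlogged for the whole run."""
    deficit = {t: 0 for t in tasks}
    service = {t: 0 for t in tasks}
    for _ in range(nrounds):
        for t, (w, cost) in tasks.items():
            deficit[t] += w * base_quantum   # this round's allowance
            while deficit[t] >= cost:        # serve whole units only
                deficit[t] -= cost
                service[t] += cost
    return service, deficit
```

After any number of rounds every deficit is strictly below its unit cost, and
service converges on the weight ratio (2:1 below).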
Re: [REPORT] cfs-v4 vs sd-0.44
On Tuesday 24 April 2007, Willy Tarreau wrote:
>On Tue, Apr 24, 2007 at 10:38:32AM -0400, Gene Heskett wrote:
>> On Tuesday 24 April 2007, Ingo Molnar wrote:
>> >* David Lang <[EMAIL PROTECTED]> wrote:
>> >> > (Btw., to protect against such mishaps in the future i have changed
>> >> > the SysRq-N [SysRq-Nice] implementation in my tree to not only
>> >> > change real-time tasks to SCHED_OTHER, but to also renice negative
>> >> > nice levels back to 0 - this will show up in -v6. That way you'd
>> >> > only have had to hit SysRq-N to get the system out of the wedge.)
>> >>
>> >> if you are trying to unwedge a system it may be a good idea to renice
>> >> all tasks to 0, it could be that a task at +19 is holding a lock that
>> >> something else is waiting for.
>> >
>> >Yeah, that's possible too, but +19 tasks are getting a small but
>> >guaranteed share of the CPU so eventually it ought to release it. It's
>> >still a possibility, but i think i'll wait for a specific incident to
>> >happen first, and then react to that incident :-)
>> >
>> >Ingo
>>
>> In the instance I created, even the SysRq+B was ignored, and ISTR that's
>> supposed to initiate a reboot, is it not? So it was well and truly wedged.
>
>On many machines I use this on, I have to release Alt while still holding B.
>Don't know why, but it works like this.
>
>Willy

Yeah, Willy, and pardon a slight bit of sarcasm here, but that's how we get
the reputation for needing virgins to sacrifice; regular experienced girls
just wouldn't do. This isn't APL running on an IBM 5120, so it should Just
Work(TM) and not need a seance or something to conjure up the right spell.
Besides, the reset button is only about 6 feet away... I get some exercise
that way by getting up to push it. :)

--
Cheers, Gene
"There are four boxes to be used in defense of liberty: soap, ballot, jury,
and ammo. Please use in that order." -Ed Howdershelt (Author)
It is so soon that I am done for, I wonder what I was begun for.
-- Epitaph, Cheltenham Churchyard
Re: [REPORT] cfs-v4 vs sd-0.44
Rogan Dawes wrote:
> Chris Friesen wrote:
>> Rogan Dawes wrote:
>>> I guess my point was if we somehow get to an odd number of nanoseconds,
>>> we'd end up with rounding errors. I'm not sure if your algorithm will
>>> ever allow that.
>>
>> And Ingo's point was that when it takes thousands of nanoseconds for a
>> single context switch, an error of half a nanosecond is down in the
>> noise.
>>
>> Chris
>
> My concern was that since Ingo said that this is a closed economy, with a
> fixed sum/total, if we lose a nanosecond here and there, eventually we'll
> lose them all.
>
> Some folks have uptimes of multiple years.
>
> Of course, I could (very likely!) be full of it! ;-)

And they won't be using any new scheduler on those computers anyhow, as
that would involve bringing the system down to install the new kernel. :-)

Peter
--
Peter Williams [EMAIL PROTECTED]
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
Re: [REPORT] cfs-v4 vs sd-0.44
In article <[EMAIL PROTECTED]> you wrote:
> Could you explain for the audience the technical definition of fairness
> and what sorts of error metrics are commonly used? There seems to be
> some disagreement, and you're neutral enough of an observer that your
> statement would help.

And while we are at it, why it is a good thing. I could understand that fair
means no misbehaving (intentionally or unintentionally) application can harm
the rest of the system. However, a responsive desktop might not necessarily
be very fair to compute jobs.

Even a simple thing such as "who gets accounted" can be quite different in
different workloads. (Larger multi-user systems tend to be fair based on the
user; on servers you balance more by thread or job; and single-user systems
should be as unfair as the user wants them, as long as no process can "run
away".)

Gruss
Bernd
Re: [REPORT] cfs-v4 vs sd-0.44
On Mon, Apr 23, 2007 at 05:59:06PM -0700, Li, Tong N wrote:
> I don't know if we've discussed this or not. Since both CFS and SD claim
> to be fair, I'd like to hear more opinions on the fairness aspect of
> these designs. In areas such as OS, networking, and real-time, fairness,
> and its more general form, proportional fairness, are well-defined
> terms. In fact, perfect fairness is not feasible since it requires all
> runnable threads to be running simultaneously and scheduled with
> infinitesimally small quanta (like a fluid system). So to evaluate if a
> new scheduling algorithm is fair, the common approach is to take the
> ideal fair algorithm (often referred to as Generalized Processor
> Scheduling or GPS) as a reference model and analyze if the new algorithm
> can achieve a constant error bound (different error metrics also exist).

Could you explain for the audience the technical definition of fairness and
what sorts of error metrics are commonly used? There seems to be some
disagreement, and you're neutral enough of an observer that your statement
would help.

On Mon, Apr 23, 2007 at 05:59:06PM -0700, Li, Tong N wrote:
> I understand that via experiments we can show a design is reasonably
> fair in the common case, but IMHO, to claim that a design is fair, there
> needs to be some kind of formal analysis on the fairness bound, and this
> bound should be proven to be constant. Even if the bound is not
> constant, at least this analysis can help us better understand and
> predict the degree of fairness that users would experience (e.g., would
> the system be less fair if the number of threads increases? What happens
> if a large number of threads dynamically join and leave the system?).

Carrying out this sort of analysis on various policies would help, but I'd
expect most of them to be difficult to analyze. cfs' current ->fair_key
computation should be simple enough to analyze, at least ignoring nice
numbers, though I've done nothing rigorous in this area.
-- wli - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
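The GPS reference model Tong describes can be made concrete with a small sketch: under GPS each runnable task continuously receives CPU time in proportion to its weight, and a practical scheduler's fairness is judged by bounding each task's lag (ideal service minus actual service). The following toy is purely illustrative; the function names are invented here and this is not code from CFS, SD, or any real scheduler:

```c
/* Toy illustration of the GPS (Generalized Processor Sharing) reference
 * model discussed above.  All names are invented for this sketch.
 * Times are in nanoseconds. */

/* Ideal service: over an interval, a task with weight w out of a total
 * runnable weight W receives interval * w / W of the CPU. */
long long gps_service(long long interval_ns, int weight, int total_weight)
{
    return interval_ns * weight / total_weight;
}

/* Lag: ideal service minus actual service.  A scheduler with a constant
 * lag bound keeps this small for every task no matter how many tasks
 * are runnable; an O(N) negative lag bound means some task's actual
 * service can exceed the ideal by an amount that grows with N. */
long long task_lag(long long interval_ns, int weight, int total_weight,
                   long long actual_ns)
{
    return gps_service(interval_ns, weight, total_weight) - actual_ns;
}
```

With three equally weighted tasks over 3 ms, each task's ideal share is 1 ms; a task that actually ran 0.9 ms has a positive lag of 0.1 ms.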
Re: [REPORT] cfs-v4 vs sd-0.44
On Mon, 2007-04-23 at 18:57 -0700, Bill Huey wrote: > On Mon, Apr 23, 2007 at 05:59:06PM -0700, Li, Tong N wrote: > > I don't know if we've discussed this or not. Since both CFS and SD claim > > to be fair, I'd like to hear more opinions on the fairness aspect of > > these designs. In areas such as OS, networking, and real-time, fairness, > > and its more general form, proportional fairness, are well-defined > > terms. In fact, perfect fairness is not feasible since it requires all > > runnable threads to be running simultaneously and scheduled with > > infinitesimally small quanta (like a fluid system). So to evaluate if a > > Unfortunately, fairness is rather non-formal in this context and probably > isn't strictly desirable given how hackish much of Linux userspace is. Until > there's a method of doing directed yields, like what Will has prescribed (a > kind of allotment to a thread doing work for another), a completely strict > mechanism is probably problematic with regard to corner cases. > > X for example is largely non-thread-safe. Until they can get their xcb > framework in place and additional thread infrastructure to do hand-off > properly, it's going to be difficult to schedule for it. It's well known to > be problematic. I agree. I just think calling the designs "perfectly" or "completely" fair is too strong. It might cause unnecessary confusion that overshadows the actual merits of these designs. If we were to evaluate specifically the fairness aspect of a design, then I'd suggest defining it more formally. > You announced your scheduler without CCing any of the relevant people here > (and risk being completely ignored in lkml traffic): > > http://lkml.org/lkml/2007/4/20/286 > > What is your opinion of both CFS and SD? How can your work be useful > to either scheduler mentioned, or to the Linux kernel on its own? I like SD for its simplicity. My concern with CFS is the RB tree structure. Log(n) seems high to me given the fact that we had an O(1) scheduler.
Many algorithms achieve strong fairness guarantees at the cost of log(n) time. Thus, I tend to think, if log(n) is acceptable, we might also want to look at other algorithms (e.g., start-time first) with better fairness properties and see if they could be extended to be general purpose. > > I understand that via experiments we can show a design is reasonably > > fair in the common case, but IMHO, to claim that a design is fair, there > > needs to be some kind of formal analysis on the fairness bound, and this > > bound should be proven to be constant. Even if the bound is not > > constant, at least this analysis can help us better understand and > > predict the degree of fairness that users would experience (e.g., would > > the system be less fair if the number of threads increases? What happens > > if a large number of threads dynamically join and leave the system?). > > Will has been thinking about this, but you have to also consider the > practicalities of your approach versus Con's and Ingo's. I consider my work an approach to extending an existing scheduler to support proportional fairness. I see that many proportional-share designs lack things such as the good interactive support that Linux does well. This is why I designed mine on top of the existing scheduler, so that it can leverage things such as dynamic priorities. Regardless of the underlying scheduler, SD or CFS, I think the algorithm I used would still apply, and thus we can extend the scheduler similarly. > I'm all for things like proportional scheduling and the extensions > needed to do it properly. It would be highly relevant to some version > of the -rt patch if not that patch directly. I'd love it to be considered for part of the -rt patch. I'm new to this, so would you please let me know what to do?
Thanks, tong
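For readers unfamiliar with the start-time-first family Tong mentions, the core idea fits in a few lines: each task keeps a virtual clock that advances inversely to its weight, and the scheduler always runs the task whose clock is furthest behind. The sketch below is an invented toy (none of these names come from CFS or SD), showing only the ordering rule, not the O(log n) data structure a real implementation would use:

```c
#include <stddef.h>

/* Hypothetical "start-time first" ordering sketch.  Each task's virtual
 * clock advances by (service / weight), so heavier tasks fall behind
 * more slowly and therefore get picked more often. */
struct vtask {
    long long vstart;   /* virtual start time */
    int weight;         /* proportional share */
};

/* Charge 'ran_ns' of actual CPU time to a task, advancing its virtual
 * clock inversely to its weight. */
void account(struct vtask *t, long long ran_ns)
{
    t->vstart += ran_ns / t->weight;
}

/* Pick the runnable task with the smallest virtual start time.  A real
 * scheduler would keep tasks in a balanced tree or heap instead of
 * scanning an array. */
struct vtask *pick_next(struct vtask *tasks, int n)
{
    struct vtask *best = &tasks[0];
    for (int i = 1; i < n; i++)
        if (tasks[i].vstart < best->vstart)
            best = &tasks[i];
    return best;
}
```

After a weight-1 task runs for 100 ns its clock jumps by 100, while a weight-2 task running the same amount advances only 50, so the heavier task is chosen again sooner.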
Re: [REPORT] cfs-v4 vs sd-0.44
On Tue, Apr 24, 2007 at 10:38:32AM -0400, Gene Heskett wrote: > On Tuesday 24 April 2007, Ingo Molnar wrote: > >* David Lang <[EMAIL PROTECTED]> wrote: > >> > (Btw., to protect against such mishaps in the future i have changed > >> > the SysRq-N [SysRq-Nice] implementation in my tree to not only > >> > change real-time tasks to SCHED_OTHER, but to also renice negative > >> > nice levels back to 0 - this will show up in -v6. That way you'd > >> > only have had to hit SysRq-N to get the system out of the wedge.) > >> > >> if you are trying to unwedge a system it may be a good idea to renice > >> all tasks to 0, it could be that a task at +19 is holding a lock that > >> something else is waiting for. > > > >Yeah, that's possible too, but +19 tasks are getting a small but > >guaranteed share of the CPU so eventually it ought to release it. It's > >still a possibility, but i think i'll wait for a specific incident to > >happen first, and then react to that incident :-) > > > > Ingo > > In the instance I created, even the SysRq+b was ignored, and ISTR that's > supposed to initiate a reboot, is it not? So it was well and truly wedged. On many machines I use this on, I have to release Alt while still holding B. Don't know why, but it works like this. Willy
Re: [REPORT] cfs-v4 vs sd-0.44
Rogan Dawes wrote: My concern was that since Ingo said that this is a closed economy, with a fixed sum/total, if we lose a nanosecond here and there, eventually we'll lose them all. I assume Ingo has set it up so that the system doesn't "lose" partial nanoseconds, but rather they'd just be accounted to the wrong task. Chris
Re: [REPORT] cfs-v4 vs sd-0.44
On 4/23/07, Linus Torvalds <[EMAIL PROTECTED]> wrote: On Mon, 23 Apr 2007, Ingo Molnar wrote: > > The "give scheduler money" transaction can be both an "implicit > transaction" (for example when writing to UNIX domain sockets or > blocking on a pipe, etc.), or it could be an "explicit transaction": > sched_yield_to(). This latter i've already implemented for CFS, but it's > much less useful than the really significant implicit ones, the ones > which will help X. Yes. It would be wonderful to get it working automatically, so please say something about the implementation.. The "perfect" situation would be that when somebody goes to sleep, any extra points it had could be given to whoever it woke up last. Note that for something like X, it means that the points are 100% ephemeral: it gets points when a client sends it a request, but it would *lose* the points again when it sends the reply! It would seem like there should be a penalty associated with sending those points as well, so that two processes communicating quickly with each other won't get into a mutual love-fest that'll capture the scheduler's attention. Ray
Re: [REPORT] cfs-v4 vs sd-0.44
Chris Friesen wrote: Rogan Dawes wrote: I guess my point was if we somehow get to an odd number of nanoseconds, we'd end up with rounding errors. I'm not sure if your algorithm will ever allow that. And Ingo's point was that when it takes thousands of nanoseconds for a single context switch, an error of half a nanosecond is down in the noise. Chris My concern was that since Ingo said that this is a closed economy, with a fixed sum/total, if we lose a nanosecond here and there, eventually we'll lose them all. Some folks have uptimes of multiple years. Of course, I could (very likely!) be full of it! ;-) Rogan
Re: [REPORT] cfs-v4 vs sd-0.44
Rogan Dawes wrote: I guess my point was if we somehow get to an odd number of nanoseconds, we'd end up with rounding errors. I'm not sure if your algorithm will ever allow that. And Ingo's point was that when it takes thousands of nanoseconds for a single context switch, an error of half a nanosecond is down in the noise. Chris
Re: [REPORT] cfs-v4 vs sd-0.44
On Tuesday 24 April 2007, Ingo Molnar wrote: >* Ingo Molnar <[EMAIL PROTECTED]> wrote: >> yeah, i guess this has little to do with X. I think in your scenario >> it might have been smarter to either stop, or to renice the workloads >> that took away CPU power from others to _positive_ nice levels. >> Negative nice levels can indeed be dangerous. > >btw., was X itself at nice 0 or nice -10 when the lockup happened? > > Ingo Memory could be fuzzy, Ingo, but I think it was at 0 at the time. -- Cheers, Gene "There are four boxes to be used in defense of liberty: soap, ballot, jury, and ammo. Please use in that order." -Ed Howdershelt (Author) I know it all. I just can't remember it all at once.
Re: [REPORT] cfs-v4 vs sd-0.44
On Tuesday 24 April 2007, Ingo Molnar wrote: >* David Lang <[EMAIL PROTECTED]> wrote: >> > (Btw., to protect against such mishaps in the future i have changed >> > the SysRq-N [SysRq-Nice] implementation in my tree to not only >> > change real-time tasks to SCHED_OTHER, but to also renice negative >> > nice levels back to 0 - this will show up in -v6. That way you'd >> > only have had to hit SysRq-N to get the system out of the wedge.) >> >> if you are trying to unwedge a system it may be a good idea to renice >> all tasks to 0, it could be that a task at +19 is holding a lock that >> something else is waiting for. > >Yeah, that's possible too, but +19 tasks are getting a small but >guaranteed share of the CPU so eventually it ought to release it. It's >still a possibility, but i think i'll wait for a specific incident to >happen first, and then react to that incident :-) > > Ingo In the instance I created, even the SysRq+b was ignored, and ISTR that's supposed to initiate a reboot, is it not? So it was well and truly wedged. -- Cheers, Gene "There are four boxes to be used in defense of liberty: soap, ballot, jury, and ammo. Please use in that order." -Ed Howdershelt (Author) I use technology in order to hate it more properly. -- Nam June Paik
Re: [REPORT] cfs-v4 vs sd-0.44
On Tuesday 24 April 2007, Ingo Molnar wrote: >* Gene Heskett <[EMAIL PROTECTED]> wrote: >> > (Btw., to protect against such mishaps in the future i have changed >> > the SysRq-N [SysRq-Nice] implementation in my tree to not only >> > change real-time tasks to SCHED_OTHER, but to also renice negative >> > nice levels back to 0 - this will show up in -v6. That way you'd >> > only have had to hit SysRq-N to get the system out of the wedge.) >> >> That sounds handy, particularly with idiots like me at the wheel... > >by that standard i guess we tinkerers are all idiots ;) > > Ingo Eiyyyup! -- Cheers, Gene "There are four boxes to be used in defense of liberty: soap, ballot, jury, and ammo. Please use in that order." -Ed Howdershelt (Author) Man's horizons are bounded by his vision.
Re: [REPORT] cfs-v4 vs sd-0.44
Ingo Molnar wrote: * Rogan Dawes <[EMAIL PROTECTED]> wrote: if (p_to && p->wait_runtime > 0) { p->wait_runtime >>= 1; p_to->wait_runtime += p->wait_runtime; } the above is the basic expression of: "charge a positive bank balance". [..] [note, due to the nanoseconds unit there's no rounding loss to worry about.] Surely if you divide 5 nanoseconds by 2, you'll get a rounding loss? yes. But note that we'll only truly have to worry about that when we have context-switching performance in that range - currently it's at least 2-3 orders of magnitude above that. Microseconds seemed to me to be too coarse already, that's why i picked nanoseconds and 64-bit arithmetic for CFS. Ingo I guess my point was if we somehow get to an odd number of nanoseconds, we'd end up with rounding errors. I'm not sure if your algorithm will ever allow that. Rogan
Re: [REPORT] cfs-v4 vs sd-0.44
* Ingo Molnar <[EMAIL PROTECTED]> wrote: > [...] That way you'd only have had to hit SysRq-N to get the system > out of the wedge.) small correction: Alt-SysRq-N. Ingo
Re: [REPORT] cfs-v4 vs sd-0.44
* Rogan Dawes <[EMAIL PROTECTED]> wrote: > >if (p_to && p->wait_runtime > 0) { > >p->wait_runtime >>= 1; > >p_to->wait_runtime += p->wait_runtime; > >} > > > >the above is the basic expression of: "charge a positive bank balance". > > > > [..] > > > [note, due to the nanoseconds unit there's no rounding loss to worry > > about.] > > Surely if you divide 5 nanoseconds by 2, you'll get a rounding loss? yes. But note that we'll only truly have to worry about that when we have context-switching performance in that range - currently it's at least 2-3 orders of magnitude above that. Microseconds seemed to me to be too coarse already, that's why i picked nanoseconds and 64-bit arithmetic for CFS. Ingo
Re: [REPORT] cfs-v4 vs sd-0.44
* David Lang <[EMAIL PROTECTED]> wrote: > > (Btw., to protect against such mishaps in the future i have changed > > the SysRq-N [SysRq-Nice] implementation in my tree to not only > > change real-time tasks to SCHED_OTHER, but to also renice negative > > nice levels back to 0 - this will show up in -v6. That way you'd > > only have had to hit SysRq-N to get the system out of the wedge.) > > if you are trying to unwedge a system it may be a good idea to renice > all tasks to 0, it could be that a task at +19 is holding a lock that > something else is waiting for. Yeah, that's possible too, but +19 tasks are getting a small but guaranteed share of the CPU so eventually it ought to release it. It's still a possibility, but i think i'll wait for a specific incident to happen first, and then react to that incident :-) Ingo
Re: [REPORT] cfs-v4 vs sd-0.44
* Ingo Molnar <[EMAIL PROTECTED]> wrote: > yeah, i guess this has little to do with X. I think in your scenario > it might have been smarter to either stop, or to renice the workloads > that took away CPU power from others to _positive_ nice levels. > Negative nice levels can indeed be dangerous. btw., was X itself at nice 0 or nice -10 when the lockup happened? Ingo
Re: [REPORT] cfs-v4 vs sd-0.44
On Tue, 24 Apr 2007, Ingo Molnar wrote: * Gene Heskett <[EMAIL PROTECTED]> wrote: Gene has done some testing under CFS with X reniced to +10 and the desktop still worked smoothly for him. As a data point here, and probably nothing to do with X, but I did manage to lock it up, solid, reset button time tonight, by wanting 'smart' to get done with an update session after amanda had started. I took both smart processes I could see in htop all the way to -19, but when it was about done about 3 minutes later, everything came to an instant, frozen, reset button required lockup. I should have stopped at -17 I guess. :( yeah, i guess this has little to do with X. I think in your scenario it might have been smarter to either stop, or to renice the workloads that took away CPU power from others to _positive_ nice levels. Negative nice levels can indeed be dangerous. (Btw., to protect against such mishaps in the future i have changed the SysRq-N [SysRq-Nice] implementation in my tree to not only change real-time tasks to SCHED_OTHER, but to also renice negative nice levels back to 0 - this will show up in -v6. That way you'd only have had to hit SysRq-N to get the system out of the wedge.) if you are trying to unwedge a system it may be a good idea to renice all tasks to 0, it could be that a task at +19 is holding a lock that something else is waiting for. David Lang
Re: [REPORT] cfs-v4 vs sd-0.44
* Gene Heskett <[EMAIL PROTECTED]> wrote: > > (Btw., to protect against such mishaps in the future i have changed > > the SysRq-N [SysRq-Nice] implementation in my tree to not only > > change real-time tasks to SCHED_OTHER, but to also renice negative > > nice levels back to 0 - this will show up in -v6. That way you'd > > only have had to hit SysRq-N to get the system out of the wedge.) > > That sounds handy, particularly with idiots like me at the wheel... by that standard i guess we tinkerers are all idiots ;) Ingo
Re: [REPORT] cfs-v4 vs sd-0.44
On Tuesday 24 April 2007, Ingo Molnar wrote: >* Gene Heskett <[EMAIL PROTECTED]> wrote: >> > Gene has done some testing under CFS with X reniced to +10 and the >> > desktop still worked smoothly for him. >> >> As a data point here, and probably nothing to do with X, but I did >> manage to lock it up, solid, reset button time tonight, by wanting >> 'smart' to get done with an update session after amanda had started. >> I took both smart processes I could see in htop all the way to -19, >> but when it was about done about 3 minutes later, everything came to >> an instant, frozen, reset button required lockup. I should have >> stopped at -17 I guess. :( > >yeah, i guess this has little to do with X. I think in your scenario it >might have been smarter to either stop, or to renice the workloads that >took away CPU power from others to _positive_ nice levels. Negative nice >levels can indeed be dangerous. > >(Btw., to protect against such mishaps in the future i have changed the >SysRq-N [SysRq-Nice] implementation in my tree to not only change >real-time tasks to SCHED_OTHER, but to also renice negative nice levels >back to 0 - this will show up in -v6. That way you'd only have had to >hit SysRq-N to get the system out of the wedge.) > > Ingo That sounds handy, particularly with idiots like me at the wheel... -- Cheers, Gene "There are four boxes to be used in defense of liberty: soap, ballot, jury, and ammo. Please use in that order." -Ed Howdershelt (Author) When a Banker jumps out of a window, jump after him--that's where the money is. -- Robespierre
Re: [REPORT] cfs-v4 vs sd-0.44
* Gene Heskett <[EMAIL PROTECTED]> wrote: > > Gene has done some testing under CFS with X reniced to +10 and the > > desktop still worked smoothly for him. > > As a data point here, and probably nothing to do with X, but I did > manage to lock it up, solid, reset button time tonight, by wanting > 'smart' to get done with an update session after amanda had started. > I took both smart processes I could see in htop all the way to -19, > but when it was about done about 3 minutes later, everything came to > an instant, frozen, reset button required lockup. I should have > stopped at -17 I guess. :( yeah, i guess this has little to do with X. I think in your scenario it might have been smarter to either stop, or to renice the workloads that took away CPU power from others to _positive_ nice levels. Negative nice levels can indeed be dangerous. (Btw., to protect against such mishaps in the future i have changed the SysRq-N [SysRq-Nice] implementation in my tree to not only change real-time tasks to SCHED_OTHER, but to also renice negative nice levels back to 0 - this will show up in -v6. That way you'd only have had to hit SysRq-N to get the system out of the wedge.) Ingo
Re: [REPORT] cfs-v4 vs sd-0.44
Ingo Molnar wrote:

	static void yield_task_fair(struct rq *rq, struct task_struct *p,
				    struct task_struct *p_to)
	{
		struct rb_node *curr, *next, *first;
		struct task_struct *p_next;

		/*
		 * yield-to support: if we are on the same runqueue then
		 * give half of our wait_runtime (if it's positive) to
		 * the other task:
		 */
		if (p_to && p->wait_runtime > 0) {
			p->wait_runtime >>= 1;
			p_to->wait_runtime += p->wait_runtime;
		}

the above is the basic expression of: "charge a positive bank balance". [..] [note, due to the nanoseconds unit there's no rounding loss to worry about.] Surely if you divide 5 nanoseconds by 2, you'll get a rounding loss? Ingo Rogan
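Rogan's question is easy to check by hand: `>>= 1` is integer halving, so when `wait_runtime` is odd the low bit is dropped and one nanosecond vanishes from the combined balance of the two tasks. A standalone toy version of the transfer (invented names, not the kernel code) makes the loss visible:

```c
/* Toy model of the yield-to transfer quoted above, to show the
 * rounding behaviour Rogan asks about.  Not the CFS implementation. */
struct toy_task { long long wait_runtime; };

/* Perform the halving transfer and return how many nanoseconds were
 * lost to rounding (0 for an even balance, 1 for an odd one). */
long long toy_yield_to(struct toy_task *p, struct toy_task *p_to)
{
    long long before = p->wait_runtime + p_to->wait_runtime;

    if (p->wait_runtime > 0) {
        p->wait_runtime >>= 1;               /* integer halving drops the low bit */
        p_to->wait_runtime += p->wait_runtime;
    }
    return before - (p->wait_runtime + p_to->wait_runtime);
}
```

Starting from 5 ns, the giver keeps 2 and the receiver gains 2, so 1 ns disappears; with 4 ns nothing is lost. This confirms both Rogan's observation and Ingo's point that the loss is at most 1 ns per transfer, which is negligible next to context-switch costs.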
Re: [REPORT] cfs-v4 vs sd-0.44
On Tuesday 24 April 2007, Ingo Molnar wrote: >* Peter Williams <[EMAIL PROTECTED]> wrote: >> > The cases are fundamentally different in behavior, because in the >> > first case, X hardly consumes the time it would get in any scheme, >> > while in the second case X really is CPU bound and will happily >> > consume any CPU time it can get. >> >> Which still doesn't justify an elaborate "points" sharing scheme. >> Whichever way you look at that that's just another way of giving X >> more CPU bandwidth and there are simpler ways to give X more CPU if it >> needs it. However, I think there's something seriously wrong if it >> needs the -19 nice that I've heard mentioned. > >Gene has done some testing under CFS with X reniced to +10 and the >desktop still worked smoothly for him. As a data point here, and probably nothing to do with X, but I did manage to lock it up, solid, reset button time tonight, by wanting 'smart' to get done with an update session after amanda had started. I took both smart processes I could see in htop all the way to -19, but when it was about done about 3 minutes later, everything came to an instant, frozen, reset button required lockup. I should have stopped at -17 I guess. :( >So CFS does not 'need' a reniced >X. There are simply advantages to negative nice levels: for example >screen refreshes are smoother on any scheduler i tried. BUT, there is a >caveat: on non-CFS schedulers i tried X is much more prone to get into >'overscheduling' scenarios that visibly hurt X's performance, while on >CFS there's a max of 1000-1500 context switches a second at nice -10. >(which, considering the cost of a context switch is well under 1% >overhead.) > >So, my point is, the nice level of X for desktop users should not be set >lower than a low limit suggested by that particular scheduler's author. >That limit is scheduler-specific. 
Con i think recommends a nice level of >-1 for X when using SD [Con, can you confirm?], while my tests show that >if you want you can go as low as -10 under CFS, without any bad >side-effects. (-19 was a bit too much) > >> [...] You might as well just run it as a real time process. > >hm, that would be a bad idea under any scheduler (including CFS), >because real time processes can starve other processes indefinitely. > > Ingo -- Cheers, Gene "There are four boxes to be used in defense of liberty: soap, ballot, jury, and ammo. Please use in that order." -Ed Howdershelt (Author) I have discovered that all human evil comes from this, man's being unable to sit still in a room. -- Blaise Pascal
Re: [REPORT] cfs-v4 vs sd-0.44
* Peter Williams <[EMAIL PROTECTED]> wrote: > > The cases are fundamentally different in behavior, because in the > > first case, X hardly consumes the time it would get in any scheme, > > while in the second case X really is CPU bound and will happily > > consume any CPU time it can get. > > Which still doesn't justify an elaborate "points" sharing scheme. > Whichever way you look at that that's just another way of giving X > more CPU bandwidth and there are simpler ways to give X more CPU if it > needs it. However, I think there's something seriously wrong if it > needs the -19 nice that I've heard mentioned. Gene has done some testing under CFS with X reniced to +10 and the desktop still worked smoothly for him. So CFS does not 'need' a reniced X. There are simply advantages to negative nice levels: for example screen refreshes are smoother on any scheduler i tried. BUT, there is a caveat: on non-CFS schedulers i tried X is much more prone to get into 'overscheduling' scenarios that visibly hurt X's performance, while on CFS there's a max of 1000-1500 context switches a second at nice -10. (which, considering the cost of a context switch is well under 1% overhead.) So, my point is, the nice level of X for desktop users should not be set lower than a low limit suggested by that particular scheduler's author. That limit is scheduler-specific. Con i think recommends a nice level of -1 for X when using SD [Con, can you confirm?], while my tests show that if you want you can go as low as -10 under CFS, without any bad side-effects. (-19 was a bit too much) > [...] You might as well just run it as a real time process. hm, that would be a bad idea under any scheduler (including CFS), because real time processes can starve other processes indefinitely. 
Ingo
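Ingo's "well under 1% overhead" figure is easy to sanity-check with a back-of-envelope calculation. The per-switch cost below is my assumption (a few microseconds is a plausible order of magnitude for the era; the real figure varies by hardware), not a number from the thread:

```c
/* Back-of-envelope check of the context-switch overhead claim above.
 * cost_us (microseconds per switch) is an assumed figure. */
double switch_overhead_pct(long switches_per_sec, double cost_us)
{
    /* microseconds spent context-switching per second of wall time */
    double busy_us = (double)switches_per_sec * cost_us;
    return busy_us / 1e6 * 100.0;   /* as a percentage of one second */
}
```

At the quoted 1500 switches/s, even an assumed 5 us per switch occupies only about 7.5 ms of each second, i.e. under 1%, consistent with the claim.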
Re: [REPORT] cfs-v4 vs sd-0.44
Arjan van de Ven wrote: Within reason, it's not the number of clients that X has that causes its CPU bandwidth use to skyrocket and cause problems. It's more to do with what type of clients they are. Most GUIs (even ones that are constantly updating visual data (e.g. gkrellm -- I can open quite a large number of these without increasing X's CPU usage very much)) cause very little load on the X server. The exceptions to this are the there is actually 2 and not just 1 "X server", and they are VERY VERY different in behavior. Case 1: Accelerated driver If X talks to a decent enough card that it supports well with acceleration, it will be very rare for X itself to spend any kind of significant amount of CPU time, all the really heavy stuff is done in hardware, and asynchronously at that. A bit of batching will greatly improve system performance in this case. Case 2: Unaccelerated VESA Some drivers in X, especially the VESA and NV drivers (which are quite common, vesa is used on all hardware without a special driver nowadays), have no or not enough acceleration to matter for modern desktops. This means the CPU is doing all the heavy lifting, in the X program. In this case even a simple "move the window a bit" becomes quite a bit of a CPU hog already. Mine's a: SiS 661/741/760 PCI/AGP or 662/761Gx PCIE VGA Display adapter according to X's display settings tool. Which category does that fall into? It's not a special adapter and is just the one that came with the motherboard. It doesn't use much CPU unless I grab a window and wiggle it all over the screen or do something like "ls -lR /" in an xterm. The cases are fundamentally different in behavior, because in the first case, X hardly consumes the time it would get in any scheme, while in the second case X really is CPU bound and will happily consume any CPU time it can get. Which still doesn't justify an elaborate "points" sharing scheme.
Whichever way you look at it, that's just another way of giving X more CPU bandwidth, and there are simpler ways to give X more CPU if it needs it. However, I think there's something seriously wrong if it needs the -19 nice that I've heard mentioned. You might as well just run it as a real time process. Peter -- Peter Williams [EMAIL PROTECTED] "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce
Re: [REPORT] cfs-v4 vs sd-0.44
> Within reason, it's not the number of clients that X has that causes its
> CPU bandwidth use to sky rocket and cause problems. It's more to do
> with what type of clients they are. Most GUIs (even ones that are
> constantly updating visual data (e.g. gkrellm -- I can open quite a
> large number of these without increasing X's CPU usage very much)) cause
> very little load on the X server. The exceptions to this are the

there are actually 2 and not just 1 "X server", and they are VERY VERY
different in behavior.

Case 1: Accelerated driver

If X talks to a decent enough card that it supports well with
acceleration, it will be very rare for X itself to spend any kind of
significant amount of CPU time; all the really heavy stuff is done in
hardware, and asynchronously at that. A bit of batching will greatly
improve system performance in this case.

Case 2: Unaccelerated VESA

Some drivers in X, especially the VESA and NV drivers (which are quite
common; vesa is used on all hardware without a special driver nowadays),
have no acceleration, or not enough to matter for modern desktops. This
means the CPU is doing all the heavy lifting, in the X program. In this
case even a simple "move the window a bit" becomes quite a bit of a CPU
hog already.

The cases are fundamentally different in behavior, because in the first
case, X hardly consumes the time it would get in any scheme, while in
the second case X really is CPU bound and will happily consume any CPU
time it can get.

--
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via
http://www.linuxfirmwarekit.org
Re: [REPORT] cfs-v4 vs sd-0.44
Linus Torvalds wrote:
> On Mon, 23 Apr 2007, Ingo Molnar wrote:
>> The "give scheduler money" transaction can be both an "implicit
>> transaction" (for example when writing to UNIX domain sockets or
>> blocking on a pipe, etc.), or it could be an "explicit transaction":
>> sched_yield_to(). This latter i've already implemented for CFS, but
>> it's much less useful than the really significant implicit ones, the
>> ones which will help X.
>
> Yes. It would be wonderful to get it working automatically, so please
> say something about the implementation..
>
> The "perfect" situation would be that when somebody goes to sleep, any
> extra points it had could be given to whoever it woke up last. Note
> that for something like X, it means that the points are 100% ephemeral:
> it gets points when a client sends it a request, but it would *lose*
> the points again when it sends the reply!
>
> So it would only accumulate "scheduling points" while multiple clients
> are actively waiting for it, which actually sounds like exactly the
> right thing. However, I don't really see how to do it well, especially
> since the kernel cannot actually match up the client that gave some
> scheduling points to the reply that X sends back.
>
> There are subtle semantics with these kinds of things: especially if
> the scheduling points are only awarded when a process goes to sleep, if
> X is busy and continues to use the CPU (for another client), it
> wouldn't give any scheduling points back to clients and they really do
> accumulate with the server. Which again sounds like it would be exactly
> the right thing (both in the sense that the server that runs more gets
> more points, but also in the sense that we *only* give points at actual
> scheduling events).
>
> But how do you actually *give/track* points? A simple "last woken up by
> this process" thing that triggers when it goes to sleep? It might work,
> but on the other hand, especially with more complex things (and
> networking tends to be pretty complex) the actual wakeup may be done by
> a software irq.
> Do we just say "it ran within the context of X, so we assume X was the
> one that caused it?" It probably would work, but we've generally tried
> very hard to avoid accessing "current" from interrupt context,
> including bh's.

Within reason, it's not the number of clients that X has that causes
its CPU bandwidth use to sky rocket and cause problems. It's more to
do with what type of clients they are. Most GUIs (even ones that are
constantly updating visual data (e.g. gkrellm -- I can open quite a
large number of these without increasing X's CPU usage very much))
cause very little load on the X server. The exceptions to this are the
various terminal emulators (e.g. xterm, gnome-terminal, etc.) when
being used to run output intensive command line programs, e.g. try
"ls -lR /" in an xterm.

The other way (that I've noticed) that X's CPU bandwidth usage sky
rockets is when you grab a large window and wiggle it about a lot, and
hopefully that doesn't happen often, so the problem that needs to be
addressed is the one caused by text output on xterm and its ilk.

So I think that an elaborate scheme for distributing "points" between X
and its clients would be overkill. A good scheduler will make sure
other tasks such as audio streamers get CPU when they need it with good
responsiveness, even when X takes off, by giving them higher priority
because their CPU bandwidth use is low.

The one problem that might still be apparent in these cases is the
mouse becoming jerky while X is working like crazy to spew out text too
fast for anyone to read. The only way to fix that is to give X more
bandwidth, but if it's already running at about 95% of a CPU that's
unlikely to help. To fix this you would probably need to modify X so
that it knows that re-rendering the cursor is more important than
rendering text in an xterm.
In normal circumstances, the re-rendering of the mouse happens quickly
enough for the user to experience good responsiveness, because X's
normal CPU use is low enough for it to be given high priority.

Just because the O(1) scheduler tried this model and failed doesn't
mean that the model is bad. O(1) was a flawed implementation of a good
model.

Peter

PS Doing a kernel build in an xterm isn't an example of high enough
output to cause a problem as (on my system) it only raises X's
consumption from 0-2% to 2-5%. The type of output that causes the
problem is usually flying past too fast to read.
--
Peter Williams [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce
Re: [REPORT] cfs-v4 vs sd-0.44
On Mon, Apr 23, 2007 at 05:59:06PM -0700, Li, Tong N wrote:
> I don't know if we've discussed this or not. Since both CFS and SD claim
> to be fair, I'd like to hear more opinions on the fairness aspect of
> these designs. In areas such as OS, networking, and real-time, fairness,
> and its more general form, proportional fairness, are well-defined
> terms. In fact, perfect fairness is not feasible since it requires all
> runnable threads to be running simultaneously and scheduled with
> infinitesimally small quanta (like a fluid system). So to evaluate if a

Unfortunately, fairness is rather non-formal in this context and
probably isn't strictly desirable given how hackish much of Linux
userspace is. Until there's a method of doing directed yields (like
what Will has prescribed: a kind of allotment to a thread doing work
for another), a completely strict mechanism is probably going to be
problematic with regard to corner cases.

X, for example, is largely non-thread-safe. Until they can get their
xcb framework in place, plus additional thread infrastructure to do
hand-off properly, it's going to be difficult to schedule for it. It's
well known to be problematic.

You announced your scheduler without CCing any of the relevant people
here (and risk being completely ignored in lkml traffic):

	http://lkml.org/lkml/2007/4/20/286

What is your opinion of both CFS and SD? How can your work be useful
to either scheduler mentioned, or to the Linux kernel on its own?

> I understand that via experiments we can show a design is reasonably
> fair in the common case, but IMHO, to claim that a design is fair, there
> needs to be some kind of formal analysis on the fairness bound, and this
> bound should be proven to be constant. Even if the bound is not
> constant, at least this analysis can help us better understand and
> predict the degree of fairness that users would experience (e.g., would
> the system be less fair if the number of threads increases?
> What happens if a large number of threads dynamically join and leave
> the system?).

Will has been thinking about this, but you have to also consider the
practicalities of your approach versus Con's and Ingo's. I'm all for
things like proportional scheduling and the extensions needed to do it
properly. It would be highly relevant to some version of the -rt patch
if not that patch directly.

bill
RE: [REPORT] cfs-v4 vs sd-0.44
I don't know if we've discussed this or not. Since both CFS and SD
claim to be fair, I'd like to hear more opinions on the fairness aspect
of these designs. In areas such as OS, networking, and real-time,
fairness, and its more general form, proportional fairness, are
well-defined terms. In fact, perfect fairness is not feasible since it
requires all runnable threads to be running simultaneously and
scheduled with infinitesimally small quanta (like a fluid system). So
to evaluate if a new scheduling algorithm is fair, the common approach
is to take the ideal fair algorithm (often referred to as Generalized
Processor Scheduling or GPS) as a reference model and analyze if the
new algorithm can achieve a constant error bound (different error
metrics also exist).

I understand that via experiments we can show a design is reasonably
fair in the common case, but IMHO, to claim that a design is fair,
there needs to be some kind of formal analysis on the fairness bound,
and this bound should be proven to be constant. Even if the bound is
not constant, at least this analysis can help us better understand and
predict the degree of fairness that users would experience (e.g., would
the system be less fair if the number of threads increases? What
happens if a large number of threads dynamically join and leave the
system?).

  tong
Re: [REPORT] cfs-v4 vs sd-0.44
Linus Torvalds wrote:
> The "perfect" situation would be that when somebody goes to sleep, any
> extra points it had could be given to whoever it woke up last. Note that
> for something like X, it means that the points are 100% ephemeral: it gets
> points when a client sends it a request, but it would *lose* the points
> again when it sends the reply!
>
> So it would only accumulate "scheduling points" while multiple clients
> are actively waiting for it, which actually sounds like exactly the right
> thing. However, I don't really see how to do it well, especially since the
> kernel cannot actually match up the client that gave some scheduling
> points to the reply that X sends back.

This works out in quite an interesting way. If the economy is closed -
all clients and servers are managed by the same scheduler - then the
server could get no inherent CPU priority and live entirely on donated
shares. If that were the case, you'd have to make sure that the server
used the donation from client A on client A's work, otherwise you'd get
freeloaders - but maybe it will all work out.

It gets more interesting when you have a non-closed system - the X
server is working on behalf of external clients over TCP. Presumably
wakeups from incoming TCP connections wouldn't have any scheduler
shares associated with them, so the X server would have to use its
inherent CPU allocation to service those requests. Or the external
client could effectively end up freeloading off portions of the local
clients' donations.

    J
Re: [REPORT] cfs-v4 vs sd-0.44
2007/4/23, Ingo Molnar <[EMAIL PROTECTED]>:
>		p->wait_runtime >>= 1;
>		p_to->wait_runtime += p->wait_runtime;

I have no problem with clients giving some credit to X, I am more
concerned with X giving half of its credit to a single client, a
quarter of its credit to another client, etc... For example, a client
could set up a periodical wake-up from X, and then periodically get
some credit for free. Would that be possible?

Thanks.
--
Guillaume
Re: [REPORT] cfs-v4 vs sd-0.44
* Ingo Molnar <[EMAIL PROTECTED]> wrote:

> sorry, i was a bit imprecise here. There is a case where CFS can give
> out a 'loan' to tasks. The scheduler tick has a low resolution, so it
> is fundamentally inevitable [*] that tasks will run a bit more than
> they should, and at heavy context-switching rates these errors can
> add up significantly. Furthermore, we want to batch up workloads.
>
> So CFS has a "no loans larger than sched_granularity_ns" policy (which
> defaults to 5msec), and it captures these sub-granularity 'loans' with
> nanosec accounting. This too is a very sane economic policy and is
> anti-inflationary :-)

at which point i guess i should rename CFS to 'EFS' (the Economic Fair
Scheduler)? =B-)

	Ingo
Re: [REPORT] cfs-v4 vs sd-0.44
* Ingo Molnar <[EMAIL PROTECTED]> wrote:

> (we obviously dont want to allow people to 'share' their loans with
> others ;), nor do we want to allow a net negative balance. CFS is
> really brutally cold-hearted, it has a strict 'no loans' policy - the
> easiest economic way to manage 'inflation', besides the basic act of
> not printing new money, ever.)

sorry, i was a bit imprecise here. There is a case where CFS can give
out a 'loan' to tasks. The scheduler tick has a low resolution, so it
is fundamentally inevitable [*] that tasks will run a bit more than
they should, and at heavy context-switching rates these errors can add
up significantly. Furthermore, we want to batch up workloads.

So CFS has a "no loans larger than sched_granularity_ns" policy (which
defaults to 5msec), and it captures these sub-granularity 'loans' with
nanosec accounting. This too is a very sane economic policy and is
anti-inflationary :-)

	Ingo

[*] i fundamentally hate 'fundamentally inevitable' conditions so i
have plans to make the scheduler tick be fed from the rbtree and thus
become a true high-resolution timer. This not only increases fairness
(=='precision of scheduling') more, but it also decreases the number of
timer interrupts on a running system - extending dynticks to
sched-ticks too. Thomas and me shaped dynticks to enable that in an
easy way: the scheduler tick is today already a high-res timer (but
which is currently still driven via the jiffy mechanism).
Re: [REPORT] cfs-v4 vs sd-0.44
* Linus Torvalds <[EMAIL PROTECTED]> wrote:

> > The "give scheduler money" transaction can be both an "implicit
> > transaction" (for example when writing to UNIX domain sockets or
> > blocking on a pipe, etc.), or it could be an "explicit transaction":
> > sched_yield_to(). This latter i've already implemented for CFS, but
> > it's much less useful than the really significant implicit ones, the
> > ones which will help X.
>
> Yes. It would be wonderful to get it working automatically, so please
> say something about the implementation..

i agree that the devil will be in the details, but so far it's really
simple. I'll put all this into separate helper functions so that places
can just use it in a natural way. The existing yield-to bit is this:

static void yield_task_fair(struct rq *rq, struct task_struct *p,
			    struct task_struct *p_to)
{
	struct rb_node *curr, *next, *first;
	struct task_struct *p_next;

	/*
	 * yield-to support: if we are on the same runqueue then
	 * give half of our wait_runtime (if it's positive) to
	 * the other task:
	 */
	if (p_to && p->wait_runtime > 0) {
		p->wait_runtime >>= 1;
		p_to->wait_runtime += p->wait_runtime;
	}

the above is the basic expression of: "charge a positive bank balance".
(we obviously dont want to allow people to 'share' their loans with
others ;), nor do we want to allow a net negative balance. CFS is
really brutally cold-hearted, it has a strict 'no loans' policy - the
easiest economic way to manage 'inflation', besides the basic act of
not printing new money, ever.)

[note, due to the nanoseconds unit there's no rounding loss to worry
about.]

that's all. No runqueue locking, no wakeup decisions even! [Note: see
detail #1 below for cases where we need to touch the tree]. Really
low-overhead. Accumulated 'new money' will be acted upon in the next
schedule() call or in the next scheduler tick, whichever comes sooner.
Note that in most cases when tasks communicate there will be a natural
schedule() anyway, which drives this.
p->wait_runtime is also very finegrained: it is in nanoseconds, so a
task can 'pay' at arbitrary granularity, and there is in essence zero
'small coin overhead' and 'conversion loss' in this money system. (as
you might remember, sharing p->timeslice had inherent rounding and
sharing problems due to its low jiffy resolution)

detail #1: for decoupled workloads where there is no direct sleep/wake
coupling between worker and producer, there should also be a way to
update a task's position in the fairness tree, if it accumulates a
significant amount of new p->wait_runtime. I think this can be done by
making this an extra field: p->new_wait_runtime, which gets picked up
by the task if it runs, or which gets propagated into the task's tree
position if the p->new_wait_runtime value goes above the
sched_granularity_ns value. But it would work pretty well even without
this; the server will take advantage of the p->new_wait_runtime
immediately when it runs, so as long as enough clients 'feed' it with
money, it will always have enough to keep going.

detail #2: changes to p->wait_runtime are totally lockless, as long as
they are 64-bit atomic. So the above code is a bit naive on 32-bit
systems, but no locking is needed otherwise, other than having a stable
reference to a task structure. (i designed CFS for 64-bit systems)

detail #3: i suspect i should rename p->wait_runtime to a more
intuitive name - perhaps p->right_to_run? I want to avoid calling it
p->timeslice because it's not really a timeslice, it's the thing you
earned; the 'timeslice' is a totally locally decided property that has
no direct connection to this physical resource. I also dont want to
call it p->cpu_credit, because it is _not_ a credit system: every
positive value there has been earned the hard way: by 'working' for the
system via waiting on the runqueue - scaled down to the 'expected fair
runtime' - i.e. roughly scaled down by 1/rq->nr_running.
detail #4: the scheduler is also a charity: when it has no other work
left it will let tasks execute "for free" ;-) But otherwise, in any
sort of saturated market situation CFS is very much a cold hearted
capitalist.

about the 50% rule: it was a totally arbitrary case for yield_to(), and
in other cases it should rather be: "give me _all_ the money you have,
i'll make it work for you as much as i can". And the receiver should
also perhaps record the amount of 'money' it got from the client, and
_give back_ any unused proportion of it. (only where easily doable, in
1:1 task relationships) I.e.:

	p_to->wait_runtime = p->wait_runtime;
	p->wait_runtime = 0;
	schedule();

the former two lines put into a sched_pay(p) API perhaps?

> The "perfect" situation would be that when somebody goes to sleep, any
> extra points it had could be given to whoever it woke up last. Note
Re: [REPORT] cfs-v4 vs sd-0.44
On Mon, 23 Apr 2007, Ingo Molnar wrote:
>
> The "give scheduler money" transaction can be both an "implicit
> transaction" (for example when writing to UNIX domain sockets or
> blocking on a pipe, etc.), or it could be an "explicit transaction":
> sched_yield_to(). This latter i've already implemented for CFS, but it's
> much less useful than the really significant implicit ones, the ones
> which will help X.

Yes. It would be wonderful to get it working automatically, so please
say something about the implementation..

The "perfect" situation would be that when somebody goes to sleep, any
extra points it had could be given to whoever it woke up last. Note that
for something like X, it means that the points are 100% ephemeral: it gets
points when a client sends it a request, but it would *lose* the points
again when it sends the reply!

So it would only accumulate "scheduling points" while multiple clients
are actively waiting for it, which actually sounds like exactly the right
thing. However, I don't really see how to do it well, especially since the
kernel cannot actually match up the client that gave some scheduling
points to the reply that X sends back.

There are subtle semantics with these kinds of things: especially if the
scheduling points are only awarded when a process goes to sleep, if X is
busy and continues to use the CPU (for another client), it wouldn't give
any scheduling points back to clients and they really do accumulate with
the server. Which again sounds like it would be exactly the right thing
(both in the sense that the server that runs more gets more points, but
also in the sense that we *only* give points at actual scheduling events).

But how do you actually *give/track* points? A simple "last woken up by
this process" thing that triggers when it goes to sleep? It might work,
but on the other hand, especially with more complex things (and networking
tends to be pretty complex) the actual wakeup may be done by a software
irq.
Do we just say "it ran within the context of X, so we assume X was the
one that caused it?" It probably would work, but we've generally tried
very hard to avoid accessing "current" from interrupt context, including
bh's..

		Linus
Re: [REPORT] cfs-v4 vs sd-0.44
Hi ! On Mon, Apr 23, 2007 at 09:11:43PM +0200, Ingo Molnar wrote: > > * Linus Torvalds <[EMAIL PROTECTED]> wrote: > > > but the point I'm trying to make is that X shouldn't get more CPU-time > > because it's "more important" (it's not: and as noted earlier, > > thinking that it's more important skews the problem and makes for too > > *much* scheduling). X should get more CPU time simply because it > > should get it's "fair CPU share" relative to the *sum* of the clients, > > not relative to any client individually. > > yeah. And this is not a pipe dream and i think it does not need a > 'wakeup matrix' or other complexities. > > I am --->.< this close to being able to do this very robustly under > CFS via simple rules of economy and trade: there the p->wait_runtime > metric is intentionally a "physical resource" of "hard-earned right to > execute on the CPU, by having waited on it" the sum of which is bound > for the whole system. > > So while with other, heuristic approaches we always had the problem of > creating a "hyper-inflation" of an uneconomic virtual currency that > could be freely printed by certain tasks, in CFS the economy of this is > strict and the finegrained plus/minus balance is strictly managed by a > conservative and independent central bank. > > So we can actually let tasks "trade" in these very physical units of > "right to execute on the CPU". A task giving it to another task means > that this task _already gave up CPU time in the past_. So it's the > robust equivalent of an economy's "money earned" concept, and this > "money"'s distribution (and redistribution) is totally fair and totally > balanced and is not prone to "inflation". > > The "give scheduler money" transaction can be both an "implicit > transaction" (for example when writing to UNIX domain sockets or > blocking on a pipe, etc.), or it could be an "explicit transaction": > sched_yield_to(). 
> This latter i've already implemented for CFS, but it's
> much less useful than the really significant implicit ones, the ones
> which will help X.

I don't think that a task should _give_ its slice to the task it's
waiting on, but it should _lend_ it: if the second task (the server,
eg: X) does not eat everything, maybe the first one will need to use
the remains.

We had a good example with glxgears. Glxgears may need more CPU than X
on some machines, less on others. But it needs CPU. So it must not give
all it has to X otherwise it will stop. But it can tell X "hey, if you
need some CPU, I have some here, help yourself". When X has exhausted
its slice, it can then use some from the client. Hmmm no, better: X
first serves itself from the client's share, and may then use (parts
of) its own if it needs more. This would be seen like some CPU resource
buckets.

Indeed, it's even a problem of economy as you said. If you want someone
to do something for you, either it's very quick and simple and he can
do it for free once in a while, or you take all of his time and you
have to pay him for this time.

Of course, you don't always know whom X is working for, and this will
cause X to sometimes run for one task on another one's resources. But
as long as the work is done, it's OK. Hey, after all, many of us
sometimes work for customers in real life and take some of their time
to work on the kernel and everyone is happy with it.

I think that if we could have a (small) list of CPU buckets per task,
it would permit us to do such a thing. We would then have to ensure
that pipes or unix sockets correctly present their buckets to their
servers. If we consider that each task only has its own bucket and can
lend it to one and only one server at a time, it should not look too
horrible. Basically, just something like this (thinking while typing):

	struct task_struct {
		...
		struct {
			struct list list;
			int money_left;
		} cpu_bucket;
		...
	}

Then, waking up another process would consist in linking our bucket
into its own bucket list. The server can identify the task it's
borrowing from by looking at which task_struct the list belongs to.
Also, it creates some inheritance between processes. When doing such a
thing:

$ fgrep DST=1.2.3.4 fw.log | sed -e 's/1.2/A.B/' | gzip -c3 >fw-anon.gz

Then fgrep would lend some CPU to sed which in turn would present them
both to gzip. Maybe we need two lists in order for the structures to be
unstacked upon gzip's sleep() :-/

I don't know if I'm clear enough.

Cheers,
Willy
Re: [REPORT] cfs-v4 vs sd-0.44
* Linus Torvalds <[EMAIL PROTECTED]> wrote: > but the point I'm trying to make is that X shouldn't get more CPU-time > because it's "more important" (it's not: and as noted earlier, > thinking that it's more important skews the problem and makes for too > *much* scheduling). X should get more CPU time simply because it > should get it's "fair CPU share" relative to the *sum* of the clients, > not relative to any client individually. yeah. And this is not a pipe dream and i think it does not need a 'wakeup matrix' or other complexities. I am --->.< this close to being able to do this very robustly under CFS via simple rules of economy and trade: there the p->wait_runtime metric is intentionally a "physical resource" of "hard-earned right to execute on the CPU, by having waited on it" the sum of which is bound for the whole system. So while with other, heuristic approaches we always had the problem of creating a "hyper-inflation" of an uneconomic virtual currency that could be freely printed by certain tasks, in CFS the economy of this is strict and the finegrained plus/minus balance is strictly managed by a conservative and independent central bank. So we can actually let tasks "trade" in these very physical units of "right to execute on the CPU". A task giving it to another task means that this task _already gave up CPU time in the past_. So it's the robust equivalent of an economy's "money earned" concept, and this "money"'s distribution (and redistribution) is totally fair and totally balanced and is not prone to "inflation". The "give scheduler money" transaction can be both an "implicit transaction" (for example when writing to UNIX domain sockets or blocking on a pipe, etc.), or it could be an "explicit transaction": sched_yield_to(). This latter i've already implemented for CFS, but it's much less useful than the really significant implicit ones, the ones which will help X. 
	Ingo
Re: [REPORT] cfs-v4 vs sd-0.44
On Mon, 23 Apr 2007, Nick Piggin wrote:
> > If you have a single client, the X server is *not* more important than the
> > client, and indeed, renicing the X server causes bad patterns: just
> > because the client sends a request does not mean that the X server should
> > immediately be given the CPU as being "more important".
>
> If the client is doing some processing, and the user moves the mouse, it
> feels much more interactive if the pointer moves rather than waits for
> the client to finish processing.

.. yes. However, that should be automatically true if the X process just
has "enough CPU time" to merit being scheduled to. Which it normally
should always have, exactly because it's an "interactive" process
(regardless of how the scheduler is done - any scheduler should always
give sleepers good latency. The current one obviously does it by giving
interactivity-bonuses, CFS does it by trying to be fair in giving out
CPU time).

The problem tends to be the following scenario:

 - the X server is very CPU-busy, because it has lots of clients
   connecting to it, and it's not getting any "bonus" for doing work for
   those clients (ie it uses up its time-slice and thus becomes "less
   important" than other processes, since it's already gotten its "fair"
   slice of CPU - never mind that it was really unfair to not give it
   more)

 - there is some process that is *not* dependent on X, that can (and
   does) run, because X has spent its CPU time serving others.

but the point I'm trying to make is that X shouldn't get more CPU-time
because it's "more important" (it's not: and as noted earlier, thinking
that it's more important skews the problem and makes for too *much*
scheduling). X should get more CPU time simply because it should get its
"fair CPU share" relative to the *sum* of the clients, not relative to
any client individually.
Once you actually do give the X server its "fair share" of the CPU, I'm sure that you can still get into bad situations (trivial example: make clients that on purpose do X requests that are expensive for the server, but are cheap to generate). But it's likely not going to be an issue in practice any more.

Scheduling is not something you can do "perfectly". There's no point in even trying. To do "perfect" scheduling, you'd have to have ESP and know exactly what the user expects and know the future too. What you should aim for is the "obvious cases". And I don't think anybody really disputes the fact that a process that does work for other processes "obviously" should get the CPU time skewed towards it (and away from the clients - not from non-clients!). I think the only real issue is that nobody really knows how to do it well (or at all).

I think "schedule by user" would be reasonable in practice - not perfect by any means, but it *does* fall into the same class of issues: users are not in general "more important" than other users, but they should be treated fairly across each user, not on a per-process basis.

Linus
Re: [REPORT] cfs-v4 vs sd-0.44
On Sun, Apr 22, 2007 at 04:24:47PM -0700, Linus Torvalds wrote:
> On Sun, 22 Apr 2007, Juliusz Chroboczek wrote:
> > Why not do it in the X server itself? This will avoid controversial
> > policy in the kernel, and have the added advantage of working with
> > X servers that don't directly access hardware.
>
> It's wrong *wherever* you do it.
>
> The X server should not be re-niced. It was done in the past, and it was
> wrong then (and caused problems - we had to tell people to undo it,
> because some distros had started doing it by default).

The 2.6 scheduler can get very bad latency problems with the X server reniced.

> If you have a single client, the X server is *not* more important than the
> client, and indeed, renicing the X server causes bad patterns: just
> because the client sends a request does not mean that the X server should
> immediately be given the CPU as being "more important".

If the client is doing some processing, and the user moves the mouse, it feels much more interactive if the pointer moves rather than waits for the client to finish processing.
Re: [REPORT] cfs-v4 vs sd-0.44
On Sun, 2007-04-22 at 09:16 -0700, Ulrich Drepper wrote:
> On 4/22/07, William Lee Irwin III <[EMAIL PROTECTED]> wrote:
> > On Sun, Apr 22, 2007 at 12:17:31AM -0700, Ulrich Drepper wrote:
> > > For futex(), the extension is needed for the FUTEX_WAIT operation. We
> > > need a new operation FUTEX_WAIT_FOR or so which takes another (the
> > > fourth) parameter which is the PID of the target.
> > > For FUTEX_LOCK_PI we need no extension. The futex value is the PID of
> > > the current owner. This is required for the whole interface to work
> > > in the first place.
> >
> > We'll have to send things out and see what sticks here. There seems to
> > be some pickiness above.
>
> I know Rusty will shudder since it makes futexes yet more complicated
> (although only if the user wants it) but if you introduce the concept
> of "yield to" then this extension really makes sense and it is a quite
> simple extension. Plus: I'm the most affected by the change since I
> have to change code to use it and I'm fine with it.

Hi Uli,

I wouldn't worry: futexes long ago jumped the shark. I think it was inevitable that once we started endorsing programs bypassing the kernel for IPC that we'd want some form of yield_to(). And yield_to(p) has much more sane semantics than yield().

Cheers,
Rusty.
Re: [REPORT] cfs-v4 vs sd-0.44
On Sun, 22 Apr 2007, Juliusz Chroboczek wrote:
> Why not do it in the X server itself? This will avoid controversial
> policy in the kernel, and have the added advantage of working with
> X servers that don't directly access hardware.

It's wrong *wherever* you do it.

The X server should not be re-niced. It was done in the past, and it was wrong then (and caused problems - we had to tell people to undo it, because some distros had started doing it by default).

If you have a single client, the X server is *not* more important than the client, and indeed, renicing the X server causes bad patterns: just because the client sends a request does not mean that the X server should immediately be given the CPU as being "more important".

In other words, the things that make it important that the X server _can_ get CPU time if needed are all totally different from the X server being "more important". The X server is more important only in the presence of multiple clients, not on its own! Needing to renice it is a hack for a bad scheduler, and shows that somebody doesn't understand the problem!

Linus
Re: [REPORT] cfs-v4 vs sd-0.44
> Oh I definitely was not advocating against renicing X,

Why not do it in the X server itself? This will avoid controversial policy in the kernel, and have the added advantage of working with X servers that don't directly access hardware.

Con, if you tell me ``if you're running under Linux and such-and-such /sys variable has value so-and-so, then it's definitely a good idea to call nice(42) at the X server's start-up'', then I'll commit it into X.Org. (Please CC both me and the list, so I can point any people complaining to the archives.)

Juliusz
Re: [REPORT] cfs-v4 vs sd-0.44
On 4/22/07, William Lee Irwin III <[EMAIL PROTECTED]> wrote:
> On Sun, Apr 22, 2007 at 12:17:31AM -0700, Ulrich Drepper wrote:
> > For futex(), the extension is needed for the FUTEX_WAIT operation. We
> > need a new operation FUTEX_WAIT_FOR or so which takes another (the
> > fourth) parameter which is the PID of the target.
> > For FUTEX_LOCK_PI we need no extension. The futex value is the PID of
> > the current owner. This is required for the whole interface to work
> > in the first place.
>
> We'll have to send things out and see what sticks here. There seems to
> be some pickiness above.

I know Rusty will shudder since it makes futexes yet more complicated (although only if the user wants it) but if you introduce the concept of "yield to" then this extension really makes sense and it is a quite simple extension. Plus: I'm the most affected by the change since I have to change code to use it and I'm fine with it.

Oh, last time I didn't explicitly mention the cases of waitpid()/wait4()/waitid() naming a specific process to wait on. I think it's clear that those cases also should be changed to use yield-to if possible. I don't have a good suggestion what to do when the call waits for any child. Perhaps yielding to the last-created one is fine.

If delays through reading on a pipe are recognized as well and handled with yield-to, then the time slot will automatically be forwarded to the first runnable process in the pipe sequence. I.e., running

  grep foo /etc/passwd | cut -d: -f2 | crack

probably will create 'crack' last. Giving it the remainder of the time slot should result in recognizing that it waits for 'cut', which in turn waits for 'grep'. So in the end 'grep' gets the time slot. Seems quite complicated from the outside, but I can imagine quite good results from this.
Re: [REPORT] cfs-v4 vs sd-0.44
* Mark Lord <[EMAIL PROTECTED]> wrote:
> > i've not experienced a 'runaway X' personally, at most it would
> > crash or lock up ;) The value is boot-time and sysctl configurable
> > as well back to 0.
>
> Mmmm.. I've had to kill off the odd X that was locking in 100% CPU
> usage. In the past, this has happened maybe 1-3 times a year or so on
> my notebook.
>
> Now mind you, that usage could have been due to some client process,
> but X is where the 100% showed up, so X is what I nuked.

well, i just simulated a runaway X at nice -19 on CFS (on a UP box), and while the box was a tad laggy, i was able to killall it without problems, within 2 seconds that also included a 'su'. So it's not an issue in CFS, it can be turned off, and because every distro has another way to renice Xorg, this is a convenience hack until Xorg standardizes it into some xorg.conf field. (It also makes sure that X isn't preempted by other userspace stuff while it does timing-sensitive operations like setting the video modes up or switching video modes, etc.)

Ingo
Re: [REPORT] cfs-v4 vs sd-0.44
Ingo Molnar wrote:
> well, i just simulated a runaway X at nice -19 on CFS (on a UP box),
> and while the box was a tad laggy, i was able to killall it without
> problems, within 2 seconds that also included a 'su'. So it's not an
> issue in CFS, it can be turned off, and because every distro has
> another way to renice Xorg, this is a convenience hack until Xorg
> standardizes it into some xorg.conf field. (It also makes sure that X
> isn't preempted by other userspace stuff while it does timing-sensitive
> operations like setting the video modes up or switching video modes,
> etc.)

Good!
Re: [REPORT] cfs-v4 vs sd-0.44
Ingo Molnar wrote:
> * Jan Engelhardt <[EMAIL PROTECTED]> wrote:
> > > i've attached it below in a standalone form, feel free to put it
> > > into SD! :)
> >
> > Assume X went crazy (lacking any statistics, I make the unproven
> > statement that this happens more often than kthreads going berserk),
> > then having it niced with minus something is not too nice.
>
> i've not experienced a 'runaway X' personally, at most it would crash
> or lock up ;) The value is boot-time and sysctl configurable as well
> back to 0.

Mmmm.. I've had to kill off the odd X that was locking in 100% CPU usage. In the past, this has happened maybe 1-3 times a year or so on my notebook.

Now mind you, that usage could have been due to some client process, but X is where the 100% showed up, so X is what I nuked.

Cheers
Re: [REPORT] cfs-v4 vs sd-0.44
Con Kolivas wrote:
> Oh I definitely was not advocating against renicing X, I just suspect
> that virtually all the users who gave glowing reports to CFS comparing
> it to SD had no idea it had reniced X to -19 behind their back and that
> they were comparing it to SD running X at nice 0.

I really do wish I wouldn't feel the need to keep stepping in here to manually exclude my own results from such wide brush strokes.

I'm one of those "users", and I've never even tried CFS v4 (yet). All prior versions did NOT do the renicing. The renicing was in the CFS v4 announcement, right up front for all to see, and the code for it has been posted separately with encouragement for RSDL or whatever to also adopt it.

Now, with it in all of the various "me-too" schedulers, maybe they'll all start to shine a little more on real users' systems. So far, the stock 2.6.20 scheduler remains my own current preference, despite really good results with CFS v1.

Cheers
Re: [REPORT] cfs-v4 vs sd-0.44
On Sunday 22 April 2007, William Lee Irwin III wrote:
>On Sat, Apr 21, 2007 at 02:17:02PM -0400, Gene Heskett wrote:
>> CFS-v4 is quite smooth in terms of the users experience but after
>> prolonged observations approaching 24 hours, it appears to choke the cpu
>> hog off a bit even when the system has nothing else to do. My amanda runs
>> went from 1 to 1.5 hours depending on how much time it took gzip to handle
>> the amount of data tar handed it, up to about 165m & change, or nearly 3
>> hours pretty consistently over 5 runs.
>
>Welcome to infinite history. I'm not surprised, apart from the time
>scale of anomalies being much larger than I anticipated.

[...]

>Pardon my saying so but you appear to be describing anomalous behavior
>in terms of "scheduler warmups."

Well, that was what I saw; it took gzip about 4 or 5 minutes to get to the first 90% hit in htop's display, and it first hit the top of the display with only 5%. And the next backup run took about 2h:21m, so we're back in the ballpark. I'd reset amanda's schedule for a faster dumpcycle too, along with giving the old girl a new drive, all about the time we started playing with this, so the times I'm recording now may well be nominal.

I suppose I should boot a plain 2.6.21-rc7 and make a run & time that, but I don't enjoy masochism THAT much. :)

--
Cheers, Gene
"There are four boxes to be used in defense of liberty: soap, ballot, jury, and ammo. Please use in that order." -Ed Howdershelt (Author)
I've enjoyed just about as much of this as I can stand.
Re: [REPORT] cfs-v4 vs sd-0.44
On 4/22/07, William Lee Irwin III <[EMAIL PROTECTED]> wrote:
>> I'm just looking for what people want the API to be here. With that in
>> hand we can just go out and do whatever needs to be done.

On Sun, Apr 22, 2007 at 12:17:31AM -0700, Ulrich Drepper wrote:
> I think a sched_yield_to is one interface:
> int sched_yield_to(pid_t);

All clear on that front.

On Sun, Apr 22, 2007 at 12:17:31AM -0700, Ulrich Drepper wrote:
> For futex(), the extension is needed for the FUTEX_WAIT operation. We
> need a new operation FUTEX_WAIT_FOR or so which takes another (the
> fourth) parameter which is the PID of the target.
> For FUTEX_LOCK_PI we need no extension. The futex value is the PID of
> the current owner. This is required for the whole interface to work
> in the first place.

We'll have to send things out and see what sticks here. There seems to be some pickiness above.

-- wli
Re: [REPORT] cfs-v4 vs sd-0.44
On Sat, Apr 21, 2007 at 02:17:02PM -0400, Gene Heskett wrote:
> CFS-v4 is quite smooth in terms of the users experience but after prolonged
> observations approaching 24 hours, it appears to choke the cpu hog off a bit
> even when the system has nothing else to do. My amanda runs went from 1 to
> 1.5 hours depending on how much time it took gzip to handle the amount of
> data tar handed it, up to about 165m & change, or nearly 3 hours pretty
> consistently over 5 runs.

Welcome to infinite history. I'm not surprised, apart from the time scale of anomalies being much larger than I anticipated.

On Sat, Apr 21, 2007 at 02:17:02PM -0400, Gene Heskett wrote:
> sd-0.44 so far seems to be handling the same load (theres a backup running
> right now) fairly well also, and possibly theres a bit more snap to the
> system now. A switch to screen 1 from this screen 8, and the loading of that
> screen image, which is the Cassini shot of saturn from the backside, the one
> showing that teeny dot to the left of Saturn that is actually us, took 10
> seconds with the stock 2.6.21-rc7, 3 seconds with the best of Ingo's patches,
> and now with Con's latest, is 1 second flat. Another screen however is 4
> seconds, so maybe that first screen had been looked at since I rebooted.
> However, amanda is still getting estimates so gzip hasn't put a tiewrap
> around the kernels neck just yet.

Not sure what you mean by gzip putting a tiewrap around the kernel's neck. Could you clarify?

On Sat, Apr 21, 2007 at 02:17:02PM -0400, Gene Heskett wrote:
> Some minutes later, gzip is smunching /usr/src, and the machine doesn't even
> know its running as sd-0.44 isn't giving gzip more than 75% to gzip, and
> probably averaging less than 50%. And it scared me a bit as it started out at
> not over 5% for the first minute or so. Running in the 70's now according to
> gkrellm, with an occasional blip to 95%. And the machine generally feels
> good.

I wonder what's behind that sort of initial and steady-state behavior.

On Sat, Apr 21, 2007 at 02:17:02PM -0400, Gene Heskett wrote:
> I had previously given CFS-v4 a 95 score but that was before I saw
> the general slowdown, and I believe my first impression of this one
> is also a 95. This on a scale of the best one of the earlier CFS
> patches being 100, and stock 2.6.21-rc7 gets a 0.0. This scheduler
> seems to be giving gzip ever more cpu as time progresses, and the cpu
> is warming up quite nicely, from about 132F idling to 149.9F now.
> And my keyboard is still alive and well.
> Generally speaking, Con, I believe this one is also a keeper. And we'll see
> how long a backup run takes.

Pardon my saying so but you appear to be describing anomalous behavior in terms of "scheduler warmups."
Re: [REPORT] cfs-v4 vs sd-0.44
On 4/22/07, William Lee Irwin III <[EMAIL PROTECTED]> wrote:
> I'm just looking for what people want the API to be here. With that in
> hand we can just go out and do whatever needs to be done.

I think a sched_yield_to is one interface:

  int sched_yield_to(pid_t);

For futex(), the extension is needed for the FUTEX_WAIT operation. We need a new operation FUTEX_WAIT_FOR or so which takes another (the fourth) parameter which is the PID of the target.

For FUTEX_LOCK_PI we need no extension. The futex value is the PID of the current owner. This is required for the whole interface to work in the first place.
Re: [REPORT] cfs-v4 vs sd-0.44
On 4/21/07, Linus Torvalds <[EMAIL PROTECTED]> wrote:
>> And how the hell do you imagine you'd even *know* what thread holds the
>> futex?

On Sat, Apr 21, 2007 at 06:46:58PM -0700, Ulrich Drepper wrote:
> We know this in most cases. This is information recorded, for
> instance, in the mutex data structure. You might have missed my "the
> interface must be extended" part. This means the PID of the owning
> thread will have to be passed down. For PI mutexes this is not
> necessary since the kernel already has access to the information.

I'm just looking for what people want the API to be here. With that in hand we can just go out and do whatever needs to be done.

-- wli
Re: [REPORT] cfs-v4 vs sd-0.44
On Sun, 2007-04-22 at 10:08 +1000, Con Kolivas wrote:
> On Sunday 22 April 2007 08:54, Denis Vlasenko wrote:
> > On Saturday 21 April 2007 18:00, Ingo Molnar wrote:
> > > correct. Note that Willy reniced X back to 0 so it had no relevance on
> > > his test. Also note that i pointed this change out in the -v4 CFS
> > > announcement:
> > > || Changes since -v3:
> > > ||
> > > || - usability fix: automatic renicing of kernel threads such as
> > > ||   keventd, OOM tasks and tasks doing privileged hardware access
> > > ||   (such as Xorg).
> > >
> > > i've attached it below in a standalone form, feel free to put it into
> > > SD! :)
> >
> > But X problems have nothing to do with "privileged hardware access".
> > X problems are related to priority inversions between server and client
> > processes, and "one server process - many client processes" case.
>
> It's not a privileged hardware access reason that this code is there. This is
> obfuscation/advertising to make it look like there is a valid reason for X
> getting negative nice levels somehow in the kernel to make interactive
> testing of CFS better by default.

That's not a very nice thing to say, and it has no benefit unless you specifically want to run multiple heavy X-hitting clients. I boot with that feature disabled specifically to be able to measure fairness in a pure environment, and it's still _much_ smoother and snappier than any RSDL/SD kernel I ever tried.

-Mike
Re: [REPORT] cfs-v4 vs sd-0.44
Con Kolivas wrote:
> On Sunday 22 April 2007 02:00, Ingo Molnar wrote:
> > * Con Kolivas <[EMAIL PROTECTED]> wrote:
> > > > Feels even better, mouse movements are very smooth even under high
> > > > load. I noticed that X gets reniced to -19 with this scheduler.
> > > > I've not looked at the code yet but this looked suspicious to me.
> > > > I've reniced it to 0 and it did not change any behaviour. Still
> > > > very good.
> > >
> > > Looks like this code does it:
> > >
> > > +int sysctl_sched_privileged_nice_level __read_mostly = -19;
> >
> > correct.
>
> Oh I definitely was not advocating against renicing X, I just suspect that
> virtually all the users who gave glowing reports to CFS comparing it to SD
> had no idea it had reniced X to -19 behind their back and that they were
> comparing it to SD running X at nice 0. I think had they been comparing
> CFS with X nice -19 to SD running nice -10 in this interactivity soft and
> squishy comparison land their thoughts might have been different. I missed
> it in the announcement and had to go looking in the code since Willy just
> kinda tripped over it unwittingly as well.

I tried this with the vesa driver of X, and reflect from the mesa-demos heavily starves new window creation on cfs-v4 with X niced -19. X reniced to 0 removes these starves. On SD, X reniced to -10 works great.

Thanks!

-- Al
Re: [REPORT] cfs-v4 vs sd-0.44
On Saturday 21 April 2007, Con Kolivas wrote:
>On Sunday 22 April 2007 04:17, Gene Heskett wrote:
>> More first impressions of sd-0.44 vs CFS-v4
>
>Thanks Gene.
>
>> CFS-v4 is quite smooth in terms of the users experience but after
>> prolonged observations approaching 24 hours, it appears to choke the cpu
>> hog off a bit even when the system has nothing else to do. My amanda runs
>> went from 1 to 1.5 hours depending on how much time it took gzip to handle
>> the amount of data tar handed it, up to about 165m & change, or nearly 3
>> hours pretty consistently over 5 runs.
>>
>> sd-0.44 so far seems to be handling the same load (theres a backup running
>> right now) fairly well also, and possibly theres a bit more snap to the
>> system now. A switch to screen 1 from this screen 8, and the loading of
>> that screen image, which is the Cassini shot of saturn from the backside,
>> the one showing that teeny dot to the left of Saturn that is actually us,
>> took 10 seconds with the stock 2.6.21-rc7, 3 seconds with the best of
>> Ingo's patches, and now with Con's latest, is 1 second flat. Another
>> screen however is 4 seconds, so maybe that first screen had been looked at
>> since I rebooted. However, amanda is still getting estimates so gzip
>> hasn't put a tiewrap around the kernels neck just yet.
>>
>> Some minutes later, gzip is smunching /usr/src, and the machine doesn't
>> even know its running as sd-0.44 isn't giving gzip more than 75% to gzip,
>> and probably averaging less than 50%. And it scared me a bit as it started
>> out at not over 5% for the first minute or so. Running in the 70's now
>> according to gkrellm, with an occasional blip to 95%. And the machine
>> generally feels good.
>>
>> I had previously given CFS-v4 a 95 score but that was before I saw the
>> general slowdown, and I believe my first impression of this one is also a
>> 95. This on a scale of the best one of the earlier CFS patches being 100,
>> and stock 2.6.21-rc7 gets a 0.0. This scheduler seems to be giving gzip
>> ever more cpu as time progresses, and the cpu is warming up quite nicely,
>> from about 132F idling to 149.9F now. And my keyboard is still alive and
>> well.
>
>I'm not sure how much weight to put on what you see as the measured cpu
>usage. I have a feeling it's being wrongly reported in SD currently.
>Concentrate more on the actual progress and behaviour of things as you've
>already done.
>
>> Generally speaking, Con, I believe this one is also a keeper. And we'll
>> see how long a backup run takes.

It looks as if it could have been 10 minutes quicker according to amplot, but that's entirely within the expected variations that amanda's scheduler might do to it. But the one that just finished, running under CFS-v5, was only 1h:47m, not including the verify run. The previous backup, using sd-0.44, took 2h:28m for a similar but not identical operation according to amplot. That's a big enough diff to be an indicator I believe, but without knowing how much of that time was burned by gzip, it's an apples-and-oranges compare. We'll see if it repeats; I coded 'catchup' to do 2 in a row.

>Great thanks for feedback.

You're quite welcome, Con. ATM I'm doing the same thing again but booted to a CFS-v5 delta that Ingo sent me privately, and except for the kmail lag/freezes everything is cool except the cpu: it managed to hit 150.7F during the height of one of the gzip --best smunching operations. I believe the /dev/hdd writes are cranked well up from the earlier CFS patches also. Unforch, this isn't something that's been coded into amplot, so I'm stuck watching the hdd display in gkrellm and making SWAGs. And we all know what they are worth. I've made a lot of them in my 72 years, and my track record, with some glaring exceptions like my 2nd wife that I won't bore you with the details of, has been fairly decent. :)

--
Cheers, Gene
"There are four boxes to be used in defense of liberty: soap, ballot, jury, and ammo. Please use in that order." -Ed Howdershelt (Author)
You will be run over by a beer truck.
Re: [REPORT] cfs-v4 vs sd-0.44
On Saturday 21 April 2007 22:12, Willy Tarreau wrote:
> 2) SD-0.44
>
> Feels good, but becomes jerky at moderately high loads. I've started
> 64 ocbench with a 250 ms busy loop and 750 ms sleep time. The system
> always responds correctly but under X, mouse jumps quite a bit and
> typing in xterm or even text console feels slightly jerky. The CPU is
> not completely used, and the load varies a lot (see below). However,
> the load is shared equally between all 64 ocbench, and they do not
> deviate even after 4000 iterations. X uses less than 1% CPU during
> those tests.

Found it. I broke SMP balancing again so there is serious scope for improvement on SMP hardware. That explains the huge load variations. Expect yet another fix soon, which should improve behaviour further :)

--
-ck
Re: [REPORT] cfs-v4 vs sd-0.44
On 4/21/07, Linus Torvalds <[EMAIL PROTECTED]> wrote:
> And how the hell do you imagine you'd even *know* what thread holds the
> futex?

We know this in most cases. This is information recorded, for instance, in the mutex data structure. You might have missed my "the interface must be extended" part. This means the PID of the owning thread will have to be passed down. For PI mutexes this is not necessary since the kernel already has access to the information.

> The whole point of the "f" part of the mutex is that it's fast, and we
> never see the non-contended case in the kernel.

See above. Believe me, I know how futexes work. But I also know what additional information we collect. For mutexes and in part for rwlocks we know which thread owns the sync object. In that case we can easily provide the kernel with the information.
Re: [REPORT] cfs-v4 vs sd-0.44
On Sunday 22 April 2007 04:17, Gene Heskett wrote:
> More first impressions of sd-0.44 vs CFS-v4

Thanks Gene.

> CFS-v4 is quite smooth in terms of the users experience but after prolonged
> observations approaching 24 hours, it appears to choke the cpu hog off a
> bit even when the system has nothing else to do. My amanda runs went from
> 1 to 1.5 hours depending on how much time it took gzip to handle the amount
> of data tar handed it, up to about 165m & change, or nearly 3 hours pretty
> consistently over 5 runs.
>
> sd-0.44 so far seems to be handling the same load (theres a backup running
> right now) fairly well also, and possibly theres a bit more snap to the
> system now. A switch to screen 1 from this screen 8, and the loading of
> that screen image, which is the Cassini shot of saturn from the backside,
> the one showing that teeny dot to the left of Saturn that is actually us,
> took 10 seconds with the stock 2.6.21-rc7, 3 seconds with the best of
> Ingo's patches, and now with Con's latest, is 1 second flat. Another screen
> however is 4 seconds, so maybe that first screen had been looked at since I
> rebooted. However, amanda is still getting estimates so gzip hasn't put a
> tiewrap around the kernels neck just yet.
>
> Some minutes later, gzip is smunching /usr/src, and the machine doesn't
> even know its running as sd-0.44 isn't giving gzip more than 75% to gzip,
> and probably averaging less than 50%. And it scared me a bit as it started
> out at not over 5% for the first minute or so. Running in the 70's now
> according to gkrellm, with an occasional blip to 95%. And the machine
> generally feels good.
>
> I had previously given CFS-v4 a 95 score but that was before I saw the
> general slowdown, and I believe my first impression of this one is also a
> 95. This on a scale of the best one of the earlier CFS patches being 100,
> and stock 2.6.21-rc7 gets a 0.0. This scheduler seems to be giving gzip
> ever more cpu as time progresses, and the cpu is warming up quite nicely,
> from about 132F idling to 149.9F now. And my keyboard is still alive and
> well.

I'm not sure how much weight to put on what you see as the measured cpu usage. I have a feeling it's being wrongly reported in SD currently. Concentrate more on the actual progress and behaviour of things as you've already done.

> Generally speaking, Con, I believe this one is also a keeper. And we'll
> see how long a backup run takes.

Great, thanks for feedback.

--
-ck
Re: [REPORT] cfs-v4 vs sd-0.44
On Sunday 22 April 2007 08:54, Denis Vlasenko wrote:
> On Saturday 21 April 2007 18:00, Ingo Molnar wrote:
> > correct. Note that Willy reniced X back to 0 so it had no relevance on
> > his test. Also note that i pointed this change out in the -v4 CFS
> > announcement:
> >
> > || Changes since -v3:
> > ||
> > || - usability fix: automatic renicing of kernel threads such as
> > ||   keventd, OOM tasks and tasks doing privileged hardware access
> > ||   (such as Xorg).
> >
> > i've attached it below in a standalone form, feel free to put it into
> > SD! :)
>
> But X problems have nothing to do with "privileged hardware access".
> X problems are related to priority inversions between server and client
> processes, and the "one server process - many client processes" case.

It's not for privileged-hardware-access reasons that this code is there. This is obfuscation/advertising to make it look like there is a valid reason for X getting negative nice levels somehow in the kernel, to make interactive testing of CFS better by default.

-- 
-ck
Re: [REPORT] cfs-v4 vs sd-0.44
On Sunday 22 April 2007 02:00, Ingo Molnar wrote:
> * Con Kolivas <[EMAIL PROTECTED]> wrote:
> > > Feels even better, mouse movements are very smooth even under high
> > > load. I noticed that X gets reniced to -19 with this scheduler.
> > > I've not looked at the code yet but this looked suspicious to me.
> > > I've reniced it to 0 and it did not change any behaviour. Still
> > > very good.
> >
> > Looks like this code does it:
> >
> > +int sysctl_sched_privileged_nice_level __read_mostly = -19;
>
> correct.

Oh I definitely was not advocating against renicing X. I just suspect that virtually all the users who gave glowing reports to CFS comparing it to SD had no idea it had reniced X to -19 behind their back, and that they were comparing it to SD running X at nice 0. I think had they been comparing CFS with X at nice -19 to SD running it at nice -10, in this soft and squishy interactivity-comparison land, their thoughts might have been different. I missed it in the announcement and had to go looking in the code, since Willy just kinda tripped over it unwittingly as well.

> Note that Willy reniced X back to 0 so it had no relevance on
> his test.

Oh yes I did notice that, but since the array swap is the remaining longest deadline in SD, which would cause noticeable jerks, renicing X on SD by default would make the experience very different, since reniced tasks do much better over array swaps compared to non-niced tasks. I really should go and make the whole thing one circular list and blow away the array swap (if I can figure out how to do it).

> Also note that i pointed this change out in the -v4 CFS
> announcement:
>
> || Changes since -v3:
> ||
> || - usability fix: automatic renicing of kernel threads such as
> ||   keventd, OOM tasks and tasks doing privileged hardware access
> ||   (such as Xorg).

Reading the changelog in the gloss-over fashion that I unfortunately did, even I missed it.

> i've attached it below in a standalone form, feel free to put it into
> SD! :)

Hmm, well, I have tried my very best to do all the changes without changing "policy" as much as possible, since that trips over so many emotive issues that no one can agree on, and I don't have a strong opinion on this, as I thought it would be better for it to be a config option for X in userspace instead. Either way it needs to be turned on/off by the admin, and doing it by default in the kernel is... not universally accepted as good. What else accesses ioports that can get privileged nice levels? Does this make it relatively exploitable just by poking an ioport?

> Ingo
>
> ---
> arch/i386/kernel/ioport.c   | 13 ++---
> arch/x86_64/kernel/ioport.c |  8 ++--
> drivers/block/loop.c        |  5 -
> include/linux/sched.h       |  7 +++
> kernel/sched.c              | 40

Thanks for the patch. I'll consider it. Since end users are testing this in fuzzy interactivity land, I may simply be forced to do this just for comparisons to be meaningful between CFS and SD; otherwise they're not really comparing them on a level playing field. I had almost given up SD for dead meat with all the momentum CFS had gained... until recently.

-- 
-ck
Re: [REPORT] cfs-v4 vs sd-0.44
On Sat, 21 Apr 2007, Ulrich Drepper wrote:
>
> If you do this, and it has been requested many times, then please
> generalize it. We have the same issue with futexes. If a FUTEX_WAIT
> call is issued the remaining time in the slot should be given to the
> thread currently owning the futex.

And how the hell do you imagine you'd even *know* what thread holds the futex?

The whole point of the "f" part of the mutex is that it's fast, and we never see the non-contended case in the kernel.

So we know who *blocks*, but we don't know who actually didn't block.

Linus
Re: [REPORT] cfs-v4 vs sd-0.44
On 4/21/07, Kyle Moffett <[EMAIL PROTECTED]> wrote:
>> It might be nice if it was possible to actively contribute your CPU
>> time to a child process. For example:
>> int sched_donate(pid_t pid, struct timeval *time, int percentage);

On Sat, Apr 21, 2007 at 12:49:52PM -0700, Ulrich Drepper wrote:
> If you do this, and it has been requested many times, then please
> generalize it. We have the same issue with futexes. If a FUTEX_WAIT
> call is issued the remaining time in the slot should be given to the
> thread currently owning the futex. For non-PI futexes this needs an
> extension of the interface but I would be up for that. It can have
> big benefits on the throughput of an application.

It's encouraging to hear support for a more full-featured API (or, for that matter, any response at all) on this front.

-- wli
Re: [REPORT] cfs-v4 vs sd-0.44
On Saturday 21 April 2007 18:00, Ingo Molnar wrote:
> correct. Note that Willy reniced X back to 0 so it had no relevance on
> his test. Also note that i pointed this change out in the -v4 CFS
> announcement:
>
> || Changes since -v3:
> ||
> || - usability fix: automatic renicing of kernel threads such as
> ||   keventd, OOM tasks and tasks doing privileged hardware access
> ||   (such as Xorg).
>
> i've attached it below in a standalone form, feel free to put it into
> SD! :)

But X problems have nothing to do with "privileged hardware access". X problems are related to priority inversions between server and client processes, and the "one server process - many client processes" case.

I think the synchronous nature of Xlib (clients cannot fire-and-forget their commands to the X server; with Xlib each command waits for an ACK from the server) also adds some amount of pain.

-- vda
Re: [REPORT] cfs-v4 vs sd-0.44
On 4/21/07, Kyle Moffett <[EMAIL PROTECTED]> wrote:
> It might be nice if it was possible to actively contribute your CPU
> time to a child process. For example:
>
> int sched_donate(pid_t pid, struct timeval *time, int percentage);

If you do this, and it has been requested many times, then please generalize it. We have the same issue with futexes. If a FUTEX_WAIT call is issued the remaining time in the slot should be given to the thread currently owning the futex. For non-PI futexes this needs an extension of the interface but I would be up for that. It can have big benefits on the throughput of an application.
Re: [REPORT] cfs-v4 vs sd-0.44
* Jan Engelhardt <[EMAIL PROTECTED]> wrote:

> > i've attached it below in a standalone form, feel free to put it
> > into SD! :)
>
> Assume X went crazy (lacking any statistics, I make the unproven
> statement that this happens more often than kthreads going berserk),
> then having it niced with minus something is not too nice.

i've not experienced a 'runaway X' personally, at most it would crash or lock up ;) The value is boot-time and sysctl configurable as well back to 0.

	Ingo
Re: [REPORT] cfs-v4 vs sd-0.44
On Apr 21, 2007, at 12:42:41, William Lee Irwin III wrote:
> On Sat, 21 Apr 2007, Willy Tarreau wrote:
>> If you remember, with 50/50, I noticed some difficulties to fork many
>> processes. I think that during a fork(), the parent has a higher
>> probability of forking other processes than the child. So at least,
>> we should use something like 67/33 or 75/25 for parent/child.
>
> On Sat, Apr 21, 2007 at 09:34:07AM -0700, Linus Torvalds wrote:
>> It would be even better to simply have the rule:
>>  - child gets almost no points at startup
>>  - but when a parent does a "waitpid()" call and blocks, it will spread
>>    out its points to the children (the "vfork()" blocking is another
>>    case that is really the same).
>> This is a very special kind of "priority inversion" logic: you give
>> higher priority to the things you wait for. Not because of holding any
>> locks, but simply because a blocking waitpid really is a damn big hint
>> that "ok, the child now works for the parent".
>
> An in-kernel scheduler API might help. void yield_to(struct task_struct *)?
> A userspace API might be nice, too. e.g. int sched_yield_to(pid_t).

It might be nice if it was possible to actively contribute your CPU time to a child process. For example:

	int sched_donate(pid_t pid, struct timeval *time, int percentage);

Maybe a way to pass CPU time over a UNIX socket (analogous to SCM_RIGHTS), along with information on what process/user passed it. That would make it possible to really fix X properly on a local system. You could make the X client library pass CPU time to the X server whenever it requests a CPU-intensive rendering operation. Ordinarily X would nice all of its client service threads to +10, but when a client passes CPU time to its thread over the socket, then its service thread temporarily gets the scheduling properties of the client. I'm not a scheduler guru, but that's what makes the most sense from an application-programmer point of view.

Cheers, Kyle Moffett
Re: [REPORT] cfs-v4 vs sd-0.44
On Saturday 21 April 2007, Willy Tarreau wrote:
>Hi Ingo, Hi Con,
>
>I promised to perform some tests on your code. I'm short in time right now,
>but I observed behaviours that should be commented on.
>
>1) machine: dual athlon 1533 MHz, 1G RAM, kernel 2.6.21-rc7 + either
>   scheduler. Test: ./ocbench -R 25 -S 75 -x 8 -y 8
>   ocbench: http://linux.1wt.eu/sched/
>
>2) SD-0.44
>
>   Feels good, but becomes jerky at moderately high loads. I've started
>   64 ocbench with a 250 ms busy loop and 750 ms sleep time. The system
>   always responds correctly but under X, mouse jumps quite a bit and
>   typing in xterm or even text console feels slightly jerky. The CPU is
>   not completely used, and the load varies a lot (see below). However,
>   the load is shared equally between all 64 ocbench, and they do not
>   deviate even after 4000 iterations. X uses less than 1% CPU during
>   those tests.
>
>   [vmstat 1 output: column alignment lost in the archive; it showed CPU
>   usage varying widely, with substantial idle time.]
>
>3) CFS-v4
>
>   Feels even better, mouse movements are very smooth even under high load.
>   I noticed that X gets reniced to -19 with this scheduler. I've not looked
>   at the code yet but this looked suspicious to me. I've reniced it to 0
>   and it did not change any behaviour. Still very good. The 64 ocbench
>   share equal CPU time and show exact same progress after 2000 iterations.
>   The CPU load is more smoothly spread according to vmstat, and there's no
>   idle (see below). BUT I now think it was wrong to let new processes
>   start with no timeslice at all, because it can take tens of seconds to
>   start a new process when only 64 ocbench are there. Simply starting
>   "killall ocbench" takes about 10 seconds. On a smaller machine
>   (VIA C3-533), it took me more than one minute to do "su -", even from
>   console, so that's not X. BTW, X uses less than 1% CPU during those
>   tests.
>
>   [vmstat 1 output: column alignment lost in the archive; it showed the
>   CPU fully busy with no idle time.]
Re: [REPORT] cfs-v4 vs sd-0.44
On 4/21/07, Ingo Molnar <[EMAIL PROTECTED]> wrote:
> on a simple 'ls' command:
>
>   21310 clone(child_stack=0, ...) = 21399
>   ...
>   21399 execve("/bin/ls",
>   ...
>   21310 waitpid(-1,
>
> the PID is -1 so we dont actually know which task we are waiting for.

That's a special case. Most programs don't do this. In fact, in multi-threaded code you better never do it, since such an unqualified wait might catch the child another thread waits for (particularly bad if one thread uses system()).

And even in the case of bash, we probably can change the code to use a qualified wait in case there are no other children. This is known at any time, and I expect that most of the time there are no background processes. At least in shell scripts.
Re: [REPORT] cfs-v4 vs sd-0.44
On Apr 21 2007 18:00, Ingo Molnar wrote:
>* Con Kolivas <[EMAIL PROTECTED]> wrote:
>
>> > Feels even better, mouse movements are very smooth even under high
>> > load. I noticed that X gets reniced to -19 with this scheduler.
>> > I've not looked at the code yet but this looked suspicious to me.
>> > I've reniced it to 0 and it did not change any behaviour. Still
>> > very good.
>>
>> Looks like this code does it:
>>
>> +int sysctl_sched_privileged_nice_level __read_mostly = -19;
>
>correct. Note that Willy reniced X back to 0 so it had no relevance on
>his test. Also note that i pointed this change out in the -v4 CFS
>announcement:
>
>|| Changes since -v3:
>||
>|| - usability fix: automatic renicing of kernel threads such as
>||   keventd, OOM tasks and tasks doing privileged hardware access
>||   (such as Xorg).
>
>i've attached it below in a standalone form, feel free to put it into
>SD! :)

Assume X went crazy (lacking any statistics, I make the unproven statement that this happens more often than kthreads going berserk), then having it niced with minus something is not too nice.

Jan
--
Re: [REPORT] cfs-v4 vs sd-0.44
On Apr 21, 2007, at 12:18, Willy Tarreau wrote:
> Also, I believe that (in shells), most forked processes do not even
> consume a full timeslice (eg: $(uname -n) is very fast). This means that
> assigning them with a shorter one will not hurt them while preserving
> the shell's performance against CPU hogs.

On a fast machine, during regression testing of GCC, I've noticed we create an average of 500 processes per second during an hour or so. There are other workloads like this. So, most processes start, execute and complete in 2ms. How does fairness work in a situation like this?

-Geert
Re: [REPORT] cfs-v4 vs sd-0.44
On Sat, Apr 21, 2007 at 06:53:47PM +0200, Ingo Molnar wrote:
>
> * Linus Torvalds <[EMAIL PROTECTED]> wrote:
>
> > It would be even better to simply have the rule:
> >  - child gets almost no points at startup
> >  - but when a parent does a "waitpid()" call and blocks, it will spread
> >    out its points to the children (the "vfork()" blocking is another
> >    case that is really the same).
> >
> > This is a very special kind of "priority inversion" logic: you give
> > higher priority to the things you wait for. Not because of holding any
> > locks, but simply because a blocking waitpid really is a damn big hint
> > that "ok, the child now works for the parent".
>
> yeah. One problem i can see with the implementation of this though is
> that shells typically do nonspecific waits - for example bash does this
> on a simple 'ls' command:
>
>   21310 clone(child_stack=0, ...) = 21399
>   ...
>   21399 execve("/bin/ls",
>   ...
>   21310 waitpid(-1,
>
> the PID is -1 so we dont actually know which task we are waiting for. We
> could use the first entry from the p->children list, but that looks too
> specific of a hack to me. It should catch most of the
> synchronous-helper-task cases though.

The last one should be more appropriate IMHO. If you waitpid(), it's very likely that you're waiting for the result of the very last fork().

Willy
Re: [REPORT] cfs-v4 vs sd-0.44
On Sat, Apr 21, 2007 at 09:34:07AM -0700, Linus Torvalds wrote:
>
> On Sat, 21 Apr 2007, Willy Tarreau wrote:
> >
> > If you remember, with 50/50, I noticed some difficulties to fork many
> > processes. I think that during a fork(), the parent has a higher
> > probability of forking other processes than the child. So at least, we
> > should use something like 67/33 or 75/25 for parent/child.
>
> It would be even better to simply have the rule:
>  - child gets almost no points at startup
>  - but when a parent does a "waitpid()" call and blocks, it will spread
>    out its points to the children (the "vfork()" blocking is another
>    case that is really the same).
>
> This is a very special kind of "priority inversion" logic: you give higher
> priority to the things you wait for. Not because of holding any locks, but
> simply because a blocking waitpid really is a damn big hint that "ok, the
> child now works for the parent".

I like this idea a lot. I don't know if it can be applied to pipes and unix sockets, but it's clearly a way of saying "hurry up, I'm waiting for you", which seems natural with inter-process communications. Also, if we can do this on unix sockets, it would help a lot with X!

Willy
Re: [REPORT] cfs-v4 vs sd-0.44
* Linus Torvalds <[EMAIL PROTECTED]> wrote:

> It would be even better to simply have the rule:
>  - child gets almost no points at startup
>  - but when a parent does a "waitpid()" call and blocks, it will spread
>    out its points to the children (the "vfork()" blocking is another
>    case that is really the same).
>
> This is a very special kind of "priority inversion" logic: you give
> higher priority to the things you wait for. Not because of holding any
> locks, but simply because a blocking waitpid really is a damn big hint
> that "ok, the child now works for the parent".

yeah. One problem i can see with the implementation of this though is that shells typically do nonspecific waits - for example bash does this on a simple 'ls' command:

  21310 clone(child_stack=0, ...) = 21399
  ...
  21399 execve("/bin/ls",
  ...
  21310 waitpid(-1,

the PID is -1 so we dont actually know which task we are waiting for. We could use the first entry from the p->children list, but that looks too specific of a hack to me. It should catch most of the synchronous-helper-task cases though.

	Ingo
Re: [REPORT] cfs-v4 vs sd-0.44
On Sat, 21 Apr 2007, Willy Tarreau wrote:
>> If you remember, with 50/50, I noticed some difficulties to fork many
>> processes. I think that during a fork(), the parent has a higher
>> probability of forking other processes than the child. So at least, we
>> should use something like 67/33 or 75/25 for parent/child.

On Sat, Apr 21, 2007 at 09:34:07AM -0700, Linus Torvalds wrote:
> It would be even better to simply have the rule:
>  - child gets almost no points at startup
>  - but when a parent does a "waitpid()" call and blocks, it will spread
>    out its points to the children (the "vfork()" blocking is another
>    case that is really the same).
> This is a very special kind of "priority inversion" logic: you give higher
> priority to the things you wait for. Not because of holding any locks, but
> simply because a blocking waitpid really is a damn big hint that "ok, the
> child now works for the parent".

An in-kernel scheduler API might help. void yield_to(struct task_struct *)? A userspace API might be nice, too. e.g. int sched_yield_to(pid_t).

-- wli
Re: [REPORT] cfs-v4 vs sd-0.44
On Sat, Apr 21, 2007 at 06:00:08PM +0200, Ingo Molnar wrote:
> arch/i386/kernel/ioport.c   | 13 ++---
> arch/x86_64/kernel/ioport.c |  8 ++--
> drivers/block/loop.c        |  5 -
> include/linux/sched.h       |  7 +++
> kernel/sched.c              | 40
> kernel/workqueue.c          |  2 +-
> mm/oom_kill.c               |  4 +++-
> 7 files changed, 71 insertions(+), 8 deletions(-)

Yum. I'm going to see what this does for glxgears (I presume it's a screensaver) on my dual G5 driving a 42" wall-mounted TV for a display. ;)

More seriously, there should be more portable ways of doing this. I suspect even someone using fbdev on i386/x86-64 might be left out here.

-- wli
Re: [REPORT] cfs-v4 vs sd-0.44
On Sat, 21 Apr 2007, Willy Tarreau wrote:
>
> If you remember, with 50/50, I noticed some difficulties to fork many
> processes. I think that during a fork(), the parent has a higher
> probability of forking other processes than the child. So at least, we
> should use something like 67/33 or 75/25 for parent/child.

It would be even better to simply have the rule:
 - child gets almost no points at startup
 - but when a parent does a "waitpid()" call and blocks, it will spread
   out its points to the children (the "vfork()" blocking is another
   case that is really the same).

This is a very special kind of "priority inversion" logic: you give higher priority to the things you wait for. Not because of holding any locks, but simply because a blocking waitpid really is a damn big hint that "ok, the child now works for the parent".

Linus
Re: [REPORT] cfs-v4 vs sd-0.44
On Sat, Apr 21, 2007 at 05:46:14PM +0200, Ingo Molnar wrote:
>
> * Willy Tarreau <[EMAIL PROTECTED]> wrote:
>
> > I promised to perform some tests on your code. I'm short in time right
> > now, but I observed behaviours that should be commented on.
>
> thanks for the feedback!
>
> > 3) CFS-v4
> >
> > Feels even better, mouse movements are very smooth even under high
> > load. I noticed that X gets reniced to -19 with this scheduler. I've
> > not looked at the code yet but this looked suspicious to me. I've
> > reniced it to 0 and it did not change any behaviour. Still very
> > good. The 64 ocbench share equal CPU time and show exact same
> > progress after 2000 iterations. The CPU load is more smoothly spread
> > according to vmstat, and there's no idle (see below). BUT I now
> > think it was wrong to let new processes start with no timeslice at
> > all, because it can take tens of seconds to start a new process when
> > only 64 ocbench are there. [...]
>
> ok, i'll modify that portion and add back the 50%/50% parent/child CPU
> time sharing approach again. (which CFS had in -v1) That should not
> change the rest of your test and should improve the task startup
> characteristics.

If you remember, with 50/50, I noticed some difficulties to fork many processes. I think that during a fork(), the parent has a higher probability of forking other processes than the child. So at least, we should use something like 67/33 or 75/25 for parent/child. There are many shell-scripts out there doing a lot of fork(), and it should be reasonable to let them keep some CPU to continue to work.

Also, I believe that (in shells), most forked processes do not even consume a full timeslice (eg: $(uname -n) is very fast). This means that assigning them with a shorter one will not hurt them while preserving the shell's performance against CPU hogs.

Willy
Re: [REPORT] cfs-v4 vs sd-0.44
On Sat, Apr 21, 2007 at 06:00:08PM +0200, Ingo Molnar wrote:
>
> * Con Kolivas <[EMAIL PROTECTED]> wrote:
>
> > > Feels even better, mouse movements are very smooth even under high
> > > load. I noticed that X gets reniced to -19 with this scheduler.
> > > I've not looked at the code yet but this looked suspicious to me.
> > > I've reniced it to 0 and it did not change any behaviour. Still
> > > very good.
> >
> > Looks like this code does it:
> >
> > +int sysctl_sched_privileged_nice_level __read_mostly = -19;
>
> correct. Note that Willy reniced X back to 0 so it had no relevance on
> his test.

Anyway, my X was mostly unused (below 1% CPU), which was my intent when replacing glxgears by ocbench. We have not settled yet on how to handle the special case of X. Let's at least try to get the best schedulers without this problem, then see how to make them behave the best taking X into account.

> Also note that i pointed this change out in the -v4 CFS
> announcement:
>
> || Changes since -v3:
> ||
> || - usability fix: automatic renicing of kernel threads such as
> ||   keventd, OOM tasks and tasks doing privileged hardware access
> ||   (such as Xorg).
>
> i've attached it below in a standalone form, feel free to put it into
> SD! :)

Con, I think it could be a good idea, since you recommend to renice X with SD. Most of the problem users face with renicing X is that they need to change their configs or scripts. If the kernel can reliably detect X and handle it differently, why not do it?

It makes me think that this hint might be used to set some flags in the task struct in order to apply different processing than just renicing. It is indeed possible that nice is not the best solution and that something else would be even better (eg: longer timeslices, but not changing priority in the queues). Just an idea anyway.

OK, back to work ;-)

Willy
Re: [REPORT] cfs-v4 vs sd-0.44
* Con Kolivas <[EMAIL PROTECTED]> wrote: > > Feels even better, mouse movements are very smooth even under high > > load. I noticed that X gets reniced to -19 with this scheduler. > > I've not looked at the code yet but this looked suspicious to me. > > I've reniced it to 0 and it did not change any behaviour. Still > > very good. > > Looks like this code does it: > > +int sysctl_sched_privileged_nice_level __read_mostly = -19; correct. Note that Willy reniced X back to 0 so it had no relevance on his test. Also note that i pointed this change out in the -v4 CFS announcement: || Changes since -v3: || || - usability fix: automatic renicing of kernel threads such as ||keventd, OOM tasks and tasks doing privileged hardware access ||(such as Xorg). i've attached it below in a standalone form, feel free to put it into SD! :) Ingo --- arch/i386/kernel/ioport.c | 13 ++--- arch/x86_64/kernel/ioport.c |8 ++-- drivers/block/loop.c|5 - include/linux/sched.h |7 +++ kernel/sched.c | 40 kernel/workqueue.c |2 +- mm/oom_kill.c |4 +++- 7 files changed, 71 insertions(+), 8 deletions(-) Index: linux/arch/i386/kernel/ioport.c === --- linux.orig/arch/i386/kernel/ioport.c +++ linux/arch/i386/kernel/ioport.c @@ -64,9 +64,15 @@ asmlinkage long sys_ioperm(unsigned long if ((from + num <= from) || (from + num > IO_BITMAP_BITS)) return -EINVAL; - if (turn_on && !capable(CAP_SYS_RAWIO)) - return -EPERM; - + if (turn_on) { + if (!capable(CAP_SYS_RAWIO)) + return -EPERM; + /* +* Task will be accessing hardware IO ports, +* mark it as special with the scheduler too: +*/ + sched_privileged_task(current); + } /* * If it's the first ioperm() call in this thread's lifetime, set the * IO bitmap up. 
ioperm() is much less timing critical than clone(),

@@ -145,6 +151,7 @@ asmlinkage long sys_iopl(unsigned long u
 	if (level > old) {
 		if (!capable(CAP_SYS_RAWIO))
 			return -EPERM;
+		sched_privileged_task(current);
 	}
 	t->iopl = level << 12;
 	regs->eflags = (regs->eflags & ~X86_EFLAGS_IOPL) | t->iopl;

Index: linux/arch/x86_64/kernel/ioport.c
===================================================================
--- linux.orig/arch/x86_64/kernel/ioport.c
+++ linux/arch/x86_64/kernel/ioport.c
@@ -41,8 +41,11 @@ asmlinkage long sys_ioperm(unsigned long
 	if ((from + num <= from) || (from + num > IO_BITMAP_BITS))
 		return -EINVAL;
-	if (turn_on && !capable(CAP_SYS_RAWIO))
-		return -EPERM;
+	if (turn_on) {
+		if (!capable(CAP_SYS_RAWIO))
+			return -EPERM;
+		sched_privileged_task(current);
+	}

 	/*
 	 * If it's the first ioperm() call in this thread's lifetime, set the

@@ -113,6 +116,7 @@ asmlinkage long sys_iopl(unsigned int le
 	if (level > old) {
 		if (!capable(CAP_SYS_RAWIO))
 			return -EPERM;
+		sched_privileged_task(current);
 	}
 	regs->eflags = (regs->eflags &~ X86_EFLAGS_IOPL) | (level << 12);
 	return 0;

Index: linux/drivers/block/loop.c
===================================================================
--- linux.orig/drivers/block/loop.c
+++ linux/drivers/block/loop.c
@@ -588,7 +588,10 @@ static int loop_thread(void *data)
 	 */
 	current->flags |= PF_NOFREEZE;

-	set_user_nice(current, -20);
+	/*
+	 * The loop thread is important enough to be given a boost:
+	 */
+	sched_privileged_task(current);

 	while (!kthread_should_stop() || lo->lo_bio) {

Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -1256,6 +1256,13 @@ static inline int rt_mutex_getprio(struc
 #endif

 extern void set_user_nice(struct task_struct *p, long nice);
+
+/*
+ * Task has special privileges, give it more CPU power:
+ */
+extern void sched_privileged_task(struct task_struct *p);
+
+extern int sysctl_sched_privileged_nice_level;
+
 extern int task_prio(const struct task_struct *p);
 extern int task_nice(const struct task_struct *p);
 extern int can_nice(const struct task_struct *p, const int nice);

Index:
linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -3251,6 +3251,46 @@ out_unlock:
 EXPORT_SYMBOL(set_user_nice);

 /*
+ * Nice level for privileged tasks. (can be set to 0 for this
+ * to be turned off)
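From the fragments above, the semantics appear to be: sched_privileged_task() renices the marked task to sysctl_sched_privileged_nice_level, and setting the sysctl to 0 turns the feature off. A minimal userspace model of that logic, for illustration only -- the struct and function body are stand-ins, not the kernel implementation:

```c
#include <assert.h>

/* Stand-in for task_struct; only the nice level matters here. */
struct task { long nice; };

/* Default from the patch: privileged tasks get nice -19. */
static long sysctl_sched_privileged_nice_level = -19;

/*
 * Model of sched_privileged_task(): renice the task to the sysctl
 * value, unless the sysctl is 0, which disables the feature.
 */
static void sched_privileged_task(struct task *p)
{
	if (sysctl_sched_privileged_nice_level == 0)
		return;
	p->nice = sysctl_sched_privileged_nice_level;
}
```

This matches Willy's observation that X (which calls ioperm() for hardware access) ends up at nice -19 unless the sysctl is cleared.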
Re: [REPORT] cfs-v4 vs sd-0.44
On Saturday 21 April 2007 22:12, Willy Tarreau wrote:
> I promised to perform some tests on your code. I'm short in time right now,
> but I observed behaviours that should be commented on.

> Feels even better, mouse movements are very smooth even under high load.
> I noticed that X gets reniced to -19 with this scheduler. I've not looked
> at the code yet but this looked suspicious to me.

Looks like this code does it:

+int sysctl_sched_privileged_nice_level __read_mostly = -19;

It means anything that calls sched_privileged_task() one way or another
gets nice -19, and this is enabled by default.

--- linux-cfs-2.6.20.7.q.orig/arch/i386/kernel/ioport.c
+++ linux-cfs-2.6.20.7.q/arch/i386/kernel/ioport.c
+	if (turn_on) {
+		if (!capable(CAP_SYS_RAWIO))
+			return -EPERM;
+		/*
+		 * Task will be accessing hardware IO ports,
+		 * mark it as special with the scheduler too:
+		 */
+		sched_privileged_task(current);
+	}

presumably that selects out X as a privileged task... and sets it to
nice -19 by default.

-- 
-ck
Re: [REPORT] cfs-v4 vs sd-0.44
* Willy Tarreau <[EMAIL PROTECTED]> wrote:

> I promised to perform some tests on your code. I'm short in time right
> now, but I observed behaviours that should be commented on.

thanks for the feedback!

> 3) CFS-v4
>
> Feels even better, mouse movements are very smooth even under high
> load. I noticed that X gets reniced to -19 with this scheduler. I've
> not looked at the code yet but this looked suspicious to me. I've
> reniced it to 0 and it did not change any behaviour. Still very
> good. The 64 ocbench share equal CPU time and show exact same
> progress after 2000 iterations. The CPU load is more smoothly spread
> according to vmstat, and there's no idle (see below). BUT I now
> think it was wrong to let new processes start with no timeslice at
> all, because it can take tens of seconds to start a new process when
> only 64 ocbench are there. [...]

ok, i'll modify that portion and add back the 50%/50% parent/child CPU
time sharing approach again (which CFS had in -v1). That should not
change the rest of your test and should improve the task startup
characteristics.

	Ingo
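The 50%/50% parent/child sharing Ingo mentions can be sketched as a toy model: on fork, the child inherits half of the parent's remaining slice rather than starting with none. The names and the millisecond unit below are illustrative assumptions, not CFS code:

```c
/*
 * Toy model of 50%/50% parent/child timeslice sharing on fork.
 * "slice_ms" is an illustrative stand-in for the parent's remaining
 * CPU entitlement; CFS itself tracks time differently.
 */
struct toy_task { int slice_ms; };

static struct toy_task toy_fork(struct toy_task *parent)
{
	/* the child gets half; the parent keeps the other half */
	struct toy_task child = { .slice_ms = parent->slice_ms / 2 };

	parent->slice_ms -= child.slice_ms;
	return child;
}
```

The point of the fix is visible in the model: a freshly forked task starts with a non-zero entitlement, so "su -" or "killall" no longer waits behind 64 ocbench processes for its first slice.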
Re: [REPORT] cfs-v4 vs sd-0.44
On Sat, Apr 21, 2007 at 10:40:18PM +1000, Con Kolivas wrote:
> On Saturday 21 April 2007 22:12, Willy Tarreau wrote:
> > Hi Ingo, Hi Con,
> >
> > I promised to perform some tests on your code. I'm short in time right now,
> > but I observed behaviours that should be commented on.
> >
> > 1) machine : dual athlon 1533 MHz, 1G RAM, kernel 2.6.21-rc7 + either
> >    scheduler
> >    Test: ./ocbench -R 25 -S 75 -x 8 -y 8
> >    ocbench: http://linux.1wt.eu/sched/
> >
> > 2) SD-0.44
> >
> >    Feels good, but becomes jerky at moderately high loads. I've started
> >    64 ocbench with a 250 ms busy loop and 750 ms sleep time. The system
> >    always responds correctly but under X, mouse jumps quite a bit and
> >    typing in xterm or even text console feels slightly jerky. The CPU is
> >    not completely used, and the load varies a lot (see below). However,
> >    the load is shared equally between all 64 ocbench, and they do not
> >    deviate even after 4000 iterations. X uses less than 1% CPU during
> >    those tests.
> >
> >    Here's the vmstat output :
> [snip]
>
> > 3) CFS-v4
> >
> > Feels even better, mouse movements are very smooth even under high load.
> > I noticed that X gets reniced to -19 with this scheduler. I've not looked
> > at the code yet but this looked suspicious to me. I've reniced it to 0
> > and it did not change any behaviour. Still very good. The 64 ocbench share
> > equal CPU time and show exact same progress after 2000 iterations. The CPU
> > load is more smoothly spread according to vmstat, and there's no idle (see
> > below). BUT I now think it was wrong to let new processes start with no
> > timeslice at all, because it can take tens of seconds to start a new
> > process when only 64 ocbench are there. Simply starting "killall ocbench"
> > takes about 10 seconds. On a smaller machine (VIA C3-533), it took me more
> > than one minute to do "su -", even from console, so that's not X. BTW, X
> > uses less than 1% CPU during those tests.
> >
> > [EMAIL PROTECTED]:~$ vmstat 1
> [snip]
>
> > 4) first impressions
> >
> > I think that CFS is based on a more promising concept but is less mature
> > and is dangerous right now with certain workloads. SD shows some strange
> > behaviours like not using all available CPU and a little jerkiness, but
> > is more robust, and may be the less risky solution for a first step
> > towards a better scheduler in mainline. But it may also be the last O(1)
> > scheduler, to be replaced sometime later when CFS (or any other one)
> > shows at the same time the smoothness of CFS and the robustness of SD.
>
> I assumed from your description that you were running X at nice 0 during
> all this testing

Yes, that's what I did.

> and left the tunables from both SD and CFS at their defaults;

Yes too, because I don't have enough time to try many combinations this
week-end.

> this tends to have the effective equivalent of "timeslice" in CFS smaller
> than SD.

If you look at the CS column in vmstat, you'll see that there are about
twice as many context switches with CFS as with SD, meaning the average
timeslice would be about half as long with CFS. But my impression is that
some tasks occasionally get very long timeslices with SD while this never
happens with CFS, hence the very-smooth versus jerky feeling, which cannot
be explained by halved timeslices alone.

> > I'm sorry not to spend more time on them right now, I hope that other
> > people will do.
>
> Thanks for that interesting testing you've done. The fluctuating cpu load
> and the apparently high idle time mean there is almost certainly a bug
> still in the cpu accounting I do in update_cpu_clock. It looks suspicious
> to me already on just my first glance. Fortunately the throughput does
> not appear to be adversely affected on other benchmarks, so I suspect
> it's lying about the idle time and it's not really there. Which means
> it's likely also accounting the cpu time wrongly.
It is possible that only the measurement is wrong, because the time was
evenly distributed among the 64 processes. Maybe the fix could also
prevent some tasks from occasionally stealing one slice, and reduce the
jerkiness feeling. Anyway, it's just a bit jerky; no more freezes as
we've known for years ;-)

> Which also means there's something I can fix to improve SD further.
> Great stuff, thanks!

You're welcome!

Cheers,
Willy
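Willy's context-switch arithmetic above is easy to make explicit: with the CPUs saturated, the average timeslice is roughly the busy CPU time per second divided by the context-switch rate, so doubling the cs column halves the slice. A small sketch; the figures used below are illustrative, not taken from the vmstat logs in this thread:

```c
/*
 * Rough estimate of the average timeslice from vmstat-style numbers:
 * busy CPU-milliseconds per second divided by context switches per
 * second.  Inputs are illustrative, not measurements.
 */
static double avg_timeslice_ms(int ncpus, double busy_frac, double cs_per_sec)
{
	return 1000.0 * ncpus * busy_frac / cs_per_sec;
}
```

With a saturated dual-CPU box, 1000 cs/s gives about a 2 ms average slice and 2000 cs/s about 1 ms, which is the "twice as many context switches, half the timeslice" relationship.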