Re: [patch] CFS scheduler, v3
On Tue, Apr 24, 2007 at 10:55:45AM -0700, Christoph Lameter wrote:
> On Tue, 24 Apr 2007, Siddha, Suresh B wrote:
> > yes, we were planning to move this to a different percpu section, where all the elements in this new section will be cacheline aligned (both at the start, as well as the end)
>
> I would not call this a per cpu area. It is used by multiple cpus it seems.

Not decided about the name yet, but the area is allocated per cpu and, yes, the data can be referenced by other cpus.

> But for 0.5%? on what benchmark? Is it really worth it?

The famous database workload :) I don't think the new section will be added just for this 0.5%. This is a straightforward optimization, and anyone can plug into this new section in the future.

thanks,
suresh
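A minimal sketch of the "separate, fully cacheline-aligned percpu section" idea described above. The macro and section names below are invented for illustration (the thread has not settled on a name), and this is not the actual patch:

	/*
	 * Hypothetical illustration only: put selected per-cpu objects in a
	 * dedicated section in which every object is cacheline aligned, so a
	 * remote cpu writing one object never dirties a line it shares with
	 * a neighbouring per-cpu object.
	 */
	#include <linux/cache.h>
	#include <linux/percpu.h>

	#define DEFINE_PER_CPU_CACHE_ALIGNED(type, name)			\
		__attribute__((__section__(".data.percpu.shared_aligned")))	\
		__typeof__(type) per_cpu__##name ____cacheline_aligned_in_smp

	/* placeholder type for data that other cpus write during remote wakeups */
	struct wakeup_hot_data {
		unsigned long remote_wakeups;
	} ____cacheline_aligned_in_smp;	/* type padded out to a full cacheline */

	DEFINE_PER_CPU_CACHE_ALIGNED(struct wakeup_hot_data, wakeup_hot);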
Re: [patch] CFS scheduler, v3
On Tue, 24 Apr 2007, Siddha, Suresh B wrote:
> On Tue, Apr 24, 2007 at 10:47:45AM -0700, Christoph Lameter wrote:
> > On Tue, 24 Apr 2007, Siddha, Suresh B wrote:
> > > Anyhow, this is a straightforward optimization and needs to be done. Do you have any specific concerns?
> >
> > Yes, there should not be contention on per cpu data in principle. The point of per cpu data is for the cpu to have access to contention free cachelines.
> >
> > If the data is contended then it should be moved out of per cpu data and properly placed to minimize contention. Otherwise we will get into cacheline aliases (__read_mostly in per cpu??) etc etc in the per cpu areas.
>
> yes, we were planning to move this to a different percpu section, where all the elements in this new section will be cacheline aligned (both at the start, as well as the end)

I would not call this a per cpu area. It is used by multiple cpus, it seems.

But for 0.5%? On what benchmark? Is it really worth it?
Re: [patch] CFS scheduler, v3
On Tue, Apr 24, 2007 at 10:47:45AM -0700, Christoph Lameter wrote:
> On Tue, 24 Apr 2007, Siddha, Suresh B wrote:
> > Anyhow, this is a straightforward optimization and needs to be done. Do you have any specific concerns?
>
> Yes, there should not be contention on per cpu data in principle. The point of per cpu data is for the cpu to have access to contention free cachelines.
>
> If the data is contended then it should be moved out of per cpu data and properly placed to minimize contention. Otherwise we will get into cacheline aliases (__read_mostly in per cpu??) etc etc in the per cpu areas.

Yes, we were planning to move this to a different percpu section, where all the elements in this new section will be cacheline aligned (both at the start, as well as the end).

thanks,
suresh
Re: [patch] CFS scheduler, v3
On Tue, 24 Apr 2007, Siddha, Suresh B wrote:
> > 0.5% is usually in the noise ratio. Are you consistently seeing an improvement or is that sporadic?
>
> No. This is consistent. I am waiting for the perf data on a much, much bigger NUMA box.
>
> Anyhow, this is a straightforward optimization and needs to be done. Do you have any specific concerns?

Yes, there should not be contention on per cpu data in principle. The point of per cpu data is for the cpu to have access to contention free cachelines.

If the data is contended then it should be moved out of per cpu data and properly placed to minimize contention. Otherwise we will get into cacheline aliases (__read_mostly in per cpu??) etc etc in the per cpu areas.
Re: [patch] CFS scheduler, v3
On Tue, Apr 24, 2007 at 10:39:48AM -0700, Christoph Lameter wrote:
> On Fri, 20 Apr 2007, Siddha, Suresh B wrote:
> > > Last I checked it was workload-dependent, but there were things that hammer it. I mostly know of the remote wakeup issue, but there could be other things besides wakeups that do it, too.
> >
> > remote wakeup was the main issue and the 0.5% improvement was seen on a two node platform. Aligning it reduces the number of remote cachelines that need to be touched as part of this wakeup.
>
> 0.5% is usually in the noise ratio. Are you consistently seeing an improvement or is that sporadic?

No. This is consistent. I am waiting for the perf data on a much, much bigger NUMA box.

Anyhow, this is a straightforward optimization and needs to be done. Do you have any specific concerns?

thanks,
suresh
Re: [patch] CFS scheduler, v3
On Fri, 20 Apr 2007, Siddha, Suresh B wrote:
> > Last I checked it was workload-dependent, but there were things that hammer it. I mostly know of the remote wakeup issue, but there could be other things besides wakeups that do it, too.
>
> remote wakeup was the main issue and the 0.5% improvement was seen on a two node platform. Aligning it reduces the number of remote cachelines that need to be touched as part of this wakeup.

0.5% is usually in the noise ratio. Are you consistently seeing an improvement or is that sporadic?
Re: [patch] CFS scheduler, v3
On Fri, Apr 20, 2007 at 12:24:17PM -0700, Christoph Lameter wrote:
> On Wed, 18 Apr 2007, William Lee Irwin III wrote:
> > > > Mark the runqueues cacheline_aligned_in_smp to avoid false sharing.
>
> False sharing for a per cpu data structure? Are we updating that structure from other processors?

Yes, during a remote process wakeup.
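For context, the change being debated is essentially a one-line annotation on the runqueue. A rough sketch of its shape (the real patch may differ in detail):

	/*
	 * Sketch: align the per-cpu runqueue to a cacheline boundary so the
	 * fields written by remote cpus during wakeups do not false-share
	 * with whatever happens to sit next to the runqueue in the per-cpu
	 * area.
	 */
	#include <linux/cache.h>
	#include <linux/percpu.h>
	#include <linux/spinlock.h>

	struct rq {
		spinlock_t	lock;
		unsigned long	nr_running;
		/* ... */
	} ____cacheline_aligned_in_smp;

	static DEFINE_PER_CPU(struct rq, runqueues);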
Re: [patch] CFS scheduler, v3
On Fri, Apr 20, 2007 at 01:03:22PM -0700, William Lee Irwin III wrote:
> On Fri, 20 Apr 2007, William Lee Irwin III wrote:
> >> I'm not really convinced it's all that worthwhile of an optimization, essentially for the same reasons as you, but presumably there's a benchmark result somewhere that says it matters. I've just not seen it.
>
> On Fri, Apr 20, 2007 at 12:44:55PM -0700, Christoph Lameter wrote:
> > If it is true that we frequently remotely write the per cpu runqueue data then we may have a NUMA scalability issue.
>
> From the discussion on Suresh's thread, it appears to have sped up a database benchmark 0.5%.
>
> Last I checked it was workload-dependent, but there were things that hammer it. I mostly know of the remote wakeup issue, but there could be other things besides wakeups that do it, too.

Remote wakeup was the main issue, and the 0.5% improvement was seen on a two node platform. Aligning it reduces the number of remote cachelines that need to be touched as part of this wakeup.

thanks,
suresh
Re: [patch] CFS scheduler, v3
* William Lee Irwin III <[EMAIL PROTECTED]> wrote:
>> Suppose a table of nice weights like the following is tuned via /proc/:
>>
>> -20	21	0	1
>> -1	2	19	0.0476
>>
>> Essentially 1/(n+1) when n >= 0 and 1-n when n < 0.

On Sat, Apr 21, 2007 at 10:57:29AM +0200, Ingo Molnar wrote:
> ok, thanks for thinking about it. I have changed the nice weight in CFSv5-to-be so that it defaults to something pretty close to your suggestion: the ratio between a nice 0 loop and a nice 19 loop is now set to about 2%. (This is something that users requested for some time; the default ~5% is a tad high when running reniced SETI jobs, etc.)

Okay. Maybe what I suggested is too weak vs. too strong. I didn't actually have it in mind as a proposal for general use, but maybe it is good for such. I had more in mind tunability in general, but it's all good. I'd think some curve gentler in intermediate nice levels and stronger at the tails might be better.

On Sat, Apr 21, 2007 at 10:57:29AM +0200, Ingo Molnar wrote:
> the actual percentage scales almost directly with the nice offset granularity value, but if this should be exposed to users at all, I agree that it would be better to directly expose this as some sort of 'ratio between nice 0 and nice 19 tasks', right? Or some other, more fine-grained metric. Percentile is too coarse I think, and using 0.1% units isn't intuitive enough. The sysctl handler would then transform that 'human readable' sysctl value into the appropriate internal nice-offset-granularity value (or whatever mechanism the implementation ends up using).

I vaguely liked specifying the full table, but maybe it's too much for a real user interface. 4-digit or 5-digit fixed point decimal sounds reasonable.

On Sat, Apr 21, 2007 at 10:57:29AM +0200, Ingo Molnar wrote:
> I'd not do this as a per-nice-level thing but as a single value that rescales the whole nice level range at once. That's a lot less easy to misconfigure and we've got enough nice levels for users to pick from almost arbitrarily, as long as they have the ability to influence the max.
>
> does this sound mostly OK to you?

For the most part, yes. I've been mostly looking at how effectively the prioritization algorithms work. I'll be wrapping up writing a testcase to measure all this soon. The basic idea is to take the weights as inputs somehow and then check to see that they're honored. What's appropriate for end-users is a very different thing from what might be appropriate for me. I won't have trouble fiddling with the code, so please do design around what the best interface for end-users might be.

-- wli
Re: [patch] CFS scheduler, v3
Peter Williams wrote:
> Ingo Molnar wrote:
> > * Peter Williams <[EMAIL PROTECTED]> wrote:
> > > I retract this suggestion as it's a very bad idea. It introduces the possibility of starvation via the poor sods at the bottom of the queue having their "on CPU" forever postponed and we all know that even the smallest possibility of starvation will eventually cause problems.
> > >
> > > I think there should be a rule: Once a task is on the queue its "on CPU" time is immutable.
> >
> > Yeah, fully agreed. Currently I'm using the simple method of p->nice_offset, which plainly just moves the per nice level areas of the tree far enough apart (by a constant offset) so that lower nice levels rarely interact with higher nice levels. Lower nice levels never truly starve because rq->fair_clock increases deterministically and currently the fair_key values are indeed 'immutable' as you suggest.
> >
> > In practice they can starve a bit when one renices thousands of tasks, so I was thinking about the following special-case to at least make them easily killable: if a nice 0 task sends a SIGKILL to a nice 19 task then we could 'share' its p->wait_runtime with that nice 19 task and copy the signal sender's nice_offset. This would in essence pass the right to execute over to the killed task, so that it can tear itself down.
> >
> > This cannot be used to gain an 'unfair advantage' because the signal sender spends its own 'right to execute on the CPU', and because the target task cannot execute any user code anymore when it gets a SIGKILL.
> >
> > In any case, it is clear that rq->raw_cpu_load should be used instead of rq->nr_running when calculating the fair clock, but I begin to like the nice_offset solution too in addition to this: it's effective in practice and starvation-free in theory, and most importantly, it's very simple. We could even make the nice offset granularity tunable, just in case anyone wants to weaken (or strengthen) the effectiveness of nice levels.
> >
> > What do you think, can you see any obvious (or less obvious) showstoppers with this approach?
>
> I haven't had a close look at it but from the above description it sounds an order of magnitude more complex than I thought it would be. The idea of different nice levels sounds like a recipe for starvation to me (if it works the way it sounds like it works). I guess I'll have to spend more time reading the code because I don't seem to be able to make sense of the above description in any way that doesn't say "starvation here we come".

I'm finding it hard to figure out the underlying principle behind the way you're queuing things by reading the code (that's the trouble with reading code: it just tells you what's being done, not why -- and sometimes it's even hard to figure out what's being done when there's a lot of indirection). sched-design-CFS.txt isn't much help in this area either.

Any chance of a brief description of how it's supposed to work? Key questions are:

How do you decide the key value for a task's position in the queue? Is it an absolute time or an offset from the current time?

How do you decide when to boot the current task off the queue? Both at wake up of another task and in general play?

Peter

PS I think that you're trying to do too much in one patch.
--
Peter Williams  [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
   -- Ambrose Bierce
Re: [patch] CFS scheduler, v3
Ingo Molnar wrote:
> * Peter Williams <[EMAIL PROTECTED]> wrote:
> > I retract this suggestion as it's a very bad idea. It introduces the possibility of starvation via the poor sods at the bottom of the queue having their "on CPU" forever postponed and we all know that even the smallest possibility of starvation will eventually cause problems.
> >
> > I think there should be a rule: Once a task is on the queue its "on CPU" time is immutable.
>
> Yeah, fully agreed. Currently I'm using the simple method of p->nice_offset, which plainly just moves the per nice level areas of the tree far enough apart (by a constant offset) so that lower nice levels rarely interact with higher nice levels. Lower nice levels never truly starve because rq->fair_clock increases deterministically and currently the fair_key values are indeed 'immutable' as you suggest.
>
> In practice they can starve a bit when one renices thousands of tasks, so I was thinking about the following special-case to at least make them easily killable: if a nice 0 task sends a SIGKILL to a nice 19 task then we could 'share' its p->wait_runtime with that nice 19 task and copy the signal sender's nice_offset. This would in essence pass the right to execute over to the killed task, so that it can tear itself down.
>
> This cannot be used to gain an 'unfair advantage' because the signal sender spends its own 'right to execute on the CPU', and because the target task cannot execute any user code anymore when it gets a SIGKILL.
>
> In any case, it is clear that rq->raw_cpu_load should be used instead of rq->nr_running when calculating the fair clock, but I begin to like the nice_offset solution too in addition to this: it's effective in practice and starvation-free in theory, and most importantly, it's very simple. We could even make the nice offset granularity tunable, just in case anyone wants to weaken (or strengthen) the effectiveness of nice levels.
>
> What do you think, can you see any obvious (or less obvious) showstoppers with this approach?

I haven't had a close look at it but from the above description it sounds an order of magnitude more complex than I thought it would be. The idea of different nice levels sounds like a recipe for starvation to me (if it works the way it sounds like it works). I guess I'll have to spend more time reading the code because I don't seem to be able to make sense of the above description in any way that doesn't say "starvation here we come".

Peter
--
Peter Williams  [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
   -- Ambrose Bierce
Re: [patch] CFS scheduler, v3
* William Lee Irwin III <[EMAIL PROTECTED]> wrote:
> I suppose this is a special case of the dreaded priority inversion. What of, say, nice 19 tasks holding fs semaphores and/or mutexes that nice -19 tasks are waiting to acquire? Perhaps rt_mutex should be the default mutex implementation.

While I agree that it could be an issue, lock inversion is nothing really new, so I'd not go _that_ drastic as to convert all mutexes to rtmutexes. (I've taken my -rt/PREEMPT_RT hat off.)

For example, reiser3 based systems get pretty laggy under significant reniced load (even with the vanilla scheduler) if CONFIG_PREEMPT_BKL is enabled: reiser3 holds the BKL for extended periods of time, so a "make -j50" workload can starve it significantly, and the tty layer's BKL use makes any sort of keyboard (even over ssh) input laggy.

Other locks though are not held this frequently, and the mutex implementation is pretty fair to waiters anyway. (The semaphore implementation is not nearly as fair, and the Big Kernel Semaphore is still struct semaphore based.) So I'd really wait for specific workloads to trigger problems, and _maybe_ convert certain mutexes to rtmutexes, on an as-needed basis.

> > In any case, it is clear that rq->raw_cpu_load should be used instead of rq->nr_running when calculating the fair clock, but I begin to like the nice_offset solution too in addition to this: it's effective in practice and starvation-free in theory, and most importantly, it's very simple. We could even make the nice offset granularity tunable, just in case anyone wants to weaken (or strengthen) the effectiveness of nice levels. What do you think, can you see any obvious (or less obvious) showstoppers with this approach?
>
> ->nice_offset's semantics are not meaningful to the end user, regardless of whether it's effective. [...]

yeah, agreed. That's one reason why I didn't make it tunable, it's pretty meaningless to the user.

> [...] If there is something to be tuned, it should be relative shares of CPU bandwidth (load_weight) corresponding to each nice level or something else directly observable. The implementation could be ->nice_offset, if it suffices.
>
> Suppose a table of nice weights like the following is tuned via /proc/:
>
> -20	21	0	1
> -1	2	19	0.0476
>
> Essentially 1/(n+1) when n >= 0 and 1-n when n < 0.

ok, thanks for thinking about it. I have changed the nice weight in CFSv5-to-be so that it defaults to something pretty close to your suggestion: the ratio between a nice 0 loop and a nice 19 loop is now set to about 2%. (This is something that users requested for some time; the default ~5% is a tad high when running reniced SETI jobs, etc.)

the actual percentage scales almost directly with the nice offset granularity value, but if this should be exposed to users at all, I agree that it would be better to directly expose this as some sort of 'ratio between nice 0 and nice 19 tasks', right? Or some other, more fine-grained metric. Percentile is too coarse I think, and using 0.1% units isn't intuitive enough. The sysctl handler would then transform that 'human readable' sysctl value into the appropriate internal nice-offset-granularity value (or whatever mechanism the implementation ends up using).

I'd not do this as a per-nice-level thing but as a single value that rescales the whole nice level range at once. That's a lot less easy to misconfigure and we've got enough nice levels for users to pick from almost arbitrarily, as long as they have the ability to influence the max.
does this sound mostly OK to you?

	Ingo
Re: [patch] CFS scheduler, v3
On Sat, Apr 21, 2007 at 09:54:01AM +0200, Ingo Molnar wrote:
> In practice they can starve a bit when one renices thousands of tasks, so I was thinking about the following special-case to at least make them easily killable: if a nice 0 task sends a SIGKILL to a nice 19 task then we could 'share' its p->wait_runtime with that nice 19 task and copy the signal sender's nice_offset. This would in essence pass the right to execute over to the killed task, so that it can tear itself down.
>
> This cannot be used to gain an 'unfair advantage' because the signal sender spends its own 'right to execute on the CPU', and because the target task cannot execute any user code anymore when it gets a SIGKILL.

I suppose this is a special case of the dreaded priority inversion. What of, say, nice 19 tasks holding fs semaphores and/or mutexes that nice -19 tasks are waiting to acquire? Perhaps rt_mutex should be the default mutex implementation.

On Sat, Apr 21, 2007 at 09:54:01AM +0200, Ingo Molnar wrote:
> In any case, it is clear that rq->raw_cpu_load should be used instead of rq->nr_running when calculating the fair clock, but I begin to like the nice_offset solution too in addition to this: it's effective in practice and starvation-free in theory, and most importantly, it's very simple. We could even make the nice offset granularity tunable, just in case anyone wants to weaken (or strengthen) the effectiveness of nice levels. What do you think, can you see any obvious (or less obvious) showstoppers with this approach?

->nice_offset's semantics are not meaningful to the end user, regardless of whether it's effective. If there is something to be tuned, it should be relative shares of CPU bandwidth (load_weight) corresponding to each nice level or something else directly observable. The implementation could be ->nice_offset, if it suffices.

Suppose a table of nice weights like the following is tuned via /proc/:

-20	21	0	1
-19	20	1	0.5
-18	19	2	0.3333
-17	18	3	0.25
-16	17	4	0.2
-15	16	5	0.1666
-14	15	6	0.1429
-13	14	7	0.125
-12	13	8	0.1111
-11	12	9	0.1
-10	11	10	0.0909
-9	10	11	0.0833
-8	9	12	0.0769
-7	8	13	0.0714
-6	7	14	0.0666
-5	6	15	0.0625
-4	5	16	0.0588
-3	4	17	0.0555
-2	3	18	0.0526
-1	2	19	0.0476

Essentially 1/(n+1) when n >= 0 and 1-n when n < 0.

Now suppose two CPU hogs, one being nice 1 and the other nice -1, compete against each other. The nice 1 hog should receive 0.5/(2+0.5) = 20% of the CPU, and the nice -1 hog should receive 80% of the CPU. A nice -5 hog vs. a nice 0 hog should have the nice 0 hog receive 1/(6+1) = 14.29% of the CPU and the nice -5 hog receive 85.71% of the CPU. Three hogs, one nice -1, one nice 0, and one nice 1, should have the nice -1 hog receive 2/(2+1+0.5) = 57.14% of the CPU, the nice 0 hog receive 1/(2+1+0.5) = 28.57% of the CPU, and the nice 1 hog receive 0.5/(2+1+0.5) = 14.29% of the CPU. And so on. (Note that these are very strong nice semantics, probably too strong for real use.)

For five hogs with nice numbers ranging from -2 to 2, we'd get:

-2	43.90%
-1	29.27%
 0	14.63%
 1	 7.32%
 2	 4.88%

However this is implemented doesn't matter so long as the semantics are honored. I wouldn't specify too much beyond competing tasks with equivalent sleep/wakeup behavior; one may wish to apply bonuses and penalties to differing sleep/wakeup behavior in the future, even if such is not now done. Basically, the effect of nice numbers should be predictable somehow, and frankly, making it specifiable by the user resolves the issue of conflicting requirements.
-- wli
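A quick way to sanity-check the percentages above: a small userspace program (not part of the thread) that applies the 1/(n+1) and 1-n weights and prints the shares for the five-hog example.

	/*
	 * Prints the CPU share each of five hogs (nice -2 .. 2) would get
	 * under the weight table above: 43.90%, 29.27%, 14.63%, 7.32%, 4.88%.
	 */
	#include <stdio.h>

	static double nice_weight(int nice)
	{
		return nice < 0 ? 1.0 - nice : 1.0 / (nice + 1);
	}

	int main(void)
	{
		int nice[] = { -2, -1, 0, 1, 2 };
		double sum = 0.0;
		int i;

		for (i = 0; i < 5; i++)
			sum += nice_weight(nice[i]);
		for (i = 0; i < 5; i++)
			printf("nice %3d: %6.2f%%\n", nice[i],
			       100.0 * nice_weight(nice[i]) / sum);
		return 0;
	}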
Re: [patch] CFS scheduler, v3
Peter Williams wrote:

Peter Williams wrote:

Ingo Molnar wrote:

your suggestion concentrates on the following scenario: if a task happens to schedule in an 'unlucky' way and happens to hit a busy period while there are many idle periods. Unless I misunderstood your suggestion, that is the main intention behind it, correct?

You misunderstand (that's one of my other schedulers :-)). This one's based on the premise that if everything happens as the task expects, it will get the amount of CPU bandwidth (over this short period) that it's entitled to. In reality, sometimes it will get more and sometimes less, but on average it should get what it deserves.

E.g. if you had two tasks with equal nice and both had demands of 90% of a CPU, you'd expect them each to get about half of the CPU bandwidth. Now suppose that one of them uses 5ms of CPU each time it gets onto the CPU and the other uses 10ms. If these two tasks just round robin with each other, the likely outcome is that the one with the 10ms bursts will get twice as much CPU as the other, but my proposed method should prevent this and cause them to get roughly the same amount of CPU. (I believe this was a scenario that caused problems with O(1) and required a fix at some stage?)

Another advantage of this mechanism is that, all else being equal, it will tend to run tasks that use short bursts of CPU ahead of those that use long bursts, and this tends to reduce the overall time spent waiting for CPU by all tasks on the system, which is good for throughput. I.e. in general, a task that tends to use short bursts of CPU will make other tasks wait less time than will one that tends to use long bursts.

So this means that you were right and it is good in the scenario that you suggested, even though that wasn't the motivation behind the design. This means that this scheduler should be good for improving latency on servers that aren't fully loaded as well as providing good fairness and responsiveness when the system is fully loaded. (Fingers crossed.)

Peter
--
Peter Williams  [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
   -- Ambrose Bierce
Re: [patch] CFS scheduler, v3
* Peter Williams <[EMAIL PROTECTED]> wrote:
> I retract this suggestion as it's a very bad idea. It introduces the possibility of starvation via the poor sods at the bottom of the queue having their "on CPU" forever postponed and we all know that even the smallest possibility of starvation will eventually cause problems.
>
> I think there should be a rule: Once a task is on the queue its "on CPU" time is immutable.

Yeah, fully agreed. Currently I'm using the simple method of p->nice_offset, which plainly just moves the per nice level areas of the tree far enough apart (by a constant offset) so that lower nice levels rarely interact with higher nice levels. Lower nice levels never truly starve because rq->fair_clock increases deterministically and currently the fair_key values are indeed 'immutable' as you suggest.

In practice they can starve a bit when one renices thousands of tasks, so I was thinking about the following special-case to at least make them easily killable: if a nice 0 task sends a SIGKILL to a nice 19 task then we could 'share' its p->wait_runtime with that nice 19 task and copy the signal sender's nice_offset. This would in essence pass the right to execute over to the killed task, so that it can tear itself down.

This cannot be used to gain an 'unfair advantage' because the signal sender spends its own 'right to execute on the CPU', and because the target task cannot execute any user code anymore when it gets a SIGKILL.

In any case, it is clear that rq->raw_cpu_load should be used instead of rq->nr_running when calculating the fair clock, but I begin to like the nice_offset solution too in addition to this: it's effective in practice and starvation-free in theory, and most importantly, it's very simple. We could even make the nice offset granularity tunable, just in case anyone wants to weaken (or strengthen) the effectiveness of nice levels.

What do you think, can you see any obvious (or less obvious) showstoppers with this approach?

	Ingo
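To make the scheme above concrete, here is one possible reading of it in code form, with invented names (this is not the actual CFS v3 implementation; task_fair_key() and the granularity parameter are illustrative assumptions built from the fields named in the thread):

	/*
	 * Sketch only: a task's queue key is the fair clock minus the runtime
	 * it is owed, shifted by a constant per-nice-level offset.  The offset
	 * keeps the nice-level "bands" of the tree apart, while the steadily
	 * advancing fair clock guarantees that even the most-niced band is
	 * eventually reached, so nothing starves outright.
	 */
	#include <linux/sched.h>

	static inline s64 task_fair_key(const struct task_struct *p,
					s64 fair_clock, s64 granularity)
	{
		int nice = PRIO_TO_NICE(p->static_prio);	/* -20 .. 19 */

		/* immutable once the task has been enqueued */
		return fair_clock - p->wait_runtime + (s64)nice * granularity;
	}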
Re: [patch] CFS scheduler, v3
Peter Williams wrote:

William Lee Irwin III wrote:

William Lee Irwin III wrote:

On Sat, Apr 21, 2007 at 10:23:07AM +1000, Peter Williams wrote:

If some form of precise timer was used (instead) to trigger pre-emption then, where there is more than one task with the same expected "on CPU" time, only the last would get any CPU (and that's only the case if some infinite loop doesn't get created).

I think you can check for this and adjust their on-cpu times in various ways and choose what order to run them in. I think it'd end up something of a missed deadline policy.

Arguably the latest one should go first as its estimate is more likely to be correct, being based on more recent info, i.e. the other's estimates would be based on less runnable tasks and be optimistic. This would involve pushing them down the queue, and pushing items already on the queue downwards is likely to be expensive. Which is why I prefer the less elegant solution of the periodical interrupt.

Of course, one solution might be to just add the adjustment to the key time on all tasks on the queue?

I retract this suggestion as it's a very bad idea. It introduces the possibility of starvation via the poor sods at the bottom of the queue having their "on CPU" forever postponed and we all know that even the smallest possibility of starvation will eventually cause problems.

I think there should be a rule: Once a task is on the queue its "on CPU" time is immutable.

Peter
--
Peter Williams  [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
   -- Ambrose Bierce
Re: [patch] CFS scheduler, v3
Peter Williams wrote: William Lee Irwin III wrote: William Lee Irwin III wrote: On Sat, Apr 21, 2007 at 10:23:07AM +1000, Peter Williams wrote: If some form of precise timer was used (instead) to trigger pre-emption then, where there is more than one task with the same expected on CPU time, only the last would get any CPU (and that's only the case if some infinite loop doesn't get created). I think you can check for this and adjust their on-cpu times in various ways and choose what order to run them in. I think it'd end up something of a missed deadline policy. Arguably the latest one should go first as its estimate is more likely to be correct being based on more recent info i.e. the other's estimates would be based on less runnable tasks and be optimistic. This would involve pushing them down the queue and pushing items already on the queue downwards is likely to be expensive. Which is why I prefer the less elegant solution of the periodical interrupt. Of course, one solution might be to just add the adjustment to the key time on all tasks on the queue? I retract this suggestion as it's a very bad idea. It introduces the possibility of starvation via the poor sods at the bottom of the queue having their on CPU forever postponed and we all know that even the smallest possibility of starvation will eventually cause problems. I think there should be a rule: Once a task is on the queue its on CPU time is immutable. Peter -- Peter Williams [EMAIL PROTECTED] Learning, n. The kind of ignorance distinguishing the studious. -- Ambrose Bierce - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] CFS scheduler, v3
* Peter Williams [EMAIL PROTECTED] wrote: I retract this suggestion as it's a very bad idea. It introduces the possibility of starvation via the poor sods at the bottom of the queue having their on CPU forever postponed and we all know that even the smallest possibility of starvation will eventually cause problems. I think there should be a rule: Once a task is on the queue its on CPU time is immutable. Yeah, fully agreed. Currently i'm using the simple method of p-nice_offset, which plainly just moves the per nice level areas of the tree far enough apart (by a constant offset) so that lower nice levels rarely interact with higher nice levels. Lower nice levels never truly starve because rq-fair_clock increases deterministically and currently the fair_key values are indeed 'immutable' as you suggest. In practice they can starve a bit when one renices thousands of tasks, so i was thinking about the following special-case: to at least make them easily killable: if a nice 0 task sends a SIGKILL to a nice 19 task then we could 'share' its p-wait_runtime with that nice 19 task and copy the signal sender's nice_offset. This would in essence pass the right to execute over to the killed task, so that it can tear itself down. This cannot be used to gain an 'unfair advantage' because the signal sender spends its own 'right to execute on the CPU', and because the target task cannot execute any user code anymore when it gets a SIGKILL. In any case, it is clear that rq-raw_cpu_load should be used instead of rq-nr_running, when calculating the fair clock, but i begin to like the nice_offset solution too in addition of this: it's effective in practice and starvation-free in theory, and most importantly, it's very simple. We could even make the nice offset granularity tunable, just in case anyone wants to weaken (or strengthen) the effectivity of nice levels. What do you think, can you see any obvious (or less obvious) showstoppers with this approach? Ingo - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] CFS scheduler, v3
Peter Williams wrote: Peter Williams wrote: Ingo Molnar wrote: your suggestion concentrates on the following scenario: if a task happens to schedule in an 'unlucky' way and happens to hit a busy period while there are many idle periods. Unless i misunderstood your suggestion, that is the main intention behind it, correct? You misunderstand (that's one of my other schedulers :-)). This one's based on the premise that if everything happens as the task expects it will get the amount of CPU bandwidth (over this short period) that it's entitled to. In reality, sometimes it will get more and sometimes less but on average it should get what it deserves. E.g. If you had two tasks with equal nice and both had demands of 90% of a CPU you'd expect them each to get about half of the CPU bandwidth. Now suppose that one of them uses 5ms of CPU each time it got onto the CPU and the other uses 10ms. If these two tasks just round robin with each other the likely outcome is that the one with the 10ms bursts will get twice as much CPU as the other but my proposed method should prevent and cause them to get roughly the same amount of CPU. (I believe this was a scenario that caused problems with O(1) and required a fix at some stage?) Another advantage of this mechanism is that, all else being equal, it will tend to run tasks that use short bursts of CPU ahead of those that use long bursts and this tends to reduce the overall time spent waiting for CPU by all tasks on the system which is good for throughput. I.e. in general, a task that tends to use short bursts of CPU will make other tasks wait less time than will one that tends to use long bursts. So this means that you were right and it is good in the scenario that you suggested even though that wasn't the motivation behind the design. This means that this scheduler should be good for improving latency on servers that aren't fully loaded as well as providing good fairness and responsiveness when the system is fully loaded. (Fingers crossed.) Peter -- Peter Williams [EMAIL PROTECTED] Learning, n. The kind of ignorance distinguishing the studious. -- Ambrose Bierce - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] CFS scheduler, v3
On Sat, Apr 21, 2007 at 09:54:01AM +0200, Ingo Molnar wrote: In practice they can starve a bit when one renices thousands of tasks, so i was thinking about the following special-case: to at least make them easily killable: if a nice 0 task sends a SIGKILL to a nice 19 task then we could 'share' its p-wait_runtime with that nice 19 task and copy the signal sender's nice_offset. This would in essence pass the right to execute over to the killed task, so that it can tear itself down. This cannot be used to gain an 'unfair advantage' because the signal sender spends its own 'right to execute on the CPU', and because the target task cannot execute any user code anymore when it gets a SIGKILL. I suppose this is a special case of the dreaded priority inversion. What of, say, nice 19 tasks holding fs semaphores and/or mutexes that nice -19 tasks are waiting to acquire? Perhaps rt_mutex should be the default mutex implementation. On Sat, Apr 21, 2007 at 09:54:01AM +0200, Ingo Molnar wrote: In any case, it is clear that rq-raw_cpu_load should be used instead of rq-nr_running, when calculating the fair clock, but i begin to like the nice_offset solution too in addition of this: it's effective in practice and starvation-free in theory, and most importantly, it's very simple. We could even make the nice offset granularity tunable, just in case anyone wants to weaken (or strengthen) the effectivity of nice levels. What do you think, can you see any obvious (or less obvious) showstoppers with this approach? -nice_offset's semantics are not meaningful to the end user, regardless of whether it's effective. If there is something to be tuned, it should be relative shares of CPU bandwidth (load_weight) corresponding to each nice level or something else directly observable. The implementation could be -nice_offset, if it suffices. Suppose a table of nice weights like the following is tuned via /proc/: -20 21 0 1 -19 20 1 0.5 -18 19 2 0. -17 18 3 0.25 -16 17 4 0.2 -15 16 5 0.1666 -14 15 6 0.1429 -13 14 7 0.125 -12 13 8 0. -11 12 9 0.1 -10 11 10 0.0909 -9 10 11 0.0833 -8 9 12 0.0769 -7 8 13 0.0714 -6 7 14 0.0666 -5 6 15 0.0625 -4 5 16 0.0588 -3 4 17 0.0555 -2 3 18 0.0526 -1 2 19 0.0476 Essentially 1/(n+1) when n = 0 and 1-n when n 0. Now suppose two CPU hogs, one being nice 1 and the other nice -1, compete against each other. The nice 1 hog should receive 0.5/(2+0.5) = 20% of the CPU, and the nice -1 hog should receive 80% of the CPU. A nice -5 hog vs. a nice 0 hog should have the nice 0 hog receive 1/(6+1) = 14.29% of the CPU and the nice -5 hog receive 85.71% of the CPU. Three hogs, once nice -1, one nice 0, and one nice 1 should have the nice -1 hog receive 2/(2+1+0.5) = 57.14% of the CPU, the nice 0 hog receive 1/(2+1+0.5) = 28.57% of the CPU, and the nice 1 hog receive 0.5/(2+1+0.5) = 14.29% of the CPU. And so on. (Note that these are very strong nice semantics, probably too strong for real use.) For five hogs with nice numbers ranging from -2 to 2, we'd get: -2 43.90% -1 29.27% 0 14.63% 1 7.32% 2 4.88% However this is implemented doesn't matter so long as the semantics are honored. I wouldn't specify too much beyond competing tasks with equivalent sleep/wakeup behavior; one may wish to apply bonuses and penalties to differing sleep/wakeup behavior in the future, even if such is not now done. Basically, the effect of nice numbers should be predictable somehow, and frankly, making it specifiable by the user resolves the issue of conflicting requirements. 
-- wli - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] CFS scheduler, v3
* William Lee Irwin III [EMAIL PROTECTED] wrote: I suppose this is a special case of the dreaded priority inversion. What of, say, nice 19 tasks holding fs semaphores and/or mutexes that nice -19 tasks are waiting to acquire? Perhaps rt_mutex should be the default mutex implementation. while i agree that it could be an issue, lock inversion is nothing really new, so i'd not go _that_ drastic to convert all mutexes to rtmutexes. (i've taken my -rt/PREEMPT_RT hat off) For example reiser3 based systems get pretty laggy on significant reniced load (even with the vanilla scheduler) if CONFIG_PREEMPT_BKL is enabled: reiser3 holds the BKL for extended periods of time so a make -j50 workload can starve it significantly and the tty layer's BKL use makes any sort of keyboard (even over ssh) input laggy. Other locks though are not held this frequently and the mutex implementation is pretty fair for waiters anyway. (the semaphore implementation is not nearly as much fair, and the Big Kernel Semaphore is still struct semaphore based) So i'd really wait for specific workloads to trigger problems, and _maybe_ convert certain mutexes to rtmutexes, on an as-needed basis. In any case, it is clear that rq-raw_cpu_load should be used instead of rq-nr_running, when calculating the fair clock, but i begin to like the nice_offset solution too in addition of this: it's effective in practice and starvation-free in theory, and most importantly, it's very simple. We could even make the nice offset granularity tunable, just in case anyone wants to weaken (or strengthen) the effectivity of nice levels. What do you think, can you see any obvious (or less obvious) showstoppers with this approach? -nice_offset's semantics are not meaningful to the end user, regardless of whether it's effective. [...] yeah, agreed. That's one reason why i didnt make it tunable, it's pretty meaningless to the user. [...] If there is something to be tuned, it should be relative shares of CPU bandwidth (load_weight) corresponding to each nice level or something else directly observable. The implementation could be -nice_offset, if it suffices. Suppose a table of nice weights like the following is tuned via /proc/: -20 21 0 1 -1 2 19 0.0476 Essentially 1/(n+1) when n = 0 and 1-n when n 0. ok, thanks for thinking about it. I have changed the nice weight in CVSv5-to-be so that it defaults to something pretty close to your suggestion: the ratio between a nice 0 loop and a nice 19 loop is now set to about 2%. (This something that users requested for some time, the default ~5% is a tad high when running reniced SETI jobs, etc.) the actual percentage scales almost directly with the nice offset granularity value, but if this should be exposed to users at all, i agree that it would be better to directly expose this as some sort of 'ratio between nice 0 and nice 19 tasks', right? Or some other, more finegrained metric. Percentile is too coarse i think, and using 0.1% units isnt intuitive enough i think. The sysctl handler would then transform that 'human readable' sysctl value into the appropriate internal nice-offset-granularity value (or whatever mechanism the implementation ends up using). I'd not do this as a per-nice-level thing but as a single value that rescales the whole nice level range at once. That's alot less easy to misconfigure and we've got enough nice levels for users to pick from almost arbitrarily, as long as they have the ability to influence the max. does this sound mostly OK to you? 
Ingo - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] CFS scheduler, v3
Ingo Molnar wrote: * Peter Williams [EMAIL PROTECTED] wrote: I retract this suggestion as it's a very bad idea. It introduces the possibility of starvation via the poor sods at the bottom of the queue having their on CPU forever postponed and we all know that even the smallest possibility of starvation will eventually cause problems. I think there should be a rule: Once a task is on the queue its on CPU time is immutable. Yeah, fully agreed. Currently i'm using the simple method of p-nice_offset, which plainly just moves the per nice level areas of the tree far enough apart (by a constant offset) so that lower nice levels rarely interact with higher nice levels. Lower nice levels never truly starve because rq-fair_clock increases deterministically and currently the fair_key values are indeed 'immutable' as you suggest. In practice they can starve a bit when one renices thousands of tasks, so i was thinking about the following special-case: to at least make them easily killable: if a nice 0 task sends a SIGKILL to a nice 19 task then we could 'share' its p-wait_runtime with that nice 19 task and copy the signal sender's nice_offset. This would in essence pass the right to execute over to the killed task, so that it can tear itself down. This cannot be used to gain an 'unfair advantage' because the signal sender spends its own 'right to execute on the CPU', and because the target task cannot execute any user code anymore when it gets a SIGKILL. In any case, it is clear that rq-raw_cpu_load should be used instead of rq-nr_running, when calculating the fair clock, but i begin to like the nice_offset solution too in addition of this: it's effective in practice and starvation-free in theory, and most importantly, it's very simple. We could even make the nice offset granularity tunable, just in case anyone wants to weaken (or strengthen) the effectivity of nice levels. What do you think, can you see any obvious (or less obvious) showstoppers with this approach? I haven't had a close look at it but from the above description it sounds an order of magnitude more complex than I thought it would be. The idea of different nice levels sounds like a recipe for starvation to me (if it works the way it sounds like it works). I guess I'll have to spend more time reading the code because I don't seem to be able to make sense of the above description in any way that doesn't say starvation here we come. Peter -- Peter Williams [EMAIL PROTECTED] Learning, n. The kind of ignorance distinguishing the studious. -- Ambrose Bierce - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] CFS scheduler, v3
Peter Williams wrote: Ingo Molnar wrote: * Peter Williams [EMAIL PROTECTED] wrote: I retract this suggestion as it's a very bad idea. It introduces the possibility of starvation via the poor sods at the bottom of the queue having their on CPU forever postponed and we all know that even the smallest possibility of starvation will eventually cause problems. I think there should be a rule: Once a task is on the queue its on CPU time is immutable. Yeah, fully agreed. Currently i'm using the simple method of p-nice_offset, which plainly just moves the per nice level areas of the tree far enough apart (by a constant offset) so that lower nice levels rarely interact with higher nice levels. Lower nice levels never truly starve because rq-fair_clock increases deterministically and currently the fair_key values are indeed 'immutable' as you suggest. In practice they can starve a bit when one renices thousands of tasks, so i was thinking about the following special-case: to at least make them easily killable: if a nice 0 task sends a SIGKILL to a nice 19 task then we could 'share' its p-wait_runtime with that nice 19 task and copy the signal sender's nice_offset. This would in essence pass the right to execute over to the killed task, so that it can tear itself down. This cannot be used to gain an 'unfair advantage' because the signal sender spends its own 'right to execute on the CPU', and because the target task cannot execute any user code anymore when it gets a SIGKILL. In any case, it is clear that rq-raw_cpu_load should be used instead of rq-nr_running, when calculating the fair clock, but i begin to like the nice_offset solution too in addition of this: it's effective in practice and starvation-free in theory, and most importantly, it's very simple. We could even make the nice offset granularity tunable, just in case anyone wants to weaken (or strengthen) the effectivity of nice levels. What do you think, can you see any obvious (or less obvious) showstoppers with this approach? I haven't had a close look at it but from the above description it sounds an order of magnitude more complex than I thought it would be. The idea of different nice levels sounds like a recipe for starvation to me (if it works the way it sounds like it works). I guess I'll have to spend more time reading the code because I don't seem to be able to make sense of the above description in any way that doesn't say starvation here we come. I'm finding it hard to figure out what the underling principle for the way you're queuing things by reading the code (that's the trouble with reading the code it just tells you what's being done not why -- and sometimes it's even hard to figure out what's being done when there's a lot of indirection). sched-design-CFS.txt isn't much help in this area either. Any chance of a brief description of how it's supposed to work? Key questions are: How do you decide the key value for a task's position in the queue? Is it an absolute time or an offset from the current time? How do you decide when to boot the current task of the queue? Both at wake up of another task and in general play? Peter PS I think that you're trying to do too much in one patch. -- Peter Williams [EMAIL PROTECTED] Learning, n. The kind of ignorance distinguishing the studious. -- Ambrose Bierce - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] CFS scheduler, v3
* William Lee Irwin III [EMAIL PROTECTED] wrote: Suppose a table of nice weights like the following is tuned via /proc/: -20 -> 21, 0 -> 1, -1 -> 2, 19 -> 0.0476. Essentially 1/(n+1) when n >= 0 and 1-n when n < 0. On Sat, Apr 21, 2007 at 10:57:29AM +0200, Ingo Molnar wrote: ok, thanks for thinking about it. I have changed the nice weight in CFS -v5-to-be so that it defaults to something pretty close to your suggestion: the ratio between a nice 0 loop and a nice 19 loop is now set to about 2%. (This is something that users requested for some time, the default ~5% is a tad high when running reniced SETI jobs, etc.) Okay. Maybe what I suggested is too weak vs. too strong. I didn't actually have it in mind as a proposal for general use, but maybe it is good for such. I had more in mind tunability in general, but it's all good. I'd think some curve gentler in intermediate nice levels and stronger at the tails might be better. On Sat, Apr 21, 2007 at 10:57:29AM +0200, Ingo Molnar wrote: the actual percentage scales almost directly with the nice offset granularity value, but if this should be exposed to users at all, i agree that it would be better to directly expose this as some sort of 'ratio between nice 0 and nice 19 tasks', right? Or some other, more fine-grained metric. Percentile is too coarse i think, and using 0.1% units isn't intuitive enough i think. The sysctl handler would then transform that 'human readable' sysctl value into the appropriate internal nice-offset-granularity value (or whatever mechanism the implementation ends up using). I vaguely liked specifying the full table, but maybe it's too much for a real user interface. 4-digit or 5-digit fixed point decimal sounds reasonable. On Sat, Apr 21, 2007 at 10:57:29AM +0200, Ingo Molnar wrote: I'd not do this as a per-nice-level thing but as a single value that rescales the whole nice level range at once. That's a lot less easy to misconfigure and we've got enough nice levels for users to pick from almost arbitrarily, as long as they have the ability to influence the max. does this sound mostly OK to you? For the most part, yes. I've been mostly looking at how effectively the prioritization algorithms work. I'll be wrapping up writing a testcase to measure all this soon. The basic idea is to take the weights as inputs somehow and then check to see that they're honored. What's appropriate for end-users is a very different thing from what might be appropriate for me. I won't have trouble fiddling with the code, so please do design around what the best interface for end-users might be. -- wli - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
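[Editorial sketch] For illustration only, here is the weight curve quoted above written out in C, so the pairing in the flattened table (-20 -> 21, 0 -> 1, -1 -> 2, 19 -> 0.0476) is easier to see. The thread does not specify the in-kernel representation (fixed point, table driven or otherwise), so this only shows the shape of the curve; note the quoted value for nice 19 (0.0476, about 1/21) is only approximately 1/(19+1), hence the "essentially".

#include <stdio.h>

/* "essentially 1/(n+1) when n >= 0 and 1-n when n < 0" */
static double nice_weight(int n)
{
	return n >= 0 ? 1.0 / (n + 1) : 1.0 - n;
}

int main(void)
{
	int levels[] = { -20, -1, 0, 19 };
	size_t i;

	for (i = 0; i < sizeof(levels) / sizeof(levels[0]); i++)
		printf("nice %3d -> weight %.4f\n", levels[i], nice_weight(levels[i]));
	return 0;
}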
Re: [patch] CFS scheduler, v3
William Lee Irwin III wrote: William Lee Irwin III wrote: This essentially doesn't look correct because while you want to enforce the CPU bandwidth allocation, this doesn't have much to do with that apart from the CPU bandwidth appearing as a term. It's more properly a rate of service as opposed to a time at which anything should happen or a number useful for predicting such. When service should begin more properly depends on the other tasks in the system and a number of other decisions that are part of the scheduling policy. On Sat, Apr 21, 2007 at 10:23:07AM +1000, Peter Williams wrote: This model takes all of those into consideration. The idea is not just to predict but to use the calculated time to decide when to boot the current process off the CPU (if it doesn't leave voluntarily) and put this one on. This more or less removes the need to give each task a predetermined chunk of CPU when they go on to the CPU. This should, in general, reduce the number of context switches as tasks get to run until they've finished what they're doing or another task becomes higher priority rather than being booted off after an arbitrary time interval. (If this ever gets tried it will be interesting to see if this prediction comes true.) BTW Even if Ingo doesn't choose to try this model, I'll probably make a patch (way in the future after Ingo's changes are settled) to try it out myself. I think I smoked out what you were doing. William Lee Irwin III wrote: If you want to choose a "quasi-inter-arrival time" to achieve the specified CPU bandwidth allocation, this would be it, but to use that to actually enforce the CPU bandwidth allocation, you would need to take into account the genuine inter-arrival time to choose an actual time for service to begin. In other words, this should be a quota for the task to have waited. If it's not waited long enough, then it should be delayed by the difference to achieve the inter-arrival time you're trying to enforce. If it's waited longer, it should be executed sooner modulo other constraints, and perhaps even credited for future scheduling cycles. On Sat, Apr 21, 2007 at 10:23:07AM +1000, Peter Williams wrote: The idea isn't to enforce the bandwidth entitlement to the extent of throttling tasks if they exceed their entitlement and there's no other tasks ready to use the CPU. This is mainly because the bandwidth entitlement isn't fixed -- it's changing constantly as the number and type of runnable tasks changes. Well, a little hysteresis will end up throttling in such a manner anyway as a side-effect, Think of this as a calming influence :-) or you'll get anomalies. Say two tasks with equal entitlements compete, where one sleeps for 1/3 of the time and the other is fully CPU-bound. If only the times when they're in direct competition are split 50/50, then the CPU-bound task gets 2/3 and the sleeper 1/3, which is not the intended effect. I don't believe this model will be very vulnerable to it, though. Nor me. William Lee Irwin III wrote: In order to partially avoid underallocating CPU bandwidth to p, one should track the time last spent sleeping and do the following: On Sat, Apr 21, 2007 at 10:23:07AM +1000, Peter Williams wrote: Yes I made a mistake in omitting to take into account sleep interval. See another e-mail to Ingo correcting this problem. I took it to be less trivial of an error than it was. No big deal. No, you were right, it was definitely a NON trivial error. 
William Lee Irwin III wrote: In order to do better, longer-term history is required, On Sat, Apr 21, 2007 at 10:23:07AM +1000, Peter Williams wrote: The half life of the Kalman filter (roughly equivalent to a running average) used to calculate the averages determines how much history is taken into account. It could be made configurable (at least, until enough real life experience was available to decide on the best value to use). A Kalman filter would do better than a running average. I'm all for it. As a long time user of Kalman filters I tend to think of them as the same thing. I use the term running average when talking about the idea behind a scheduler because I think that more people will understand what the general idea is. When it comes to implementation I always replace the idea of "running average" with a roughly equivalent Kalman filter. William Lee Irwin III wrote: To attempt to maintain an infinite history of bandwidth underutilization to be credited too far in the future would enable potentially long-term overutilization when such credits are cashed en masse for a sustained period of time. At some point you have to say "use it or lose it;" over a shorter period of time some smoothing is still admissible and even desirable. On Sat, Apr 21, 2007 at 10:23:07AM +1000, Peter Williams wrote: Yes, that's why I suggest a running average over the last few scheduling cycles for the task. But thinking about it some more I'm now not so
Re: [patch] CFS scheduler, v3
William Lee Irwin III wrote: >> This essentially doesn't look correct because while you want to enforce >> the CPU bandwidth allocation, this doesn't have much to do with that >> apart from the CPU bandwidth appearing as a term. It's more properly >> a rate of service as opposed to a time at which anything should happen >> or a number useful for predicting such. When service should begin more >> properly depends on the other tasks in the system and a number of other >> decisions that are part of the scheduling policy. On Sat, Apr 21, 2007 at 10:23:07AM +1000, Peter Williams wrote: > This model takes all of those into consideration. The idea is not just > to predict but to use the calculated time to decide when to boot the > current process of the CPU (if it doesn't leave voluntarily) and put > this one on. This more or less removes the need to give each task a > predetermined chunk of CPU when they go on to the CPU. This should, in > general, reduce the number context switches as tasks get to run until > they've finished what they're doing or another task becomes higher > priority rather than being booted off after an arbitrary time interval. > (If this ever gets tried it will be interesting to see if this > prediction comes true.) > BTW Even if Ingo doesn't choose to try this model, I'll probably make a > patch (way in the future after Ingo's changes are settled) to try it out > myself. I think I smoked out what you were doing. William Lee Irwin III wrote: >> If you want to choose a "quasi-inter-arrival time" to achieve the >> specified CPU bandwidth allocation, this would be it, but to use that >> to actually enforce the CPU bandwidth allocation, you would need to >> take into account the genuine inter-arrival time to choose an actual >> time for service to begin. In other words, this should be a quota for >> the task to have waited. If it's not waited long enough, then it should >> be delayed by the difference to achieve the inter-arrival time you're >> trying to enforce. If it's waited longer, it should be executed >> sooner modulo other constraints, and perhaps even credited for future >> scheduling cycles. On Sat, Apr 21, 2007 at 10:23:07AM +1000, Peter Williams wrote: > The idea isn't to enforce the bandwidth entitlement to the extent of > throttling tasks if they exceed their entitlement and there's no other > tasks ready to use the CPU. This is mainly because the bandwidth > entitlement isn't fixed -- it's changing constantly as the number and > type of runnable tasks changes. Well, a little hysteresis will end up throttling in such a manner anyway as a side-effect, or you'll get anomalies. Say two tasks with equal entitlements compete, where one sleeps for 1/3 of the time and the other is fully CPU-bound. If only the times when they're in direct competition are split 50/50, then the CPU-bound task gets 2/3 and the sleeper 1/3, which is not the intended effect. I don't believe this model will be very vulnerable to it, though. William Lee Irwin III wrote: >> In order to partially avoid underallocating CPU bandwidth to p, one >> should track the time last spent sleeping and do the following: On Sat, Apr 21, 2007 at 10:23:07AM +1000, Peter Williams wrote: > Yes I made a mistake in omitting to take into account sleep interval. > See another e-mail to Ingo correcting this problem. I took it to be less trivial of an error than it was. No big deal. 
William Lee Irwin III wrote: >> In order to do better, longer-term history is required, On Sat, Apr 21, 2007 at 10:23:07AM +1000, Peter Williams wrote: > The half life of the Kalman filter (roughly equivalent to a running > average) used to calculate the averages determines how much history is > taken into account. It could be made configurable (at least, until > enough real life experience was available to decide on the best value to > use). A Kalman filter would do better than a running average. I'm all for it. William Lee Irwin III wrote: >> To attempt to maintain an infinite history of >> bandwidth underutilization to be credited too far in the future would >> enable potentially long-term overutilization when such credits are >> cashed en masse for a sustained period of time. At some point you have >> to say "use it or lose it;" over a shorter period of time some smoothing >> is still admissible and even desirable. On Sat, Apr 21, 2007 at 10:23:07AM +1000, Peter Williams wrote: > Yes, that's why I suggest a running average over the last few scheduling > cycles for the task. But thinking about it some more I'm now not so > sure. The lack of apparent "smoothness" when I've done this sort of > thing with raw rather than smooth data (in modifications to the current > dynamic priority based scheduler model) is usually noticed by running > top and seeing wildly fluctuating dynamic priorities. I'm not sure that > the actual responsiveness of the system reflects this. So I'm now > willing to reserve my
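[Editorial sketch] The "running average with a configurable half life" being discussed can be kept with the cheap fixed-point update Peter describes elsewhere in the thread (roughly one add and one shift per scheduling cycle). The sketch below is only illustrative: the shift constant, field names and user-space harness are assumptions, and a real Kalman filter would additionally track a variance term.

#include <stdio.h>

#define AVG_SHIFT 3	/* hypothetical decay constant: larger means a longer half life */

struct task_stats {
	long long avg_cpu_per_cycle;	/* fixed point, scaled by 2^AVG_SHIFT */
};

/* exponentially weighted running average: one add and one shift per cycle */
static void update_avg_cpu_per_cycle(struct task_stats *ts, long long cpu_ns)
{
	ts->avg_cpu_per_cycle += cpu_ns - (ts->avg_cpu_per_cycle >> AVG_SHIFT);
}

/* shift right to recover the nanosecond value, as described in the thread */
static long long avg_cpu_ns(const struct task_stats *ts)
{
	return ts->avg_cpu_per_cycle >> AVG_SHIFT;
}

int main(void)
{
	struct task_stats ts = { 0 };
	int i;

	for (i = 0; i < 32; i++)
		update_avg_cpu_per_cycle(&ts, 5000000LL);	/* 5ms bursts */
	printf("avg cpu per cycle ~ %lld ns\n", avg_cpu_ns(&ts));
	return 0;
}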
Re: [patch] CFS scheduler, v3
William Lee Irwin III wrote: On Fri, Apr 20, 2007 at 10:10:45AM +1000, Peter Williams wrote: I have a suggestion I'd like to make that addresses both nice and fairness at the same time. As I understand the basic principle behind this scheduler it to work out a time by which a task should make it onto the CPU and then place it into an ordered list (based on this value) of tasks waiting for the CPU. I think that this is a great idea and my suggestion is with regard to a method for working out this time that takes into account both fairness and nice. Hmm. Let me take a look... On Fri, Apr 20, 2007 at 10:10:45AM +1000, Peter Williams wrote: First suppose we have the following metrics available in addition to what's already provided. rq->avg_weight_load /* a running average of the weighted load on the CPU */ p->avg_cpu_per_cycle /* the average time in nsecs that p spends on the CPU each scheduling cycle */ I'm suspicious of mean service times not paired with mean inter-arrival times. On Fri, Apr 20, 2007 at 10:10:45AM +1000, Peter Williams wrote: where a scheduling cycle for a task starts when it is placed on the queue after waking or being preempted and ends when it is taken off the CPU either voluntarily or after being preempted. So p->avg_cpu_per_cycle is just the average amount of time p spends on the CPU each time it gets on to the CPU. Sorry for the long explanation here but I just wanted to make sure there was no chance that "scheduling cycle" would be construed as some mechanism being imposed on the scheduler.) I went and looked up priority queueing queueing theory garbage and re-derived various things I needed. The basics check out. Probably no one cares that I checked. On Fri, Apr 20, 2007 at 10:10:45AM +1000, Peter Williams wrote: We can then define: effective_weighted_load = max(rq->raw_weighted_load, rq->avg_weighted_load) If p is just waking (i.e. it's not on the queue and its load_weight is not included in rq->raw_weighted_load) and we need to queue it, we say that the maximum time (in all fairness) that p should have to wait to get onto the CPU is: expected_wait = p->avg_cpu_per_cycle * effective_weighted_load / p->load_weight You're right. The time that the task spent sleeping before being woken should be subtracted from this value. If the answer is less than or equal to zero pre-emption should occur. This doesn't look right, probably because the scaling factor of p->avg_cpu_per_cycle is the reciprocal of its additive contribution to the ->avg_weight_load as opposed to a direct estimate of its initial delay or waiting time before completing its current requested service. p->load_weight/effective_weighted_load more properly represents an entitlement to CPU bandwidth. Yes. But expected_wait isn't entitlement it's its inverse. p->avg_cpu_per_cycle/(p->load_weight/effective_weighted_load) would be more like the expected time spent on the runqueue When I went to school that would be just another way of expressing the equation that I expressed. (whether waiting to run or actually running) for a time interval spent in a runnable state and the expected time runnable and waiting to run in such an interval would be p->avg_cpu_per_cycle*(1-effective_weighted_load/p->load_weight), Neither represents the initial delay between entering the runqeueue and first acquiring the CPU, but that's a bit hard to figure out without deciding the scheduling policy up-front anyway. 
This essentially doesn't look correct because while you want to enforce the CPU bandwidth allocation, this doesn't have much to do with that apart from the CPU bandwidth appearing as a term. It's more properly a rate of service as opposed to a time at which anything should happen or a number useful for predicting such. When service should begin more properly depends on the other tasks in the system and a number of other decisions that are part of the scheduling policy. This model takes all of those into consideration. The idea is not just to predict but to use the calculated time to decide when to boot the current process of the CPU (if it doesn't leave voluntarily) and put this one on. This more or less removes the need to give each task a predetermined chunk of CPU when they go on to the CPU. This should, in general, reduce the number context switches as tasks get to run until they've finished what they're doing or another task becomes higher priority rather than being booted off after an arbitrary time interval. (If this ever gets tried it will be interesting to see if this prediction comes true.) BTW Even if Ingo doesn't choose to try this model, I'll probably make a patch (way in the future after Ingo's changes are settled) to try it out myself. If you want to choose a "quasi-inter-arrival time" to achieve the specified CPU bandwidth allocation, this would be it, but to use that to actually enforce the CPU bandwidth allocation, you would
Re: [patch] CFS scheduler, v3
On Fri, 20 Apr 2007, William Lee Irwin III wrote: >> I'm not really convinced it's all that worthwhile of an optimization, >> essentially for the same reasons as you, but presumably there's a >> benchmark result somewhere that says it matters. I've just not seen it. On Fri, Apr 20, 2007 at 12:44:55PM -0700, Christoph Lameter wrote: > If it is true that we frequently remotely write the per cpu runqueue > data then we may have a NUMA scalability issue. From the discussion on Suresh's thread, it appears to have sped up a database benchmark 0.5%. Last I checked it was workload-dependent, but there were things that hammer it. I mostly know of the remote wakeup issue, but there could be other things besides wakeups that do it, too. -- wli - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] CFS scheduler, v3
On Fri, 20 Apr 2007, William Lee Irwin III wrote: > I'm not really convinced it's all that worthwhile of an optimization, > essentially for the same reasons as you, but presumably there's a > benchmark result somewhere that says it matters. I've just not seen it. If it is true that we frequently remotely write the per cpu runqueue data then we may have a NUMA scalability issue. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] CFS scheduler, v3
On Fri, Apr 20, 2007 at 12:24:17PM -0700, Christoph Lameter wrote: >>> False sharing for a per cpu data structure? Are we updating that >>> structure from other processors? On Fri, 20 Apr 2007, William Lee Irwin III wrote: >> Primarily in the load balancer, but also in wakeups. On Fri, Apr 20, 2007 at 12:33:13PM -0700, Christoph Lameter wrote: > That is fairly rare I think. What other variables that are also written > frequently would cause false sharing? I wrote that backward, sorry. Cross-CPU wakeups' frequency depends heavily on the workload. Probably the only other case I can think of is io_schedule() but that's not really significant. I'm not really convinced it's all that worthwhile of an optimization, essentially for the same reasons as you, but presumably there's a benchmark result somewhere that says it matters. I've just not seen it. -- wli - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] CFS scheduler, v3
On Fri, 20 Apr 2007, William Lee Irwin III wrote: > On Wed, 18 Apr 2007, William Lee Irwin III wrote: > >> > >> Mark the runqueues cacheline_aligned_in_smp to avoid false sharing. > > On Fri, Apr 20, 2007 at 12:24:17PM -0700, Christoph Lameter wrote: > > False sharing for a per cpu data structure? Are we updating that > > structure from other processors? > > Primarily in the load balancer, but also in wakeups. That is fairly rare I think. What other variables that are also written frequently would cause false sharing? - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] CFS scheduler, v3
On Wed, 18 Apr 2007, William Lee Irwin III wrote: >> >> Mark the runqueues cacheline_aligned_in_smp to avoid false sharing. On Fri, Apr 20, 2007 at 12:24:17PM -0700, Christoph Lameter wrote: > False sharing for a per cpu data structure? Are we updating that > structure from other processors? Primarily in the load balancer, but also in wakeups. -- wli - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] CFS scheduler, v3
On Wed, 18 Apr 2007, William Lee Irwin III wrote: > > Mark the runqueues cacheline_aligned_in_smp to avoid false sharing. False sharing for a per cpu data structure? Are we updating that structure from other processors? - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
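[Editorial sketch] A user-space illustration of the point of the quoted patch line: if each CPU's runqueue is padded and aligned to a cache line, a write to one CPU's runqueue cannot dirty a line shared with its neighbour. The 64-byte line size, the struct contents and the GCC attribute spelling are assumptions made for the example; the kernel's ____cacheline_aligned_in_smp annotation achieves the same effect.

#include <stdio.h>

#define CACHE_LINE_SIZE 64	/* assumed line size, for the example only */
#define NR_CPUS 4

/* stand-in for struct rq; aligned so that instances never share a line */
struct fake_rq {
	unsigned long nr_running;
	unsigned long raw_weighted_load;
} __attribute__((aligned(CACHE_LINE_SIZE)));

static struct fake_rq runqueues[NR_CPUS];

int main(void)
{
	printf("sizeof(struct fake_rq) = %zu\n", sizeof(struct fake_rq));
	printf("&runqueues[0] = %p\n", (void *)&runqueues[0]);
	printf("&runqueues[1] = %p\n", (void *)&runqueues[1]);
	return 0;
}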
Re: [patch] CFS scheduler, v3
* Peter Williams <[EMAIL PROTECTED]> wrote: > BTW Given that I'm right and dynamic priorities have been dispensed > with what do you intend exporting (in their place) to user space for > display by top and similar? well i thought of only displaying static ones, i.e. like the current patch does. Ingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] CFS scheduler, v3
Peter Williams wrote: Ingo Molnar wrote: - bugfix: use constant offset factor for nice levels instead of sched_granularity_ns. Thus nice levels work even if someone sets sched_granularity_ns to 0. NOTE: nice support is still naive, i'll address the many nice level related suggestions in -v4. I have a suggestion I'd like to make that addresses both nice and fairness at the same time. As I understand the basic principle behind this scheduler is to work out a time by which a task should make it onto the CPU and then place it into an ordered list (based on this value) of tasks waiting for the CPU. I think that this is a great idea and my suggestion is with regard to a method for working out this time that takes into account both fairness and nice. First suppose we have the following metrics available in addition to what's already provided. rq->avg_weight_load /* a running average of the weighted load on the CPU */ p->avg_cpu_per_cycle /* the average time in nsecs that p spends on the CPU each scheduling cycle */ where a scheduling cycle for a task starts when it is placed on the queue after waking or being preempted and ends when it is taken off the CPU either voluntarily or after being preempted. So p->avg_cpu_per_cycle is just the average amount of time p spends on the CPU each time it gets on to the CPU. (Sorry for the long explanation here but I just wanted to make sure there was no chance that "scheduling cycle" would be construed as some mechanism being imposed on the scheduler.) We can then define: effective_weighted_load = max(rq->raw_weighted_load, rq->avg_weighted_load) If p is just waking (i.e. it's not on the queue and its load_weight is not included in rq->raw_weighted_load) and we need to queue it, we say that the maximum time (in all fairness) that p should have to wait to get onto the CPU is: expected_wait = p->avg_cpu_per_cycle * effective_weighted_load / p->load_weight I just realized that this is wrong for the case where p is being woken. In that case the length of the last sleep should be subtracted from the above value and if the result is negative pre-empt straight away. So expected_wait becomes: expected_wait = p->avg_cpu_per_cycle * effective_weighted_load / p->load_weight - p->length_of_last_sleep As the length of the last sleep will already be calculated during the wake process there's no real need for it to be a task field and a local variable could be used instead. For a task being requeued during pre-emption the equation would be: expected_wait = time_just_spent_on_the_cpu * effective_weighted_load / p->load_weight as there was zero sleep since last time on CPU. Using the actual time on the CPU so far this time (instead of the average) will compensate the task for being pre-empted. Calculating p->avg_cpu_per_cycle costs one add, one multiply and one shift right per scheduling cycle of the task. An additional cost is that you need a shift right to get the nanosecond value from the value stored in the task struct. (i.e. the above code is simplified to give the general idea). The average would be number of cycles based rather than time based and (happily) this simplifies the calculations. If you don't like using the average CPU time per cycle, you could just use the length of time on the CPU for the last time the task was on the CPU. This would simplify things a lot. 
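[Editorial sketch] As a sanity check of the two formulas just given, here they are as a small self-contained C program: the waking case subtracts the last sleep (a result of zero or less corresponds to "pre-empt straight away") and the requeue case uses the time just spent on the CPU. The numbers in main() are made up, and none of this is code from any posted patch.

#include <stdio.h>

static long long expected_wait_on_wake(long long avg_cpu_per_cycle,
                                       long long effective_weighted_load,
                                       long long load_weight,
                                       long long length_of_last_sleep)
{
	long long wait = avg_cpu_per_cycle * effective_weighted_load / load_weight;

	return wait - length_of_last_sleep;	/* <= 0 means pre-empt straight away */
}

static long long expected_wait_on_requeue(long long time_just_spent_on_the_cpu,
                                          long long effective_weighted_load,
                                          long long load_weight)
{
	/* zero sleep since the task was last on the CPU */
	return time_just_spent_on_the_cpu * effective_weighted_load / load_weight;
}

int main(void)
{
	/* made-up numbers: 5ms average burst, load three times this task's weight */
	printf("waking after 20ms sleep:   %lld ns\n",
	       expected_wait_on_wake(5000000, 3072, 1024, 20000000));
	printf("requeued after 2ms on CPU: %lld ns\n",
	       expected_wait_on_requeue(2000000, 3072, 1024));
	return 0;
}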
I've been thinking about the "jerkiness" (or lack of "smoothness") that I said would occur if smoothed averages weren't used and I've realised that what I was talking about was observed jerkiness in the dynamic priorities of the tasks. As this scheduler has dispensed with dynamic priorities (as far as I can see) the jerkiness probably won't be apparent. BTW Given that I'm right and dynamic priorities have been dispensed with what do you intend exporting (in their place) to user space for display by top and similar? Peter -- Peter Williams [EMAIL PROTECTED] "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] CFS scheduler, v3
On Fri, Apr 20, 2007 at 10:10:45AM +1000, Peter Williams wrote: > I have a suggestion I'd like to make that addresses both nice and > fairness at the same time. As I understand the basic principle behind > this scheduler it to work out a time by which a task should make it onto > the CPU and then place it into an ordered list (based on this value) of > tasks waiting for the CPU. I think that this is a great idea and my > suggestion is with regard to a method for working out this time that > takes into account both fairness and nice. Hmm. Let me take a look... On Fri, Apr 20, 2007 at 10:10:45AM +1000, Peter Williams wrote: > First suppose we have the following metrics available in addition to > what's already provided. > rq->avg_weight_load /* a running average of the weighted load on the CPU */ > p->avg_cpu_per_cycle /* the average time in nsecs that p spends on the > CPU each scheduling cycle */ I'm suspicious of mean service times not paired with mean inter-arrival times. On Fri, Apr 20, 2007 at 10:10:45AM +1000, Peter Williams wrote: > where a scheduling cycle for a task starts when it is placed on the > queue after waking or being preempted and ends when it is taken off the > CPU either voluntarily or after being preempted. So > p->avg_cpu_per_cycle is just the average amount of time p spends on the > CPU each time it gets on to the CPU. Sorry for the long explanation > here but I just wanted to make sure there was no chance that "scheduling > cycle" would be construed as some mechanism being imposed on the scheduler.) I went and looked up priority queueing queueing theory garbage and re-derived various things I needed. The basics check out. Probably no one cares that I checked. On Fri, Apr 20, 2007 at 10:10:45AM +1000, Peter Williams wrote: > We can then define: > effective_weighted_load = max(rq->raw_weighted_load, rq->avg_weighted_load) > If p is just waking (i.e. it's not on the queue and its load_weight is > not included in rq->raw_weighted_load) and we need to queue it, we say > that the maximum time (in all fairness) that p should have to wait to > get onto the CPU is: > expected_wait = p->avg_cpu_per_cycle * effective_weighted_load / > p->load_weight This doesn't look right, probably because the scaling factor of p->avg_cpu_per_cycle is the reciprocal of its additive contribution to the ->avg_weight_load as opposed to a direct estimate of its initial delay or waiting time before completing its current requested service. p->load_weight/effective_weighted_load more properly represents an entitlement to CPU bandwidth. p->avg_cpu_per_cycle/(p->load_weight/effective_weighted_load) would be more like the expected time spent on the runqueue (whether waiting to run or actually running) for a time interval spent in a runnable state and the expected time runnable and waiting to run in such an interval would be p->avg_cpu_per_cycle*(1-effective_weighted_load/p->load_weight), Neither represents the initial delay between entering the runqeueue and first acquiring the CPU, but that's a bit hard to figure out without deciding the scheduling policy up-front anyway. This essentially doesn't look correct because while you want to enforce the CPU bandwidth allocation, this doesn't have much to do with that apart from the CPU bandwidth appearing as a term. It's more properly a rate of service as opposed to a time at which anything should happen or a number useful for predicting such. 
When service should begin more properly depends on the other tasks in the system and a number of other decisions that are part of the scheduling policy. If you want to choose a "quasi-inter-arrival time" to achieve the specified CPU bandwidth allocation, this would be it, but to use that to actually enforce the CPU bandwidth allocation, you would need to take into account the genuine inter-arrival time to choose an actual time for service to begin. In other words, this should be a quota for the task to have waited. If it's not waited long enough, then it should be delayed by the difference to achieve the inter-arrival time you're trying to enforce. If it's waited longer, it should be executed sooner modulo other constraints, and perhaps even credited for future scheduling cycles. In order to partially avoid underallocating CPU bandwidth to p, one should track the time last spent sleeping and do the following: last_sleep = now - p->last_went_to_sleep; wait = p->avg_cpu_per_cycle*effective_weighted_load/p->load_weight; wait = wait > last_sleep ? wait - last_sleep : 0; By and large the scheduler can deterministically choose waits on the runqueue but it doesn't actually do that. Some additional corrections for delays beyond those decided up-front while on the runqueue or getting scheduled early on the runqueue may also help, though I'm not as interested in them as I am the one for sleep: last_wait = now - p->last_taken_off_the_cpu; wait =
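[Editorial sketch] The three-line fragment quoted above, transcribed into a self-contained C helper so the clamping at zero is explicit; this is the difference from the waking-case formula earlier in the thread, which is allowed to go negative. Field names simply mirror the pseudocode, and the struct, sample values and user-space wrapper are assumptions for illustration only.

#include <stdio.h>

struct task {
	long long avg_cpu_per_cycle;
	long long load_weight;
	long long last_went_to_sleep;
};

static long long wait_quota(const struct task *p, long long now,
                            long long effective_weighted_load)
{
	long long last_sleep = now - p->last_went_to_sleep;
	long long wait = p->avg_cpu_per_cycle * effective_weighted_load /
	                 p->load_weight;

	/* a task that has already waited (slept) long enough owes no further wait */
	return wait > last_sleep ? wait - last_sleep : 0;
}

int main(void)
{
	struct task p = { 5000000, 1024, 0 };	/* 5ms bursts, unit weight */

	printf("quota after 20ms sleep: %lld ns\n",
	       wait_quota(&p, 20000000, 2048));
	return 0;
}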
Re: [patch] CFS scheduler, v3
Peter Williams wrote: Ingo Molnar wrote: * Peter Williams <[EMAIL PROTECTED]> wrote: - bugfix: use constant offset factor for nice levels instead of sched_granularity_ns. Thus nice levels work even if someone sets sched_granularity_ns to 0. NOTE: nice support is still naive, i'll address the many nice level related suggestions in -v4. I have a suggestion I'd like to make that addresses both nice and fairness at the same time. As I understand the basic principle behind this scheduler it to work out a time by which a task should make it onto the CPU and then place it into an ordered list (based on this value) of tasks waiting for the CPU. I think that this is a great idea [...] yes, that's exactly the main idea behind CFS, and thanks for the compliment :) Under this concept the scheduler never really has to guess: every scheduler decision derives straight from the relatively simple one-sentence (!) scheduling concept outlined above. Everything that tasks 'get' is something they 'earned' before and all the scheduler does are micro-decisions based on math with the nanosec-granularity values. Both the rbtree and nanosec accounting are a straight consequence of this too: they are the tools that allow the implementation of this concept in the highest-quality way. It's certainly a very exciting experiment to me and the feedback 'from the field' is very promising so far. [...] and my suggestion is with regard to a method for working out this time that takes into account both fairness and nice. First suppose we have the following metrics available in addition to what's already provided. rq->avg_weight_load /* a running average of the weighted load on the CPU */ p->avg_cpu_per_cycle /* the average time in nsecs that p spends on the CPU each scheduling cycle */ yes. rq->nr_running is really just a first-level approximation of rq->raw_weighted_load. I concentrated on the 'nice 0' case initially. I appreciate that the notion of basing the expected wait on the task's average cpu use per scheduling cycle is counter intuitive but I believe that (if you think about it) you'll see that it actually makes sense. hm. So far i tried to not do any statistical approach anywhere: the p->wait_runtime metric (which drives the task ordering) is in essence an absolutely precise 'integral' of the 'expected runtimes' that the task observes and hence is a precise "load-average as observed by the task" To me this is statistics :-) in itself. Every time we base some metric on an average value we introduce noise into the system. i definitely agree with your suggestion that CFS should use a nice-scaled metric for 'load' instead of the current rq->nr_running, but regarding the basic calculations i'd rather lean towards using rq->raw_weighted_load. Hm? This can result in jerkiness (in my experience) but using the smoothed version is certainly something that can be tried later rather than sooner. Perhaps just something to bear in mind as a solution to "jerkiness" if it manifests. your suggestion concentrates on the following scenario: if a task happens to schedule in an 'unlucky' way and happens to hit a busy period while there are many idle periods. Unless i misunderstood your suggestion, that is the main intention behind it, correct? You misunderstand (that's one of my other schedulers :-)). This one's based on the premise that if everything happens as the task expects it will get the amount of CPU bandwidth (over this short period) that it's entitled to. 
In reality, sometimes it will get more and sometimes less but on average it should get what it deserves. E.g. If you had two tasks with equal nice and both had demands of 90% of a CPU you'd expect them each to get about half of the CPU bandwidth. Now suppose that one of them uses 5ms of CPU each time it gets onto the CPU and the other uses 10ms. If these two tasks just round robin with each other the likely outcome is that the one with the 10ms bursts will get twice as much CPU as the other but my proposed method should prevent this and cause them to get roughly the same amount of CPU. (I believe this was a scenario that caused problems with O(1) and required a fix at some stage?) BTW this has the advantage that the decay rate used in calculating the task's statistics can be used to control how quickly the scheduler reacts to changes in the task's behaviour. I think that, with this model, if the current task hasn't surrendered the CPU when the next task on the queue's "on CPU" time arrives, the current task should be pre-empted in favour of that task. I'm not sure what would be the best way to implement this. Peter -- Peter Williams [EMAIL PROTECTED] "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED]
Re: [patch] CFS scheduler, v3
Ingo Molnar wrote: * Peter Williams <[EMAIL PROTECTED]> wrote: - bugfix: use constant offset factor for nice levels instead of sched_granularity_ns. Thus nice levels work even if someone sets sched_granularity_ns to 0. NOTE: nice support is still naive, i'll address the many nice level related suggestions in -v4. I have a suggestion I'd like to make that addresses both nice and fairness at the same time. As I understand the basic principle behind this scheduler it to work out a time by which a task should make it onto the CPU and then place it into an ordered list (based on this value) of tasks waiting for the CPU. I think that this is a great idea [...] yes, that's exactly the main idea behind CFS, and thanks for the compliment :) Under this concept the scheduler never really has to guess: every scheduler decision derives straight from the relatively simple one-sentence (!) scheduling concept outlined above. Everything that tasks 'get' is something they 'earned' before and all the scheduler does are micro-decisions based on math with the nanosec-granularity values. Both the rbtree and nanosec accounting are a straight consequence of this too: they are the tools that allow the implementation of this concept in the highest-quality way. It's certainly a very exciting experiment to me and the feedback 'from the field' is very promising so far. [...] and my suggestion is with regard to a method for working out this time that takes into account both fairness and nice. First suppose we have the following metrics available in addition to what's already provided. rq->avg_weight_load /* a running average of the weighted load on the CPU */ p->avg_cpu_per_cycle /* the average time in nsecs that p spends on the CPU each scheduling cycle */ yes. rq->nr_running is really just a first-level approximation of rq->raw_weighted_load. I concentrated on the 'nice 0' case initially. I appreciate that the notion of basing the expected wait on the task's average cpu use per scheduling cycle is counter intuitive but I believe that (if you think about it) you'll see that it actually makes sense. hm. So far i tried to not do any statistical approach anywhere: the p->wait_runtime metric (which drives the task ordering) is in essence an absolutely precise 'integral' of the 'expected runtimes' that the task observes and hence is a precise "load-average as observed by the task" To me this is statistics :-) in itself. Every time we base some metric on an average value we introduce noise into the system. i definitely agree with your suggestion that CFS should use a nice-scaled metric for 'load' instead of the current rq->nr_running, but regarding the basic calculations i'd rather lean towards using rq->raw_weighted_load. Hm? This can result in jerkiness (in my experience) but using the smoothed version is certainly something that can be tried later rather than sooner. Perhaps just something to bear in mind as a solution to "jerkiness" if it manifests. your suggestion concentrates on the following scenario: if a task happens to schedule in an 'unlucky' way and happens to hit a busy period while there are many idle periods. Unless i misunderstood your suggestion, that is the main intention behind it, correct? You misunderstand (that's one of my other schedulers :-)). This one's based on the premise that if everything happens as the task expects it will get the amount of CPU bandwidth (over this short period) that it's entitled to. 
In reality, sometimes it will get more and sometimes less but on average it should get what it deserves. E.g. If you had two tasks with equal nice and both had demands of 90% of a CPU you'd expect them each to get about half of the CPU bandwidth. Now suppose that one of them uses 5ms of CPU each time it gets onto the CPU and the other uses 10ms. If these two tasks just round robin with each other the likely outcome is that the one with the 10ms bursts will get twice as much CPU as the other but my proposed method should prevent this and cause them to get roughly the same amount of CPU. (I believe this was a scenario that caused problems with O(1) and required a fix at some stage?) BTW this has the advantage that the decay rate used in calculating the task's statistics can be used to control how quickly the scheduler reacts to changes in the task's behaviour. Peter -- Peter Williams [EMAIL PROTECTED] "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
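[Editorial sketch] The arithmetic behind the 5ms/10ms example above, spelled out: under plain round-robin, bandwidth splits in proportion to burst length rather than 50/50. This is only a restatement of the claim as a toy calculation, not scheduler code.

#include <stdio.h>

int main(void)
{
	double burst_a = 5.0, burst_b = 10.0;	/* ms on the CPU per cycle */
	double total = burst_a + burst_b;

	/* straight alternation: each task's share is its burst over the sum */
	printf("round-robin split: A=%.0f%% B=%.0f%%\n",
	       100.0 * burst_a / total, 100.0 * burst_b / total);
	/* the expected_wait ordering described above aims to pull this back to 50/50 */
	return 0;
}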
Re: [patch] CFS scheduler, v3
On Fri, Apr 20, 2007 at 04:02:41PM +1000, Peter Williams wrote: > Willy Tarreau wrote: > >On Fri, Apr 20, 2007 at 10:10:45AM +1000, Peter Williams wrote: > >>Ingo Molnar wrote: > >>>- bugfix: use constant offset factor for nice levels instead of > >>> sched_granularity_ns. Thus nice levels work even if someone sets > >>> sched_granularity_ns to 0. NOTE: nice support is still naive, i'll > >>> address the many nice level related suggestions in -v4. > >>I have a suggestion I'd like to make that addresses both nice and > >>fairness at the same time. As I understand the basic principle behind > >>this scheduler it to work out a time by which a task should make it onto > >>the CPU and then place it into an ordered list (based on this value) of > >>tasks waiting for the CPU. I think that this is a great idea and my > >>suggestion is with regard to a method for working out this time that > >>takes into account both fairness and nice. > >> > >>First suppose we have the following metrics available in addition to > >>what's already provided. > >> > >>rq->avg_weight_load /* a running average of the weighted load on the CPU > >>*/ > >>p->avg_cpu_per_cycle /* the average time in nsecs that p spends on the > >>CPU each scheduling cycle */ > >> > >>where a scheduling cycle for a task starts when it is placed on the > >>queue after waking or being preempted and ends when it is taken off the > >>CPU either voluntarily or after being preempted. So > >>p->avg_cpu_per_cycle is just the average amount of time p spends on the > >>CPU each time it gets on to the CPU. Sorry for the long explanation > >>here but I just wanted to make sure there was no chance that "scheduling > >>cycle" would be construed as some mechanism being imposed on the > >>scheduler.) > >> > >>We can then define: > >> > >>effective_weighted_load = max(rq->raw_weighted_load, > >>rq->avg_weighted_load) > >> > >>If p is just waking (i.e. it's not on the queue and its load_weight is > >>not included in rq->raw_weighted_load) and we need to queue it, we say > >>that the maximum time (in all fairness) that p should have to wait to > >>get onto the CPU is: > >> > >>expected_wait = p->avg_cpu_per_cycle * effective_weighted_load / > >>p->load_weight > >> > >>Calculating p->avg_cpu_per_cycle costs one add, one multiply and one > >>shift right per scheduling cycle of the task. An additional cost is > >>that you need a shift right to get the nanosecond value from value > >>stored in the task struct. (i.e. the above code is simplified to give > >>the general idea). The average would be number of cycles based rather > >>than time based and (happily) this simplifies the calculations. > >> > >>If the expected time to get onto the CPU (i.e. expected_wait plus the > >>current time) for p is earlier than the equivalent time for the > >>currently running task then preemption of that task would be justified. > > > >I 100% agree on this method because I came to nearly the same conclusion on > >paper about 1 year ago. What I'd like to add is that the expected wake up > >time > >is not the most precise criterion for fairness. > > It's not the expected wake up time being computed. It's the expected > time that the task is selected to run after being put on the queue > (either as a result of waking or being pre-empted). Sorry, maybe I've not chosen the appropriate words. When I said "wakeup time", I meant "the date at which the task must reach the CPU". They are the same when the task is alone, but different when other tasks are run before. 
But my goal definitely was to avoid using this alone and use the expected completion time instead, from which the "on CPU" date can be computed. The on-CPU time should be reduced if the task ran too long on previous iterations, and enlarged if the task did not consume all of its last timeslice, hence the idea behind the credit. I think that it is important for streaming processes (eg: mp3 players) which will not always consume the same amount of CPU but are very close to an average across a few timeslices. [...] > In practice, it might be prudent to use the idea of maximum expected "on > CPU" time instead of the average (especially if you're going to use this > for triggering pre-emption). yes > This would be the average plus two or > three times the standard deviation and has the down side that you have to > calculate the standard deviation and that means a square root (of course > we could use some less mathematically rigorous metric to approximate the > standard deviation). You just don't do the square root, and compare the squared values (the variances) instead. > The advantage would be that streamer type programs > such as audio/video applications would have very small standard > deviations and would hence get earlier expected "on CPU" times than > other tasks with similar averages but more random distribution. I like this idea. Regards, Willy - To unsubscribe from this list:
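[Editorial sketch] One reading of Willy's point about avoiding the square root: if the spread of a task's on-CPU bursts is tracked as a variance (mean squared deviation), the ordering of two tasks' spreads can be decided by comparing the variances directly, since sqrt() is monotonic. The EWMA constants, field names and sample values below are assumptions; combining the spread with the average (average plus two or three standard deviations) would still need either a root or a squared-form comparison.

#include <stdio.h>

struct burst_stats {
	double avg;	/* EWMA of burst length, ns */
	double var;	/* EWMA of squared deviation, ns^2 */
};

static void update(struct burst_stats *s, double sample, double alpha)
{
	double d = sample - s->avg;

	s->avg += alpha * d;
	s->var += alpha * (d * d - s->var);
}

/* nonzero if a's bursts are more predictable (smaller spread) than b's */
static int steadier(const struct burst_stats *a, const struct burst_stats *b)
{
	return a->var < b->var;	/* no square root needed for the comparison */
}

int main(void)
{
	struct burst_stats mp3 = { 0, 0 }, hog = { 0, 0 };
	int i;

	for (i = 0; i < 100; i++) {
		update(&mp3, 2e6 + (i % 2) * 1e5, 0.125);	/* steady streamer */
		update(&hog, 1e6 + (i % 7) * 3e6, 0.125);	/* bursty CPU hog */
	}
	printf("streamer steadier than hog: %d\n", steadier(&mp3, &hog));
	return 0;
}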
Re: [patch] CFS scheduler, v3
* Peter Williams <[EMAIL PROTECTED]> wrote: > > - bugfix: use constant offset factor for nice levels instead of > > sched_granularity_ns. Thus nice levels work even if someone sets > > sched_granularity_ns to 0. NOTE: nice support is still naive, i'll > > address the many nice level related suggestions in -v4. > > I have a suggestion I'd like to make that addresses both nice and > fairness at the same time. As I understand the basic principle behind > this scheduler it to work out a time by which a task should make it > onto the CPU and then place it into an ordered list (based on this > value) of tasks waiting for the CPU. I think that this is a great idea > [...] yes, that's exactly the main idea behind CFS, and thanks for the compliment :) Under this concept the scheduler never really has to guess: every scheduler decision derives straight from the relatively simple one-sentence (!) scheduling concept outlined above. Everything that tasks 'get' is something they 'earned' before and all the scheduler does are micro-decisions based on math with the nanosec-granularity values. Both the rbtree and nanosec accounting are a straight consequence of this too: they are the tools that allow the implementation of this concept in the highest-quality way. It's certainly a very exciting experiment to me and the feedback 'from the field' is very promising so far. > [...] and my suggestion is with regard to a method for working out > this time that takes into account both fairness and nice. > > First suppose we have the following metrics available in addition to > what's already provided. > > rq->avg_weight_load /* a running average of the weighted load on the > CPU */ p->avg_cpu_per_cycle /* the average time in nsecs that p spends > on the CPU each scheduling cycle */ yes. rq->nr_running is really just a first-level approximation of rq->raw_weighted_load. I concentrated on the 'nice 0' case initially. > I appreciate that the notion of basing the expected wait on the task's > average cpu use per scheduling cycle is counter intuitive but I > believe that (if you think about it) you'll see that it actually makes > sense. hm. So far i tried to not do any statistical approach anywhere: the p->wait_runtime metric (which drives the task ordering) is in essence an absolutely precise 'integral' of the 'expected runtimes' that the task observes and hence is a precise "load-average as observed by the task" in itself. Every time we base some metric on an average value we introduce noise into the system. i definitely agree with your suggestion that CFS should use a nice-scaled metric for 'load' instead of the current rq->nr_running, but regarding the basic calculations i'd rather lean towards using rq->raw_weighted_load. Hm? your suggestion concentrates on the following scenario: if a task happens to schedule in an 'unlucky' way and happens to hit a busy period while there are many idle periods. Unless i misunderstood your suggestion, that is the main intention behind it, correct? Ingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] CFS scheduler, v3
Peter Williams wrote: Willy Tarreau wrote: On Fri, Apr 20, 2007 at 10:10:45AM +1000, Peter Williams wrote: Ingo Molnar wrote: - bugfix: use constant offset factor for nice levels instead of sched_granularity_ns. Thus nice levels work even if someone sets sched_granularity_ns to 0. NOTE: nice support is still naive, i'll address the many nice level related suggestions in -v4. I have a suggestion I'd like to make that addresses both nice and fairness at the same time. As I understand the basic principle behind this scheduler it to work out a time by which a task should make it onto the CPU and then place it into an ordered list (based on this value) of tasks waiting for the CPU. I think that this is a great idea and my suggestion is with regard to a method for working out this time that takes into account both fairness and nice. First suppose we have the following metrics available in addition to what's already provided. rq->avg_weight_load /* a running average of the weighted load on the CPU */ p->avg_cpu_per_cycle /* the average time in nsecs that p spends on the CPU each scheduling cycle */ where a scheduling cycle for a task starts when it is placed on the queue after waking or being preempted and ends when it is taken off the CPU either voluntarily or after being preempted. So p->avg_cpu_per_cycle is just the average amount of time p spends on the CPU each time it gets on to the CPU. Sorry for the long explanation here but I just wanted to make sure there was no chance that "scheduling cycle" would be construed as some mechanism being imposed on the scheduler.) We can then define: effective_weighted_load = max(rq->raw_weighted_load, rq->avg_weighted_load) If p is just waking (i.e. it's not on the queue and its load_weight is not included in rq->raw_weighted_load) and we need to queue it, we say that the maximum time (in all fairness) that p should have to wait to get onto the CPU is: expected_wait = p->avg_cpu_per_cycle * effective_weighted_load / p->load_weight Calculating p->avg_cpu_per_cycle costs one add, one multiply and one shift right per scheduling cycle of the task. An additional cost is that you need a shift right to get the nanosecond value from value stored in the task struct. (i.e. the above code is simplified to give the general idea). The average would be number of cycles based rather than time based and (happily) this simplifies the calculations. If the expected time to get onto the CPU (i.e. expected_wait plus the current time) for p is earlier than the equivalent time for the currently running task then preemption of that task would be justified. I 100% agree on this method because I came to nearly the same conclusion on paper about 1 year ago. What I'd like to add is that the expected wake up time is not the most precise criterion for fairness. It's not the expected wake up time being computed. It's the expected time that the task is selected to run after being put on the queue (either as a result of waking or being pre-empted). I think that your comments below are mainly invalid because of this understanding. The expected completion time is better. When you have one task t1 which is expected to run for T1 nanosecs and another task t2 which is expected to run for T2, what is important for the user for fairness is when the task completes its work. If t1 should wake up at time W1 and t2 at W2, then the list should be ordered by comparing W1+T1 and W2+T2. This is a point where misunderstanding results in a wrong conclusion. 
However, the notion of expected completion time is useful. I'd calculate the expected completion time for the task as it goes on to the CPU and if while it is running a new task is queued whose expected "on CPU" time is before the current task's expected end time you can think about preempting when the new task's "on CPU" time arrives -- how hard/easy this would be to implement is moot. What I like with this method is that it remains fair with nice tasks because in order to renice a task tN, you just have to change TN, and if it has to run shorter, it can be executed before CPU hogs and stay there for a very short time. Also, I found that if we want to respect interactivity, we must conserve a credit for each task. This is one point where the misunderstanding results in an invalid conclusion. There is no need for credits as the use of the average time on CPU each cycle makes the necessary corrections that credits would be used to address. It is a bounded amount of CPU time left to be used. When the task t3 has the right to use T3 nsecs, and wakes up at W3, if it does not spend T3 nsec on the CPU but only N3 < T3, the difference is kept as a bounded credit C3, computed like this : C3 = MIN(MAX_CREDIT, C3 + T3 - N3) And if a CPU hog uses more than its assigned time slice due to scheduler resolution, then C3 can become negative (and bounded too) : C3 = MAX(MIN_CREDIT, C3 + T3 - N3) Now how is
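[Editorial sketch] A minimal sketch of the bounded credit described above: unused entitlement carries forward, overruns are charged, and the result is clamped so it can neither grow nor shrink without bound. The constants are invented for the example, and reading the first quoted formula as a MIN against MAX_CREDIT (so the credit really is bounded above) is an interpretation, not something stated verbatim in the thread.

#include <stdio.h>

#define MAX_CREDIT  ( 20000000LL)	/* +20ms, hypothetical upper bound */
#define MIN_CREDIT  (-20000000LL)	/* -20ms, hypothetical lower bound */

static long long clamp_ll(long long v, long long lo, long long hi)
{
	if (v < lo)
		return lo;
	if (v > hi)
		return hi;
	return v;
}

/* t_alloc: entitled slice this cycle; n_used: time actually consumed */
static long long update_credit(long long credit, long long t_alloc,
                               long long n_used)
{
	return clamp_ll(credit + t_alloc - n_used, MIN_CREDIT, MAX_CREDIT);
}

int main(void)
{
	long long c = 0;

	c = update_credit(c, 10000000, 3000000);	/* finished early: +7ms credit */
	c = update_credit(c, 10000000, 25000000);	/* overran the slice: -15ms   */
	printf("credit = %lld ns\n", c);
	return 0;
}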
Re: [patch] CFS scheduler, v3
Willy Tarreau wrote: On Fri, Apr 20, 2007 at 10:10:45AM +1000, Peter Williams wrote: Ingo Molnar wrote: - bugfix: use constant offset factor for nice levels instead of sched_granularity_ns. Thus nice levels work even if someone sets sched_granularity_ns to 0. NOTE: nice support is still naive, i'll address the many nice level related suggestions in -v4. I have a suggestion I'd like to make that addresses both nice and fairness at the same time. As I understand the basic principle behind this scheduler it to work out a time by which a task should make it onto the CPU and then place it into an ordered list (based on this value) of tasks waiting for the CPU. I think that this is a great idea and my suggestion is with regard to a method for working out this time that takes into account both fairness and nice. First suppose we have the following metrics available in addition to what's already provided. rq->avg_weight_load /* a running average of the weighted load on the CPU */ p->avg_cpu_per_cycle /* the average time in nsecs that p spends on the CPU each scheduling cycle */ where a scheduling cycle for a task starts when it is placed on the queue after waking or being preempted and ends when it is taken off the CPU either voluntarily or after being preempted. So p->avg_cpu_per_cycle is just the average amount of time p spends on the CPU each time it gets on to the CPU. Sorry for the long explanation here but I just wanted to make sure there was no chance that "scheduling cycle" would be construed as some mechanism being imposed on the scheduler.) We can then define: effective_weighted_load = max(rq->raw_weighted_load, rq->avg_weighted_load) If p is just waking (i.e. it's not on the queue and its load_weight is not included in rq->raw_weighted_load) and we need to queue it, we say that the maximum time (in all fairness) that p should have to wait to get onto the CPU is: expected_wait = p->avg_cpu_per_cycle * effective_weighted_load / p->load_weight Calculating p->avg_cpu_per_cycle costs one add, one multiply and one shift right per scheduling cycle of the task. An additional cost is that you need a shift right to get the nanosecond value from value stored in the task struct. (i.e. the above code is simplified to give the general idea). The average would be number of cycles based rather than time based and (happily) this simplifies the calculations. If the expected time to get onto the CPU (i.e. expected_wait plus the current time) for p is earlier than the equivalent time for the currently running task then preemption of that task would be justified. I 100% agree on this method because I came to nearly the same conclusion on paper about 1 year ago. What I'd like to add is that the expected wake up time is not the most precise criterion for fairness. It's not the expected wake up time being computed. It's the expected time that the task is selected to run after being put on the queue (either as a result of waking or being pre-empted). I think that your comments below are mainly invalid because of this understanding. The expected completion time is better. When you have one task t1 which is expected to run for T1 nanosecs and another task t2 which is expected to run for T2, what is important for the user for fairness is when the task completes its work. If t1 should wake up at time W1 and t2 at W2, then the list should be ordered by comparing W1+T1 and W2+T2. This is a point where misunderstanding results in a wrong conclusion. However, the notion of expected completion time is useful. 
I'd calculate the expected completion time for the task as it goes on to the CPU and if while it is running a new task is queued whose expected "on CPU" time is before the current task's expected end time you can think about preempting when the new task's "on CPU" time arrives -- how hard/easy this would be to implement is moot. What I like with this method is that it remains fair with nice tasks because in order to renice a task tN, you just have to change TN, and if it has to run shorter, it can be executed before CPU hogs and stay there for a very short time. Also, I found that if we want to respect interactivity, we must conserve a credit for each task. This is one point where the misunderstanding results in an invalid conclusion. There is no need for credits as the use of the average time on CPU each cycle makes the necessary corrections that credits would be used to address. It is a bounded amount of CPU time left to be used. When the task t3 has the right to use T3 nsecs, and wakes up at W3, if it does not spend T3 nsec on the CPU, but only N3 < T3, then we have a credit C3 computed like this: C3 = MIN(MAX_CREDIT, C3 + T3 - N3) And if a CPU hog uses more than its assigned time slice due to scheduler resolution, then C3 can become negative (and bounded too): C3 = MAX(MIN_CREDIT, C3 + T3 - N3) Now how is the credit used? [...] I think this is overcomplicating matters. However, some compensating credit might be in order for tasks that get pre-empted e.g. base their next expected "on CPU" time on the difference between their average on CPU time and the amount they got to use before they were pre-empted.
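Below is a minimal C sketch of the expected-wait calculation proposed above. The structure and field names (raw_weighted_load, avg_weighted_load, avg_cpu_per_cycle, load_weight) are taken from the mail and are stand-ins, not fields of any real kernel tree; the arithmetic is the formula as stated, with the max() of the raw and averaged load spelled out.

        #include <stdint.h>

        /* hypothetical containers for the metrics named in the proposal */
        struct rq_sketch {
                uint64_t raw_weighted_load;     /* current weighted load */
                uint64_t avg_weighted_load;     /* running average of the above */
        };

        struct task_sketch {
                uint64_t avg_cpu_per_cycle;     /* avg nsecs on CPU per scheduling cycle */
                uint64_t load_weight;           /* nice-derived weight */
        };

        /* maximum time, in all fairness, a newly woken task should wait for the CPU */
        static uint64_t expected_wait_ns(const struct task_sketch *p,
                                         const struct rq_sketch *rq)
        {
                uint64_t load = rq->raw_weighted_load > rq->avg_weighted_load ?
                                rq->raw_weighted_load : rq->avg_weighted_load;

                return p->avg_cpu_per_cycle * load / p->load_weight;
        }

The queue would then be ordered on now + expected_wait_ns(), and a value earlier than the running task's equivalent time is what justifies preemption in the scheme above.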
Re: [patch] CFS scheduler, v3
* Peter Williams [EMAIL PROTECTED] wrote: - bugfix: use constant offset factor for nice levels instead of sched_granularity_ns. Thus nice levels work even if someone sets sched_granularity_ns to 0. NOTE: nice support is still naive, i'll address the many nice level related suggestions in -v4. I have a suggestion I'd like to make that addresses both nice and fairness at the same time. As I understand it, the basic principle behind this scheduler is to work out a time by which a task should make it onto the CPU and then place it into an ordered list (based on this value) of tasks waiting for the CPU. I think that this is a great idea [...] yes, that's exactly the main idea behind CFS, and thanks for the compliment :) Under this concept the scheduler never really has to guess: every scheduler decision derives straight from the relatively simple one-sentence (!) scheduling concept outlined above. Everything that tasks 'get' is something they 'earned' before and all the scheduler does are micro-decisions based on math with the nanosec-granularity values. Both the rbtree and nanosec accounting are a straight consequence of this too: they are the tools that allow the implementation of this concept in the highest-quality way. It's certainly a very exciting experiment to me and the feedback 'from the field' is very promising so far. [...] and my suggestion is with regard to a method for working out this time that takes into account both fairness and nice. First suppose we have the following metrics available in addition to what's already provided. rq->avg_weight_load /* a running average of the weighted load on the CPU */ p->avg_cpu_per_cycle /* the average time in nsecs that p spends on the CPU each scheduling cycle */ yes. rq->nr_running is really just a first-level approximation of rq->raw_weighted_load. I concentrated on the 'nice 0' case initially. I appreciate that the notion of basing the expected wait on the task's average cpu use per scheduling cycle is counter intuitive but I believe that (if you think about it) you'll see that it actually makes sense. hm. So far i tried to not do any statistical approach anywhere: the p->wait_runtime metric (which drives the task ordering) is in essence an absolutely precise 'integral' of the 'expected runtimes' that the task observes and hence is a precise load-average as observed by the task in itself. Every time we base some metric on an average value we introduce noise into the system. i definitely agree with your suggestion that CFS should use a nice-scaled metric for 'load' instead of the current rq->nr_running, but regarding the basic calculations i'd rather lean towards using rq->raw_weighted_load. Hm? your suggestion concentrates on the following scenario: if a task happens to schedule in an 'unlucky' way and happens to hit a busy period while there are many idle periods. Unless i misunderstood your suggestion, that is the main intention behind it, correct? Ingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
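As a toy illustration of the 'precise integral' Ingo contrasts with averaged metrics (this is not the actual CFS v3 code, and nice weighting is ignored), the accounting can be pictured as a signed nanosecond balance that grows by the task's fair share while it waits and shrinks while it runs:

        #include <stdint.h>

        struct toy_task {
                int64_t wait_runtime;   /* ns the scheduler still owes (+) or is owed (-) */
        };

        /* the task sat runnable-but-waiting for delta_ns while nr_running tasks competed */
        static void account_waiting(struct toy_task *p, uint64_t delta_ns,
                                    unsigned int nr_running)
        {
                p->wait_runtime += delta_ns / nr_running;       /* its fair share of that time */
        }

        /* the task just ran for delta_ns */
        static void account_running(struct toy_task *p, uint64_t delta_ns)
        {
                p->wait_runtime -= delta_ns;
        }

Ordering runnable tasks by this balance in the rbtree needs no statistical smoothing at all, which is the contrast being drawn with the averaged metrics suggested earlier in the thread.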
Re: [patch] CFS scheduler, v3
On Fri, Apr 20, 2007 at 04:02:41PM +1000, Peter Williams wrote: Willy Tarreau wrote: On Fri, Apr 20, 2007 at 10:10:45AM +1000, Peter Williams wrote: Ingo Molnar wrote: - bugfix: use constant offset factor for nice levels instead of sched_granularity_ns. Thus nice levels work even if someone sets sched_granularity_ns to 0. NOTE: nice support is still naive, i'll address the many nice level related suggestions in -v4. I have a suggestion I'd like to make that addresses both nice and fairness at the same time. As I understand the basic principle behind this scheduler it to work out a time by which a task should make it onto the CPU and then place it into an ordered list (based on this value) of tasks waiting for the CPU. I think that this is a great idea and my suggestion is with regard to a method for working out this time that takes into account both fairness and nice. First suppose we have the following metrics available in addition to what's already provided. rq-avg_weight_load /* a running average of the weighted load on the CPU */ p-avg_cpu_per_cycle /* the average time in nsecs that p spends on the CPU each scheduling cycle */ where a scheduling cycle for a task starts when it is placed on the queue after waking or being preempted and ends when it is taken off the CPU either voluntarily or after being preempted. So p-avg_cpu_per_cycle is just the average amount of time p spends on the CPU each time it gets on to the CPU. Sorry for the long explanation here but I just wanted to make sure there was no chance that scheduling cycle would be construed as some mechanism being imposed on the scheduler.) We can then define: effective_weighted_load = max(rq-raw_weighted_load, rq-avg_weighted_load) If p is just waking (i.e. it's not on the queue and its load_weight is not included in rq-raw_weighted_load) and we need to queue it, we say that the maximum time (in all fairness) that p should have to wait to get onto the CPU is: expected_wait = p-avg_cpu_per_cycle * effective_weighted_load / p-load_weight Calculating p-avg_cpu_per_cycle costs one add, one multiply and one shift right per scheduling cycle of the task. An additional cost is that you need a shift right to get the nanosecond value from value stored in the task struct. (i.e. the above code is simplified to give the general idea). The average would be number of cycles based rather than time based and (happily) this simplifies the calculations. If the expected time to get onto the CPU (i.e. expected_wait plus the current time) for p is earlier than the equivalent time for the currently running task then preemption of that task would be justified. I 100% agree on this method because I came to nearly the same conclusion on paper about 1 year ago. What I'd like to add is that the expected wake up time is not the most precise criterion for fairness. It's not the expected wake up time being computed. It's the expected time that the task is selected to run after being put on the queue (either as a result of waking or being pre-empted). Sorry, maybe I've not chosen the appropriate words. When I said wakeup time, I meant the date at which the task must reach the CPU. They are the same when the task is alone, but different when other tasks are run before. But my goal definitely was to avoid using this alone and use the expected completion time instead, from which the on CPU date can be computed. 
The on-CPU time should be reduced if the task ran too long on previous iterations, and enlarged if the task did not consume all of its last timeslice, hence the idea behind the credit. I think that it is important for streaming processes (eg: mp3 players) which will not always consume the same amount of CPU but stay very close to an average across a few timeslices. [...] In practice, it might be prudent to use the idea of maximum expected on CPU time instead of the average (especially if you're going to use this for triggering pre-emption). yes This would be the average plus two or three times the standard deviation and has the down side that you have to calculate the standard deviation and that means a square root (of course we could use some less mathematically rigorous metric to approximate the standard deviation). You just don't have to do the square root, you can compare the unsquared values instead. The advantage would be that streamer type programs such as audio/video applications would have very small standard deviations and would hence get earlier expected on CPU times than other tasks with similar averages but more random distribution. I like this idea. Regards, Willy - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
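Willy's 'compare the unsquared values' point fits in a few lines: to test whether mean plus k standard deviations exceeds some bound, square the slack instead of square-rooting the variance. The field names, the value of k and the use of plain integers are illustrative assumptions:

        #include <stdint.h>

        struct burst_stats {
                uint64_t mean_ns;       /* running mean of the task's on-CPU bursts */
                uint64_t var_ns2;       /* running estimate of their variance, in ns^2 */
        };

        /* does mean + k*stddev exceed limit_ns?  (no square root needed) */
        static int upper_bound_exceeds(const struct burst_stats *s,
                                       uint64_t limit_ns, unsigned int k)
        {
                uint64_t slack;

                if (limit_ns <= s->mean_ns)
                        return 1;
                slack = limit_ns - s->mean_ns;
                /* mean + k*sd > limit  <=>  k^2 * var > (limit - mean)^2 */
                return (uint64_t)k * k * s->var_ns2 > slack * slack;
        }

A streamer with tiny variance then gets an expected "on CPU" bound barely above its mean, while a bursty task with the same mean is pushed later, which is the behaviour both posters are after.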
Re: [patch] CFS scheduler, v3
Ingo Molnar wrote: * Peter Williams [EMAIL PROTECTED] wrote: - bugfix: use constant offset factor for nice levels instead of sched_granularity_ns. Thus nice levels work even if someone sets sched_granularity_ns to 0. NOTE: nice support is still naive, i'll address the many nice level related suggestions in -v4. I have a suggestion I'd like to make that addresses both nice and fairness at the same time. As I understand the basic principle behind this scheduler it to work out a time by which a task should make it onto the CPU and then place it into an ordered list (based on this value) of tasks waiting for the CPU. I think that this is a great idea [...] yes, that's exactly the main idea behind CFS, and thanks for the compliment :) Under this concept the scheduler never really has to guess: every scheduler decision derives straight from the relatively simple one-sentence (!) scheduling concept outlined above. Everything that tasks 'get' is something they 'earned' before and all the scheduler does are micro-decisions based on math with the nanosec-granularity values. Both the rbtree and nanosec accounting are a straight consequence of this too: they are the tools that allow the implementation of this concept in the highest-quality way. It's certainly a very exciting experiment to me and the feedback 'from the field' is very promising so far. [...] and my suggestion is with regard to a method for working out this time that takes into account both fairness and nice. First suppose we have the following metrics available in addition to what's already provided. rq-avg_weight_load /* a running average of the weighted load on the CPU */ p-avg_cpu_per_cycle /* the average time in nsecs that p spends on the CPU each scheduling cycle */ yes. rq-nr_running is really just a first-level approximation of rq-raw_weighted_load. I concentrated on the 'nice 0' case initially. I appreciate that the notion of basing the expected wait on the task's average cpu use per scheduling cycle is counter intuitive but I believe that (if you think about it) you'll see that it actually makes sense. hm. So far i tried to not do any statistical approach anywhere: the p-wait_runtime metric (which drives the task ordering) is in essence an absolutely precise 'integral' of the 'expected runtimes' that the task observes and hence is a precise load-average as observed by the task To me this is statistics :-) in itself. Every time we base some metric on an average value we introduce noise into the system. i definitely agree with your suggestion that CFS should use a nice-scaled metric for 'load' instead of the current rq-nr_running, but regarding the basic calculations i'd rather lean towards using rq-raw_weighted_load. Hm? This can result in jerkiness (in my experience) but using the smoothed version is certainly something that can be tried later rather than sooner. Perhaps just something to bear in mind as a solution to jerkiness if it manifests. your suggestion concentrates on the following scenario: if a task happens to schedule in an 'unlucky' way and happens to hit a busy period while there are many idle periods. Unless i misunderstood your suggestion, that is the main intention behind it, correct? You misunderstand (that's one of my other schedulers :-)). This one's based on the premise that if everything happens as the task expects it will get the amount of CPU bandwidth (over this short period) that it's entitled to. In reality, sometimes it will get more and sometimes less but on average it should get what it deserves. 
E.g. If you had two tasks with equal nice and both had demands of 90% of a CPU you'd expect them each to get about half of the CPU bandwidth. Now suppose that one of them uses 5ms of CPU each time it gets onto the CPU and the other uses 10ms. If these two tasks just round robin with each other the likely outcome is that the one with the 10ms bursts will get twice as much CPU as the other but my proposed method should prevent this and cause them to get roughly the same amount of CPU. (I believe this was a scenario that caused problems with O(1) and required a fix at some stage?) BTW this has the advantage that the decay rate used in calculating the task's statistics can be used to control how quickly the scheduler reacts to changes in the task's behaviour. Peter -- Peter Williams [EMAIL PROTECTED] Learning, n. The kind of ignorance distinguishing the studious. -- Ambrose Bierce - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] CFS scheduler, v3
Peter Williams wrote: Ingo Molnar wrote: * Peter Williams [EMAIL PROTECTED] wrote: - bugfix: use constant offset factor for nice levels instead of sched_granularity_ns. Thus nice levels work even if someone sets sched_granularity_ns to 0. NOTE: nice support is still naive, i'll address the many nice level related suggestions in -v4. I have a suggestion I'd like to make that addresses both nice and fairness at the same time. As I understand the basic principle behind this scheduler it to work out a time by which a task should make it onto the CPU and then place it into an ordered list (based on this value) of tasks waiting for the CPU. I think that this is a great idea [...] yes, that's exactly the main idea behind CFS, and thanks for the compliment :) Under this concept the scheduler never really has to guess: every scheduler decision derives straight from the relatively simple one-sentence (!) scheduling concept outlined above. Everything that tasks 'get' is something they 'earned' before and all the scheduler does are micro-decisions based on math with the nanosec-granularity values. Both the rbtree and nanosec accounting are a straight consequence of this too: they are the tools that allow the implementation of this concept in the highest-quality way. It's certainly a very exciting experiment to me and the feedback 'from the field' is very promising so far. [...] and my suggestion is with regard to a method for working out this time that takes into account both fairness and nice. First suppose we have the following metrics available in addition to what's already provided. rq-avg_weight_load /* a running average of the weighted load on the CPU */ p-avg_cpu_per_cycle /* the average time in nsecs that p spends on the CPU each scheduling cycle */ yes. rq-nr_running is really just a first-level approximation of rq-raw_weighted_load. I concentrated on the 'nice 0' case initially. I appreciate that the notion of basing the expected wait on the task's average cpu use per scheduling cycle is counter intuitive but I believe that (if you think about it) you'll see that it actually makes sense. hm. So far i tried to not do any statistical approach anywhere: the p-wait_runtime metric (which drives the task ordering) is in essence an absolutely precise 'integral' of the 'expected runtimes' that the task observes and hence is a precise load-average as observed by the task To me this is statistics :-) in itself. Every time we base some metric on an average value we introduce noise into the system. i definitely agree with your suggestion that CFS should use a nice-scaled metric for 'load' instead of the current rq-nr_running, but regarding the basic calculations i'd rather lean towards using rq-raw_weighted_load. Hm? This can result in jerkiness (in my experience) but using the smoothed version is certainly something that can be tried later rather than sooner. Perhaps just something to bear in mind as a solution to jerkiness if it manifests. your suggestion concentrates on the following scenario: if a task happens to schedule in an 'unlucky' way and happens to hit a busy period while there are many idle periods. Unless i misunderstood your suggestion, that is the main intention behind it, correct? You misunderstand (that's one of my other schedulers :-)). This one's based on the premise that if everything happens as the task expects it will get the amount of CPU bandwidth (over this short period) that it's entitled to. 
In reality, sometimes it will get more and sometimes less but on average it should get what it deserves. E.g. If you had two tasks with equal nice and both had demands of 90% of a CPU you'd expect them each to get about half of the CPU bandwidth. Now suppose that one of them uses 5ms of CPU each time it gets onto the CPU and the other uses 10ms. If these two tasks just round robin with each other the likely outcome is that the one with the 10ms bursts will get twice as much CPU as the other but my proposed method should prevent this and cause them to get roughly the same amount of CPU. (I believe this was a scenario that caused problems with O(1) and required a fix at some stage?) BTW this has the advantage that the decay rate used in calculating the task's statistics can be used to control how quickly the scheduler reacts to changes in the task's behaviour. I think that, with this model, if the current task hasn't surrendered the CPU when the next task on the queue's "on CPU" time arrives, the current task should be pre-empted in favour of that task. I'm not sure what would be the best way to implement this. Peter -- Peter Williams [EMAIL PROTECTED] Learning, n. The kind of ignorance distinguishing the studious. -- Ambrose Bierce - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED]
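To make the 5ms/10ms example concrete: with two runnable tasks of equal weight w, effective_weighted_load is roughly 2w, so expected_wait = avg_cpu_per_cycle * 2w / w = 2 * avg_cpu_per_cycle. The 5ms-burst task is therefore promised a slot within about 10ms of being queued while the 10ms-burst task gets one within about 20ms, so the short-burst task is readmitted roughly twice as often and the split converges towards 50/50 instead of the 1/3 versus 2/3 that plain round robin over repeated 5ms+10ms periods would give.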
Re: [patch] CFS scheduler, v3
* Peter Williams [EMAIL PROTECTED] wrote: BTW Given that I'm right and dynamic priorities have been dispensed with what do you intend exporting (in their place) to user space for display by top and similar? well i thought of only displaying static ones, i.e. like the current patch does. Ingo - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] CFS scheduler, v3
Peter Williams wrote: Ingo Molnar wrote: - bugfix: use constant offset factor for nice levels instead of sched_granularity_ns. Thus nice levels work even if someone sets sched_granularity_ns to 0. NOTE: nice support is still naive, i'll address the many nice level related suggestions in -v4. I have a suggestion I'd like to make that addresses both nice and fairness at the same time. As I understand it, the basic principle behind this scheduler is to work out a time by which a task should make it onto the CPU and then place it into an ordered list (based on this value) of tasks waiting for the CPU. I think that this is a great idea and my suggestion is with regard to a method for working out this time that takes into account both fairness and nice. First suppose we have the following metrics available in addition to what's already provided. rq->avg_weight_load /* a running average of the weighted load on the CPU */ p->avg_cpu_per_cycle /* the average time in nsecs that p spends on the CPU each scheduling cycle */ where a scheduling cycle for a task starts when it is placed on the queue after waking or being preempted and ends when it is taken off the CPU either voluntarily or after being preempted. So p->avg_cpu_per_cycle is just the average amount of time p spends on the CPU each time it gets on to the CPU. (Sorry for the long explanation here but I just wanted to make sure there was no chance that "scheduling cycle" would be construed as some mechanism being imposed on the scheduler.) We can then define: effective_weighted_load = max(rq->raw_weighted_load, rq->avg_weighted_load) If p is just waking (i.e. it's not on the queue and its load_weight is not included in rq->raw_weighted_load) and we need to queue it, we say that the maximum time (in all fairness) that p should have to wait to get onto the CPU is: expected_wait = p->avg_cpu_per_cycle * effective_weighted_load / p->load_weight I just realized that this is wrong for the case where p is being woken. In that case the length of the last sleep should be subtracted from the above value and if the result is negative pre-empt straight away. So expected_wait becomes: expected_wait = p->avg_cpu_per_cycle * effective_weighted_load / p->load_weight - p->length_of_last_sleep As the length of the last sleep will be being calculated during the wake process there's no real need for it to be a task field and a local variable could be used instead. For a task being requeued during pre-emption the equation would be: expected_wait = time_just_spent_on_the_cpu * effective_weighted_load / p->load_weight as there was zero sleep since last time on CPU. Using the actual time on the CPU so far this time (instead of the average) will compensate the task for being pre-empted. Calculating p->avg_cpu_per_cycle costs one add, one multiply and one shift right per scheduling cycle of the task. An additional cost is that you need a shift right to get the nanosecond value from the value stored in the task struct. (i.e. the above code is simplified to give the general idea). The average would be number of cycles based rather than time based and (happily) this simplifies the calculations. If you don't like using the average CPU time per cycle, you could just use the length of time on the CPU for the last time the task was on the CPU. This would simplify things a lot. 
I've been thinking about the jerkiness (or lack of smoothness) that I said would occur if smoothed averages weren't used and I've realised that what I was talking about was observed jerkiness in the dynamic priorities of the tasks. As this scheduler has dispensed with dynamic priorities (as far as I can see) the jerkiness probably won't be apparent. BTW Given that I'm right and dynamic priorities have been dispensed with what do you intend exporting (in their place) to user space for display by top and similar? Peter -- Peter Williams [EMAIL PROTECTED] Learning, n. The kind of ignorance distinguishing the studious. -- Ambrose Bierce - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
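The running average itself can be kept with a couple of integer operations in fixed point, which is roughly the cost quoted earlier in the thread (one extra shift to read the nanosecond value back out). The decay shift below is an assumed tunable, not anything from the patch:

        #include <stdint.h>

        #define AVG_DECAY_SHIFT 3       /* assumed: new sample weighted 1/8 */

        struct avg_sketch {
                uint64_t avg_cpu_fixed; /* avg_cpu_per_cycle scaled by 2^AVG_DECAY_SHIFT */
        };

        /* fold one completed scheduling cycle's on-CPU time into the running average */
        static void update_avg_cpu(struct avg_sketch *s, uint64_t on_cpu_ns)
        {
                s->avg_cpu_fixed += on_cpu_ns - (s->avg_cpu_fixed >> AVG_DECAY_SHIFT);
        }

        /* the extra shift mentioned above: unscale to get nanoseconds back */
        static uint64_t avg_cpu_per_cycle_ns(const struct avg_sketch *s)
        {
                return s->avg_cpu_fixed >> AVG_DECAY_SHIFT;
        }

A larger shift gives the average a longer memory, which is the knob referred to elsewhere in the thread as the decay rate controlling how quickly the scheduler reacts to a change in a task's behaviour.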
Re: [patch] CFS scheduler, v3
On Fri, Apr 20, 2007 at 10:10:45AM +1000, Peter Williams wrote: I have a suggestion I'd like to make that addresses both nice and fairness at the same time. As I understand it, the basic principle behind this scheduler is to work out a time by which a task should make it onto the CPU and then place it into an ordered list (based on this value) of tasks waiting for the CPU. I think that this is a great idea and my suggestion is with regard to a method for working out this time that takes into account both fairness and nice. Hmm. Let me take a look... On Fri, Apr 20, 2007 at 10:10:45AM +1000, Peter Williams wrote: First suppose we have the following metrics available in addition to what's already provided. rq->avg_weight_load /* a running average of the weighted load on the CPU */ p->avg_cpu_per_cycle /* the average time in nsecs that p spends on the CPU each scheduling cycle */ I'm suspicious of mean service times not paired with mean inter-arrival times. On Fri, Apr 20, 2007 at 10:10:45AM +1000, Peter Williams wrote: where a scheduling cycle for a task starts when it is placed on the queue after waking or being preempted and ends when it is taken off the CPU either voluntarily or after being preempted. So p->avg_cpu_per_cycle is just the average amount of time p spends on the CPU each time it gets on to the CPU. (Sorry for the long explanation here but I just wanted to make sure there was no chance that "scheduling cycle" would be construed as some mechanism being imposed on the scheduler.) I went and looked up priority queueing queueing theory garbage and re-derived various things I needed. The basics check out. Probably no one cares that I checked. On Fri, Apr 20, 2007 at 10:10:45AM +1000, Peter Williams wrote: We can then define: effective_weighted_load = max(rq->raw_weighted_load, rq->avg_weighted_load) If p is just waking (i.e. it's not on the queue and its load_weight is not included in rq->raw_weighted_load) and we need to queue it, we say that the maximum time (in all fairness) that p should have to wait to get onto the CPU is: expected_wait = p->avg_cpu_per_cycle * effective_weighted_load / p->load_weight This doesn't look right, probably because the scaling factor of p->avg_cpu_per_cycle is the reciprocal of its additive contribution to rq->avg_weight_load as opposed to a direct estimate of its initial delay or waiting time before completing its current requested service. p->load_weight/effective_weighted_load more properly represents an entitlement to CPU bandwidth. p->avg_cpu_per_cycle/(p->load_weight/effective_weighted_load) would be more like the expected time spent on the runqueue (whether waiting to run or actually running) for a time interval spent in a runnable state and the expected time runnable and waiting to run in such an interval would be p->avg_cpu_per_cycle*(effective_weighted_load/p->load_weight - 1). Neither represents the initial delay between entering the runqueue and first acquiring the CPU, but that's a bit hard to figure out without deciding the scheduling policy up-front anyway. This essentially doesn't look correct because while you want to enforce the CPU bandwidth allocation, this doesn't have much to do with that apart from the CPU bandwidth appearing as a term. It's more properly a rate of service as opposed to a time at which anything should happen or a number useful for predicting such. When service should begin more properly depends on the other tasks in the system and a number of other decisions that are part of the scheduling policy. 
If you want to choose a quasi-inter-arrival time to achieve the specified CPU bandwidth allocation, this would be it, but to use that to actually enforce the CPU bandwidth allocation, you would need to take into account the genuine inter-arrival time to choose an actual time for service to begin. In other words, this should be a quota for the task to have waited. If it hasn't waited long enough, then it should be delayed by the difference to achieve the inter-arrival time you're trying to enforce. If it's waited longer, it should be executed sooner modulo other constraints, and perhaps even credited for future scheduling cycles. In order to partially avoid underallocating CPU bandwidth to p, one should track the time last spent sleeping and do the following: last_sleep = now - p->last_went_to_sleep; wait = p->avg_cpu_per_cycle*effective_weighted_load/p->load_weight; wait = wait > last_sleep ? wait - last_sleep : 0; By and large the scheduler can deterministically choose waits on the runqueue but it doesn't actually do that. Some additional corrections for delays beyond those decided up-front while on the runqueue or getting scheduled early on the runqueue may also help, though I'm not as interested in them as I am the one for sleep: last_wait = now - p->last_taken_off_the_cpu; wait =
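That correction renders directly into C. Everything here is a sketch reusing the thread's field names under the same assumptions as the earlier fragment; the clamp is what turns a long sleep into immediate (or at least earlier) service:

        #include <stdint.h>

        struct rq_sketch { uint64_t raw_weighted_load, avg_weighted_load; };

        struct task_sketch {
                uint64_t avg_cpu_per_cycle;     /* avg nsecs on CPU per scheduling cycle */
                uint64_t load_weight;           /* nice-derived weight */
                uint64_t last_went_to_sleep;    /* timestamp, ns */
        };

        static uint64_t wakeup_wait_ns(const struct task_sketch *p,
                                       const struct rq_sketch *rq, uint64_t now)
        {
                uint64_t load = rq->raw_weighted_load > rq->avg_weighted_load ?
                                rq->raw_weighted_load : rq->avg_weighted_load;
                uint64_t wait = p->avg_cpu_per_cycle * load / p->load_weight;
                uint64_t last_sleep = now - p->last_went_to_sleep;

                /* credit the sleep; zero means the task deserves the CPU right away */
                return wait > last_sleep ? wait - last_sleep : 0;
        }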
Re: [patch] CFS scheduler, v3
On Wed, 18 Apr 2007, William Lee Irwin III wrote: Mark the runqueues cacheline_aligned_in_smp to avoid false sharing. False sharing for a per cpu data structure? Are we updating that structure from other processors? - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] CFS scheduler, v3
On Wed, 18 Apr 2007, William Lee Irwin III wrote: Mark the runqueues cacheline_aligned_in_smp to avoid false sharing. On Fri, Apr 20, 2007 at 12:24:17PM -0700, Christoph Lameter wrote: False sharing for a per cpu data structure? Are we updating that structure from other processors? Primarily in the load balancer, but also in wakeups. -- wli - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
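The change being discussed is, in shape, just an alignment annotation on the per-CPU runqueue so that remote writers (the load balancer, cross-CPU wakeups) touch a line no other CPU's data shares. This is an illustrative fragment, not the actual patch; the two macros are the standard kernel ones:

        #include <linux/cache.h>
        #include <linux/percpu.h>

        /* illustrative stand-in for the scheduler's runqueue */
        struct rq {
                unsigned long nr_running;
                unsigned long raw_weighted_load;
                /* ... */
        } ____cacheline_aligned_in_smp;         /* start and pad on a cacheline boundary */

        static DEFINE_PER_CPU(struct rq, runqueues);

Without the alignment, a remote CPU doing a wakeup could dirty a line that also holds a neighbouring CPU's unrelated per-cpu data, which is the false sharing being debated.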
Re: [patch] CFS scheduler, v3
On Fri, 20 Apr 2007, William Lee Irwin III wrote: On Wed, 18 Apr 2007, William Lee Irwin III wrote: Mark the runqueues cacheline_aligned_in_smp to avoid false sharing. On Fri, Apr 20, 2007 at 12:24:17PM -0700, Christoph Lameter wrote: False sharing for a per cpu data structure? Are we updating that structure from other processors? Primarily in the load balancer, but also in wakeups. That is fairly rare I think. What other variables that are also written frequently would cause false sharing? - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] CFS scheduler, v3
On Fri, Apr 20, 2007 at 12:24:17PM -0700, Christoph Lameter wrote: False sharing for a per cpu data structure? Are we updating that structure from other processors? On Fri, 20 Apr 2007, William Lee Irwin III wrote: Primarily in the load balancer, but also in wakeups. On Fri, Apr 20, 2007 at 12:33:13PM -0700, Christoph Lameter wrote: That is fairly rare I think. What other variables that are also written frequently would cause false sharing? I wrote that backward, sorry. Cross-CPU wakeups' frequency depends heavily on the workload. Probably the only other case I can think of is io_schedule() but that's not really significant. I'm not really convinced it's all that worthwhile of an optimization, essentially for the same reasons as you, but presumably there's a benchmark result somewhere that says it matters. I've just not seen it. -- wli - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] CFS scheduler, v3
On Fri, 20 Apr 2007, William Lee Irwin III wrote: I'm not really convinced it's all that worthwhile of an optimization, essentially for the same reasons as you, but presumably there's a benchmark result somewhere that says it matters. I've just not seen it. If it is true that we frequently remotely write the per cpu runqueue data then we may have a NUMA scalability issue. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] CFS scheduler, v3
On Fri, 20 Apr 2007, William Lee Irwin III wrote: I'm not really convinced it's all that worthwhile of an optimization, essentially for the same reasons as you, but presumably there's a benchmark result somewhere that says it matters. I've just not seen it. On Fri, Apr 20, 2007 at 12:44:55PM -0700, Christoph Lameter wrote: If it is true that we frequently remotely write the per cpu runqueue data then we may have a NUMA scalability issue. From the discussion on Suresh's thread, it appears to have sped up a database benchmark 0.5%. Last I checked it was workload-dependent, but there were things that hammer it. I mostly know of the remote wakeup issue, but there could be other things besides wakeups that do it, too. -- wli - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] CFS scheduler, v3
William Lee Irwin III wrote: On Fri, Apr 20, 2007 at 10:10:45AM +1000, Peter Williams wrote: I have a suggestion I'd like to make that addresses both nice and fairness at the same time. As I understand the basic principle behind this scheduler it to work out a time by which a task should make it onto the CPU and then place it into an ordered list (based on this value) of tasks waiting for the CPU. I think that this is a great idea and my suggestion is with regard to a method for working out this time that takes into account both fairness and nice. Hmm. Let me take a look... On Fri, Apr 20, 2007 at 10:10:45AM +1000, Peter Williams wrote: First suppose we have the following metrics available in addition to what's already provided. rq-avg_weight_load /* a running average of the weighted load on the CPU */ p-avg_cpu_per_cycle /* the average time in nsecs that p spends on the CPU each scheduling cycle */ I'm suspicious of mean service times not paired with mean inter-arrival times. On Fri, Apr 20, 2007 at 10:10:45AM +1000, Peter Williams wrote: where a scheduling cycle for a task starts when it is placed on the queue after waking or being preempted and ends when it is taken off the CPU either voluntarily or after being preempted. So p-avg_cpu_per_cycle is just the average amount of time p spends on the CPU each time it gets on to the CPU. Sorry for the long explanation here but I just wanted to make sure there was no chance that scheduling cycle would be construed as some mechanism being imposed on the scheduler.) I went and looked up priority queueing queueing theory garbage and re-derived various things I needed. The basics check out. Probably no one cares that I checked. On Fri, Apr 20, 2007 at 10:10:45AM +1000, Peter Williams wrote: We can then define: effective_weighted_load = max(rq-raw_weighted_load, rq-avg_weighted_load) If p is just waking (i.e. it's not on the queue and its load_weight is not included in rq-raw_weighted_load) and we need to queue it, we say that the maximum time (in all fairness) that p should have to wait to get onto the CPU is: expected_wait = p-avg_cpu_per_cycle * effective_weighted_load / p-load_weight You're right. The time that the task spent sleeping before being woken should be subtracted from this value. If the answer is less than or equal to zero pre-emption should occur. This doesn't look right, probably because the scaling factor of p-avg_cpu_per_cycle is the reciprocal of its additive contribution to the -avg_weight_load as opposed to a direct estimate of its initial delay or waiting time before completing its current requested service. p-load_weight/effective_weighted_load more properly represents an entitlement to CPU bandwidth. Yes. But expected_wait isn't entitlement it's its inverse. p-avg_cpu_per_cycle/(p-load_weight/effective_weighted_load) would be more like the expected time spent on the runqueue When I went to school that would be just another way of expressing the equation that I expressed. (whether waiting to run or actually running) for a time interval spent in a runnable state and the expected time runnable and waiting to run in such an interval would be p-avg_cpu_per_cycle*(1-effective_weighted_load/p-load_weight), Neither represents the initial delay between entering the runqeueue and first acquiring the CPU, but that's a bit hard to figure out without deciding the scheduling policy up-front anyway. 
This essentially doesn't look correct because while you want to enforce the CPU bandwidth allocation, this doesn't have much to do with that apart from the CPU bandwidth appearing as a term. It's more properly a rate of service as opposed to a time at which anything should happen or a number useful for predicting such. When service should begin more properly depends on the other tasks in the system and a number of other decisions that are part of the scheduling policy. This model takes all of those into consideration. The idea is not just to predict but to use the calculated time to decide when to boot the current process off the CPU (if it doesn't leave voluntarily) and put this one on. This more or less removes the need to give each task a predetermined chunk of CPU when they go on to the CPU. This should, in general, reduce the number of context switches as tasks get to run until they've finished what they're doing or another task becomes higher priority rather than being booted off after an arbitrary time interval. (If this ever gets tried it will be interesting to see if this prediction comes true.) BTW Even if Ingo doesn't choose to try this model, I'll probably make a patch (way in the future after Ingo's changes are settled) to try it out myself. If you want to choose a quasi-inter-arrival time to achieve the specified CPU bandwidth allocation, this would be it, but to use that to actually enforce the CPU bandwidth allocation, you would need to take into
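The 'boot the current process off the CPU' rule above reduces to a comparison of two timestamps plus, if desired, a timer armed for the moment the newcomer's expected on-CPU time arrives. A sketch with hypothetical names, not anyone's actual patch:

        #include <stdint.h>

        /*
         * now + newcomer_wait_ns is when the freshly queued task expects the CPU;
         * curr_expected_end_ns is when the running task was expected to leave it.
         */
        static int should_arm_preempt_timer(uint64_t now, uint64_t newcomer_wait_ns,
                                            uint64_t curr_expected_end_ns,
                                            uint64_t *deadline_ns)
        {
                uint64_t due = now + newcomer_wait_ns;

                if (due >= curr_expected_end_ns)
                        return 0;       /* current task should be gone by then anyway */
                *deadline_ns = due;     /* preempt when the newcomer's on-CPU time arrives */
                return 1;
        }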
Re: [patch] CFS scheduler, v3
William Lee Irwin III wrote: This essentially doesn't look correct because while you want to enforce the CPU bandwidth allocation, this doesn't have much to do with that apart from the CPU bandwidth appearing as a term. It's more properly a rate of service as opposed to a time at which anything should happen or a number useful for predicting such. When service should begin more properly depends on the other tasks in the system and a number of other decisions that are part of the scheduling policy. On Sat, Apr 21, 2007 at 10:23:07AM +1000, Peter Williams wrote: This model takes all of those into consideration. The idea is not just to predict but to use the calculated time to decide when to boot the current process of the CPU (if it doesn't leave voluntarily) and put this one on. This more or less removes the need to give each task a predetermined chunk of CPU when they go on to the CPU. This should, in general, reduce the number context switches as tasks get to run until they've finished what they're doing or another task becomes higher priority rather than being booted off after an arbitrary time interval. (If this ever gets tried it will be interesting to see if this prediction comes true.) BTW Even if Ingo doesn't choose to try this model, I'll probably make a patch (way in the future after Ingo's changes are settled) to try it out myself. I think I smoked out what you were doing. William Lee Irwin III wrote: If you want to choose a quasi-inter-arrival time to achieve the specified CPU bandwidth allocation, this would be it, but to use that to actually enforce the CPU bandwidth allocation, you would need to take into account the genuine inter-arrival time to choose an actual time for service to begin. In other words, this should be a quota for the task to have waited. If it's not waited long enough, then it should be delayed by the difference to achieve the inter-arrival time you're trying to enforce. If it's waited longer, it should be executed sooner modulo other constraints, and perhaps even credited for future scheduling cycles. On Sat, Apr 21, 2007 at 10:23:07AM +1000, Peter Williams wrote: The idea isn't to enforce the bandwidth entitlement to the extent of throttling tasks if they exceed their entitlement and there's no other tasks ready to use the CPU. This is mainly because the bandwidth entitlement isn't fixed -- it's changing constantly as the number and type of runnable tasks changes. Well, a little hysteresis will end up throttling in such a manner anyway as a side-effect, or you'll get anomalies. Say two tasks with equal entitlements compete, where one sleeps for 1/3 of the time and the other is fully CPU-bound. If only the times when they're in direct competition are split 50/50, then the CPU-bound task gets 2/3 and the sleeper 1/3, which is not the intended effect. I don't believe this model will be very vulnerable to it, though. William Lee Irwin III wrote: In order to partially avoid underallocating CPU bandwidth to p, one should track the time last spent sleeping and do the following: On Sat, Apr 21, 2007 at 10:23:07AM +1000, Peter Williams wrote: Yes I made a mistake in omitting to take into account sleep interval. See another e-mail to Ingo correcting this problem. I took it to be less trivial of an error than it was. No big deal. 
William Lee Irwin III wrote: In order to do better, longer-term history is required, On Sat, Apr 21, 2007 at 10:23:07AM +1000, Peter Williams wrote: The half life of the Kalman filter (roughly equivalent to a running average) used to calculate the averages determines how much history is taken into account. It could be made configurable (at least, until enough real life experience was available to decide on the best value to use). A Kalman filter would do better than a running average. I'm all for it. William Lee Irwin III wrote: To attempt to maintain an infinite history of bandwidth underutilization to be credited too far in the future would enable potentially long-term overutilization when such credits are cashed en masse for a sustained period of time. At some point you have to say use it or lose it; over a shorter period of time some smoothing is still admissible and even desirable. On Sat, Apr 21, 2007 at 10:23:07AM +1000, Peter Williams wrote: Yes, that's why I suggest a running average over the last few scheduling cycles for the task. But thinking about it some more I'm now not so sure. The lack of apparent smoothness when I've done this sort of thing with raw rather than smooth data (in modifications to the current dynamic priority based scheduler model) is usually noticed by running top and seeing wildly fluctuating dynamic priorities. I'm not sure that the actual responsiveness of the system reflects this. So I'm now willing to reserve my judgement on this issue. I'm thinking smoothing should be over a relatively short period of time,
Re: [patch] CFS scheduler, v3
William Lee Irwin III wrote: William Lee Irwin III wrote: This essentially doesn't look correct because while you want to enforce the CPU bandwidth allocation, this doesn't have much to do with that apart from the CPU bandwidth appearing as a term. It's more properly a rate of service as opposed to a time at which anything should happen or a number useful for predicting such. When service should begin more properly depends on the other tasks in the system and a number of other decisions that are part of the scheduling policy. On Sat, Apr 21, 2007 at 10:23:07AM +1000, Peter Williams wrote: This model takes all of those into consideration. The idea is not just to predict but to use the calculated time to decide when to boot the current process of the CPU (if it doesn't leave voluntarily) and put this one on. This more or less removes the need to give each task a predetermined chunk of CPU when they go on to the CPU. This should, in general, reduce the number context switches as tasks get to run until they've finished what they're doing or another task becomes higher priority rather than being booted off after an arbitrary time interval. (If this ever gets tried it will be interesting to see if this prediction comes true.) BTW Even if Ingo doesn't choose to try this model, I'll probably make a patch (way in the future after Ingo's changes are settled) to try it out myself. I think I smoked out what you were doing. William Lee Irwin III wrote: If you want to choose a quasi-inter-arrival time to achieve the specified CPU bandwidth allocation, this would be it, but to use that to actually enforce the CPU bandwidth allocation, you would need to take into account the genuine inter-arrival time to choose an actual time for service to begin. In other words, this should be a quota for the task to have waited. If it's not waited long enough, then it should be delayed by the difference to achieve the inter-arrival time you're trying to enforce. If it's waited longer, it should be executed sooner modulo other constraints, and perhaps even credited for future scheduling cycles. On Sat, Apr 21, 2007 at 10:23:07AM +1000, Peter Williams wrote: The idea isn't to enforce the bandwidth entitlement to the extent of throttling tasks if they exceed their entitlement and there's no other tasks ready to use the CPU. This is mainly because the bandwidth entitlement isn't fixed -- it's changing constantly as the number and type of runnable tasks changes. Well, a little hysteresis will end up throttling in such a manner anyway as a side-effect, Think of this as a calming influence :-) or you'll get anomalies. Say two tasks with equal entitlements compete, where one sleeps for 1/3 of the time and the other is fully CPU-bound. If only the times when they're in direct competition are split 50/50, then the CPU-bound task gets 2/3 and the sleeper 1/3, which is not the intended effect. I don't believe this model will be very vulnerable to it, though. Nor me. William Lee Irwin III wrote: In order to partially avoid underallocating CPU bandwidth to p, one should track the time last spent sleeping and do the following: On Sat, Apr 21, 2007 at 10:23:07AM +1000, Peter Williams wrote: Yes I made a mistake in omitting to take into account sleep interval. See another e-mail to Ingo correcting this problem. I took it to be less trivial of an error than it was. No big deal. No, you were right it was definitely a NON trivial error. 
William Lee Irwin III wrote: In order to do better, longer-term history is required, On Sat, Apr 21, 2007 at 10:23:07AM +1000, Peter Williams wrote: The half life of the Kalman filter (roughly equivalent to a running average) used to calculate the averages determines how much history is taken into account. It could be made configurable (at least, until enough real life experience was available to decide on the best value to use). A Kalman filter would do better than a running average. I'm all for it. As a long time user of Kalman filters I tend to think of them as the same thing. I use the term running average when talking about the idea behind a scheduler because I think that more people will understand what the general idea is. When it comes to implementation I always replace the idea of running average with a roughly equivalent Kalman filter. William Lee Irwin III wrote: To attempt to maintain an infinite history of bandwidth underutilization to be credited too far in the future would enable potentially long-term overutilization when such credits are cashed en masse for a sustained period of time. At some point you have to say use it or lose it; over a shorter period of time some smoothing is still admissible and even desirable. On Sat, Apr 21, 2007 at 10:23:07AM +1000, Peter Williams wrote: Yes, that's why I suggest a running average over the last few scheduling cycles for the task. But thinking about it some more I'm now not so sure.
Re: [patch] CFS scheduler, v3
On Fri, Apr 20, 2007 at 10:10:45AM +1000, Peter Williams wrote:
> Ingo Molnar wrote:
> >
> >  - bugfix: use constant offset factor for nice levels instead of
> >    sched_granularity_ns. Thus nice levels work even if someone sets
> >    sched_granularity_ns to 0. NOTE: nice support is still naive, i'll
> >    address the many nice level related suggestions in -v4.
>
> I have a suggestion I'd like to make that addresses both nice and fairness at the same time. As I understand it, the basic principle behind this scheduler is to work out a time by which a task should make it onto the CPU and then place it into an ordered list (based on this value) of tasks waiting for the CPU. I think that this is a great idea and my suggestion is with regard to a method for working out this time that takes into account both fairness and nice.
>
> First suppose we have the following metrics available in addition to what's already provided.
>
> rq->avg_weighted_load /* a running average of the weighted load on the CPU */
> p->avg_cpu_per_cycle /* the average time in nsecs that p spends on the CPU each scheduling cycle */
>
> where a scheduling cycle for a task starts when it is placed on the queue after waking or being preempted and ends when it is taken off the CPU either voluntarily or after being preempted. So p->avg_cpu_per_cycle is just the average amount of time p spends on the CPU each time it gets on to the CPU. (Sorry for the long explanation here but I just wanted to make sure there was no chance that "scheduling cycle" would be construed as some mechanism being imposed on the scheduler.)
>
> We can then define:
>
> effective_weighted_load = max(rq->raw_weighted_load, rq->avg_weighted_load)
>
> If p is just waking (i.e. it's not on the queue and its load_weight is not included in rq->raw_weighted_load) and we need to queue it, we say that the maximum time (in all fairness) that p should have to wait to get onto the CPU is:
>
> expected_wait = p->avg_cpu_per_cycle * effective_weighted_load / p->load_weight
>
> Calculating p->avg_cpu_per_cycle costs one add, one multiply and one shift right per scheduling cycle of the task. An additional cost is that you need a shift right to get the nanosecond value from the value stored in the task struct (i.e. the above code is simplified to give the general idea). The average would be based on the number of cycles rather than on time and (happily) this simplifies the calculations.
>
> If the expected time to get onto the CPU (i.e. expected_wait plus the current time) for p is earlier than the equivalent time for the currently running task then preemption of that task would be justified.

I 100% agree with this method because I came to nearly the same conclusion on paper about 1 year ago. What I'd like to add is that the expected wake up time is not the most precise criterion for fairness. The expected completion time is better. When you have one task t1 which is expected to run for T1 nanosecs and another task t2 which is expected to run for T2, what is important to the user for fairness is when each task completes its work. If t1 should wake up at time W1 and t2 at W2, then the list should be ordered by comparing W1+T1 and W2+T2.

What I like about this method is that it remains fair with nice tasks, because in order to renice a task tN you just have to change TN, and if it has to run shorter, it can be executed before CPU hogs and stay there for a very short time.
Also, I found that if we want to respect interactivity, we must maintain a credit for each task. It is a bounded amount of CPU time left to be used. When the task t3 has the right to use T3 nsecs, and wakes up at W3, if it does not spend T3 nsecs on the CPU but only N3 < T3, then we have a credit C3 computed like this:

C3 = MIN(MAX_CREDIT, C3 + T3 - N3)

And if a CPU hog uses more than its assigned time slice due to scheduler resolution, then C3 can become negative (and bounded too):

C3 = MAX(MIN_CREDIT, C3 + T3 - N3)

Now how is the credit used? Simple: the credit is part of a timeslice, so it's systematically added to the computed timeslice when ordering the tasks. So we indeed order tasks t1 and t2 by W1+T1+C1 and W2+T2+C2. It means that an interactive task which has not eaten all of its timeslice will accumulate time credit and have a good chance of being able to run before others if it wakes up again, and even use slightly more CPU time than others if it has not used it before. Conversely, if a task eats too much CPU time, it will be punished and wait longer than others, and run less, to compensate for the advance it has taken.

Also, what I like about this method is that it can correctly handle the fork/exit storms which quickly change the per-task allocated CPU time. Upon fork() or exit(), it should not be too hard to readjust Tx and Cx for each task and reorder them according to their new completion time. I've not found a way to include a variable nr_running in the tree and order the tasks
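[A small sketch of the credit bookkeeping described above, under the interpretation that the credit is clamped into [MIN_CREDIT, MAX_CREDIT]. The function names and the clamp bounds are illustrative assumptions, not code from any posted patch.]

    /*
     * Sketch: per-task credit.  A task that used n_ns of its allotted
     * t_ns keeps the difference as credit, clamped to the configured
     * bounds; the credit is then folded into the expected completion
     * time used to order runnable tasks (W + T + C).
     */
    #define MAX_CREDIT	 2000000LL	/* 2ms, arbitrary for the example */
    #define MIN_CREDIT	-2000000LL

    static long long update_credit(long long credit, long long t_ns, long long n_ns)
    {
    	credit += t_ns - n_ns;		/* unused time earns credit, overuse costs it */
    	if (credit > MAX_CREDIT)
    		credit = MAX_CREDIT;
    	if (credit < MIN_CREDIT)
    		credit = MIN_CREDIT;
    	return credit;
    }

    /* key used to order runnable tasks: earlier expected completion runs first */
    static long long completion_key(long long wakeup_ns, long long slice_ns,
    				    long long credit_ns)
    {
    	return wakeup_ns + slice_ns + credit_ns;
    }

[Two runnable tasks t1 and t2 would then be kept sorted by completion_key(W1, T1, C1) versus completion_key(W2, T2, C2), matching the W+T+C ordering described above.]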
Re: [patch] CFS scheduler, v3
Ingo Molnar wrote:
 - bugfix: use constant offset factor for nice levels instead of
   sched_granularity_ns. Thus nice levels work even if someone sets
   sched_granularity_ns to 0. NOTE: nice support is still naive, i'll
   address the many nice level related suggestions in -v4.

I have a suggestion I'd like to make that addresses both nice and fairness at the same time. As I understand it, the basic principle behind this scheduler is to work out a time by which a task should make it onto the CPU and then place it into an ordered list (based on this value) of tasks waiting for the CPU. I think that this is a great idea and my suggestion is with regard to a method for working out this time that takes into account both fairness and nice.

First suppose we have the following metrics available in addition to what's already provided.

rq->avg_weighted_load /* a running average of the weighted load on the CPU */
p->avg_cpu_per_cycle /* the average time in nsecs that p spends on the CPU each scheduling cycle */

where a scheduling cycle for a task starts when it is placed on the queue after waking or being preempted and ends when it is taken off the CPU either voluntarily or after being preempted. So p->avg_cpu_per_cycle is just the average amount of time p spends on the CPU each time it gets on to the CPU. (Sorry for the long explanation here but I just wanted to make sure there was no chance that "scheduling cycle" would be construed as some mechanism being imposed on the scheduler.)

We can then define:

effective_weighted_load = max(rq->raw_weighted_load, rq->avg_weighted_load)

If p is just waking (i.e. it's not on the queue and its load_weight is not included in rq->raw_weighted_load) and we need to queue it, we say that the maximum time (in all fairness) that p should have to wait to get onto the CPU is:

expected_wait = p->avg_cpu_per_cycle * effective_weighted_load / p->load_weight

Calculating p->avg_cpu_per_cycle costs one add, one multiply and one shift right per scheduling cycle of the task. An additional cost is that you need a shift right to get the nanosecond value from the value stored in the task struct (i.e. the above code is simplified to give the general idea). The average would be based on the number of cycles rather than on time and (happily) this simplifies the calculations.

If the expected time to get onto the CPU (i.e. expected_wait plus the current time) for p is earlier than the equivalent time for the currently running task then preemption of that task would be justified.

I appreciate that the notion of basing the expected wait on the task's average cpu use per scheduling cycle is counter intuitive but I believe that (if you think about it) you'll see that it actually makes sense.

Peter

PS Some reordering of calculation order within the expressions might be in order to keep them within the range of 32 bit arithmetic and so avoid 64 bit arithmetic on 32 bit machines.
--
Peter Williams [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
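[A rough sketch of the arithmetic proposed above, for readers who prefer code. The field names follow the suggestion in the post; the helper itself is hypothetical, the averages are assumed to be maintained elsewhere, and no attempt is made to keep the intermediate product within 32 bits as the PS recommends.]

    /*
     * Sketch: the suggested expected-wait calculation.  The runqueue-wide
     * load averages and p's avg_cpu_per_cycle are passed in as plain
     * numbers here; in a real patch they would live in struct rq and
     * struct task_struct respectively.
     */
    static unsigned long long expected_wait_ns(unsigned long long avg_cpu_per_cycle_ns,
    					       unsigned long load_weight,
    					       unsigned long raw_weighted_load,
    					       unsigned long avg_weighted_load)
    {
    	unsigned long effective_weighted_load =
    		raw_weighted_load > avg_weighted_load ?
    		raw_weighted_load : avg_weighted_load;

    	/* p's fair share of the queueing delay: the heavier the competing
    	 * load relative to p's own weight, the longer p may have to wait */
    	return avg_cpu_per_cycle_ns * effective_weighted_load / load_weight;
    }

[A newly woken task p would then be queued with a deadline of now + expected_wait_ns(...), and preemption of the running task would be justified when p's deadline is earlier than the equivalent deadline already recorded for the running task.]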
Re: [patch] CFS scheduler, v3
* William Lee Irwin III <[EMAIL PROTECTED]> wrote:

> It appears to me that the following can be taken in for mainline (or
> rejected for mainline) independently of the rest of the cfs patch.

yeah - it's a patch written by Suresh, and this should already be in the for-v2.6.22 -mm queue. See:

  Subject: [patch] sched: align rq to cacheline boundary

on lkml.

	Ingo
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] CFS scheduler, v3
On Wed, Apr 18, 2007 at 07:50:17PM +0200, Ingo Molnar wrote:
> this is the third release of the CFS patchset (against v2.6.21-rc7), and
> can be downloaded from:
>    http://redhat.com/~mingo/cfs-scheduler/
> this is a pure "fix reported regressions" release so there's much less
> churn:
>    5 files changed, 71 insertions(+), 29 deletions(-)
> (the added lines are mostly debug related, not genuine increase in the
> scheduler's size)

It appears to me that the following can be taken in for mainline (or rejected for mainline) independently of the rest of the cfs patch.

-- wli

Mark the runqueues ____cacheline_aligned_in_smp to avoid false sharing.

Index: sched/kernel/sched.c
===================================================================
--- sched.orig/kernel/sched.c	2007-04-18 14:10:03.593207728 -0700
+++ sched/kernel/sched.c	2007-04-18 14:11:39.270660075 -0700
@@ -278,7 +278,7 @@
 	struct lock_class_key rq_lock_key;
 };
 
-static DEFINE_PER_CPU(struct rq, runqueues);
+static DEFINE_PER_CPU(struct rq, runqueues) ____cacheline_aligned_in_smp;
 
 static inline int cpu_of(struct rq *rq)
 {
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
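[For readers unfamiliar with the annotation, the effect of the patch is simply to pad and align each CPU's runqueue so that two CPUs' runqueues never share a cache line. The same idiom applies to any hot per-CPU structure; in the sketch below the struct, its fields and the variable name are made up, and only ____cacheline_aligned_in_smp and DEFINE_PER_CPU are real kernel interfaces.]

    #include <linux/cache.h>
    #include <linux/percpu.h>

    /* Hypothetical per-CPU structure that other CPUs occasionally read. */
    struct hot_stats {
    	unsigned long local_events;	/* written by the owning CPU */
    	unsigned long remote_reads;	/* read by other CPUs now and then */
    } ____cacheline_aligned_in_smp;	/* pad/align so neighbours never share a line */

    static DEFINE_PER_CPU(struct hot_stats, hot_stats);

[Without the alignment, adjacent per-CPU instances can land on one cache line, so a writer on one CPU keeps invalidating the line under a reader on another CPU, which is exactly the false sharing the runqueue patch avoids.]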