Re: [patch] CFS scheduler, v3

2007-04-24 Thread Siddha, Suresh B
On Tue, Apr 24, 2007 at 10:55:45AM -0700, Christoph Lameter wrote:
> On Tue, 24 Apr 2007, Siddha, Suresh B wrote:
> > yes, we were planning to move this to a different percpu section, where
> > all the elements in this new section will be cacheline aligned (both
> > at the start, as well as at the end).
> 
> I would not call this a per cpu area. It is used by multiple cpus, it 
> seems. 

Not decided on the name yet, but the area is allocated per cpu and yes,
the data can be referenced by other cpus.

> But for 0.5%? On what benchmark? Is it really worth it?

famous database workload :)

I don't think the new section will be added just for this 0.5%. This is a
straightforward optimization, and anyone can plug into this new section in
the future.

thanks,
suresh


Re: [patch] CFS scheduler, v3

2007-04-24 Thread Christoph Lameter
On Tue, 24 Apr 2007, Siddha, Suresh B wrote:

> On Tue, Apr 24, 2007 at 10:47:45AM -0700, Christoph Lameter wrote:
> > On Tue, 24 Apr 2007, Siddha, Suresh B wrote:
> > > Anyhow, this is a straightforward optimization and needs to be done. Do you
> > > have any specific concerns?
> > 
> > Yes, there should not be contention on per cpu data in principle. The 
> > point of per cpu data is for the cpu to have access to contention-free 
> > cachelines.
> > 
> > If the data is contended then it should be moved out of per cpu data and 
> > properly placed to minimize contention. Otherwise we will get into 
> > cacheline aliases (__read_mostly in per cpu??) etc. in the per cpu areas.
> 
> yes, we were planning to move this to a different percpu section, where
> all the elements in this new section will be cacheline aligned (both
> at the start, as well as at the end).

I would not call this a per cpu area. It is used by multiple cpus, it 
seems. But for 0.5%? On what benchmark? Is it really worth it?



Re: [patch] CFS scheduler, v3

2007-04-24 Thread Siddha, Suresh B
On Tue, Apr 24, 2007 at 10:47:45AM -0700, Christoph Lameter wrote:
> On Tue, 24 Apr 2007, Siddha, Suresh B wrote:
> > Anyhow, this is a straightforward optimization and needs to be done. Do you
> > have any specific concerns?
> 
> Yes, there should not be contention on per cpu data in principle. The 
> point of per cpu data is for the cpu to have access to contention-free 
> cachelines.
> 
> If the data is contended then it should be moved out of per cpu data and 
> properly placed to minimize contention. Otherwise we will get into 
> cacheline aliases (__read_mostly in per cpu??) etc. in the per cpu areas.

yes, we were planning to move this to a different percpu section, where
all the elements in this new section will be cacheline aligned (both
at the start, as well as at the end).

thanks,
suresh
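
[Editor's sketch of the separately aligned percpu section Suresh describes
above. The macro name and section name below are illustrative assumptions,
not the patch under discussion; start alignment comes from
____cacheline_aligned_in_smp, and end padding relies on grouping only
cacheline-aligned objects in the dedicated section.]

#include <linux/cache.h>

/* Place the variable in its own percpu section and align it so that
 * data legitimately written by other cpus (e.g. on remote wakeups)
 * does not share a cacheline with ordinary percpu variables. */
#define DEFINE_PER_CPU_SHARED_ALIGNED(type, name)                       \
        __attribute__((__section__(".data.percpu.shared_aligned")))     \
        __typeof__(type) per_cpu__##name                                 \
        ____cacheline_aligned_in_smp

/* usage: DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues); */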


Re: [patch] CFS scheduler, v3

2007-04-24 Thread Christoph Lameter
On Tue, 24 Apr 2007, Siddha, Suresh B wrote:

> > .5% is usually in the noise ratio. Are you consistently seeing an 
> > improvement or is that sporadic?
> 
> No, this is consistent. I am waiting for the perf data on a much, much
> bigger NUMA box.
> 
> Anyhow, this is a straightforward optimization and needs to be done. Do you
> have any specific concerns?

Yes, there should not be contention on per cpu data in principle. The 
point of per cpu data is for the cpu to have access to contention-free 
cachelines.

If the data is contended then it should be moved out of per cpu data and 
properly placed to minimize contention. Otherwise we will get into 
cacheline aliases (__read_mostly in per cpu??) etc. in the per cpu areas.



Re: [patch] CFS scheduler, v3

2007-04-24 Thread Siddha, Suresh B
On Tue, Apr 24, 2007 at 10:39:48AM -0700, Christoph Lameter wrote:
> On Fri, 20 Apr 2007, Siddha, Suresh B wrote:
> 
> > > Last I checked it was workload-dependent, but there were things that
> > > hammer it. I mostly know of the remote wakeup issue, but there could
> > > be other things besides wakeups that do it, too.
> > 
> > Remote wakeup was the main issue and the 0.5% improvement was seen
> > on a two-node platform. Aligning it reduces the number of remote
> > cachelines that need to be touched as part of this wakeup.
> 
> .5% is usually in the noise ratio. Are you consistently seeing an 
> improvement or is that sporadic?

No, this is consistent. I am waiting for the perf data on a much, much
bigger NUMA box.

Anyhow, this is a straightforward optimization and needs to be done. Do you
have any specific concerns?

thanks,
suresh


Re: [patch] CFS scheduler, v3

2007-04-24 Thread Christoph Lameter
On Fri, 20 Apr 2007, Siddha, Suresh B wrote:

> > Last I checked it was workload-dependent, but there were things that
> > hammer it. I mostly know of the remote wakeup issue, but there could
> > be other things besides wakeups that do it, too.
> 
> Remote wakeup was the main issue and the 0.5% improvement was seen
> on a two-node platform. Aligning it reduces the number of remote
> cachelines that need to be touched as part of this wakeup.

.5% is usually in the noise ratio. Are you consistently seeing an 
improvement or is that sporadic?



Re: [patch] CFS scheduler, v3

2007-04-24 Thread Siddha, Suresh B
On Fri, Apr 20, 2007 at 12:24:17PM -0700, Christoph Lameter wrote:
> On Wed, 18 Apr 2007, William Lee Irwin III wrote:
> 
> > 
> > Mark the runqueues cacheline_aligned_in_smp to avoid false sharing.
> 
> False sharing for a per cpu data structure? Are we updating that 
> structure from other processors?

Yes, during a remote process wakeup.
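
[Editor's sketch of what the quoted "cacheline_aligned_in_smp" line refers
to: padding the per-CPU runqueue so that a remote CPU doing a wakeup dirties
only that runqueue's cachelines, not neighbouring per-cpu data. The fields
shown are placeholders, not the real struct rq.]

#include <linux/cache.h>
#include <linux/percpu.h>
#include <linux/spinlock.h>

struct rq {
        spinlock_t      lock;           /* taken by a remote cpu on wakeup */
        unsigned long   nr_running;
        /* ... runqueue fields ... */
} ____cacheline_aligned_in_smp;         /* starts on its own cacheline */

static DEFINE_PER_CPU(struct rq, runqueues);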


Re: [patch] CFS scheduler, v3

2007-04-24 Thread Siddha, Suresh B
On Fri, Apr 20, 2007 at 01:03:22PM -0700, William Lee Irwin III wrote:
> On Fri, 20 Apr 2007, William Lee Irwin III wrote:
> >> I'm not really convinced it's all that worthwhile of an optimization,
> >> essentially for the same reasons as you, but presumably there's a
> >> benchmark result somewhere that says it matters. I've just not seen it.
> 
> On Fri, Apr 20, 2007 at 12:44:55PM -0700, Christoph Lameter wrote:
> > If it is true that we frequently remotely write the per cpu runqueue 
> > data then we may have a NUMA scalability issue.
> 
> From the discussion on Suresh's thread, it appears to have sped up a
> database benchmark 0.5%.
> 
> Last I checked it was workload-dependent, but there were things that
> hammer it. I mostly know of the remote wakeup issue, but there could
> be other things besides wakeups that do it, too.

Remote wakeup was the main issue and the 0.5% improvement was seen
on a two-node platform. Aligning it reduces the number of remote
cachelines that need to be touched as part of this wakeup.

thanks,
suresh


Re: [patch] CFS scheduler, v3

2007-04-21 Thread William Lee Irwin III
* William Lee Irwin III <[EMAIL PROTECTED]> wrote:
>> Suppose a table of nice weights like the following is tuned via 
>> /proc/:
>> -20  21   0  1
>>  -1  2   19  0.0476
>> Essentially 1/(n+1) when n >= 0 and 1-n when n < 0.

On Sat, Apr 21, 2007 at 10:57:29AM +0200, Ingo Molnar wrote:
> ok, thanks for thinking about it. I have changed the nice weight in 
> CFSv5-to-be so that it defaults to something pretty close to your 
> suggestion: the ratio between a nice 0 loop and a nice 19 loop is now 
> set to about 2%. (This is something that users have requested for some 
> time; the default ~5% is a tad high when running reniced SETI jobs, etc.)

Okay. Maybe what I suggested is too weak vs. too strong. I didn't
actually have it in mind as a proposal for general use, but maybe it is
good for such. I had more in mind tunability in general, but it's all
good. I'd think some curve gentler in intermediate nice levels and
stronger at the tails might be better.


On Sat, Apr 21, 2007 at 10:57:29AM +0200, Ingo Molnar wrote:
> the actual percentage scales almost directly with the nice offset 
> granularity value, but if this should be exposed to users at all, i 
> agree that it would be better to directly expose this as some sort of 
> 'ratio between nice 0 and nice 19 tasks', right? Or some other, more 
> fine-grained metric. Percentile is too coarse i think, and using 0.1% 
> units isn't intuitive enough i think. The sysctl handler would then 
> transform that 'human readable' sysctl value into the appropriate 
> internal nice-offset-granularity value (or whatever mechanism the 
> implementation ends up using).

I vaguely liked specifying the full table, but maybe it's too much
for a real user interface.

4-digit or 5-digit fixed point decimal sounds reasonable.


On Sat, Apr 21, 2007 at 10:57:29AM +0200, Ingo Molnar wrote:
> I'd not do this as a per-nice-level thing but as a single value that 
> rescales the whole nice level range at once. That's a lot less easy to 
> misconfigure and we've got enough nice levels for users to pick from 
> almost arbitrarily, as long as they have the ability to influence the 
> max.
> does this sound mostly OK to you?

For the most part, yes. I've been mostly looking at how effectively
the prioritization algorithms work. I'll be wrapping up writing a
testcase to measure all this soon. The basic idea is to take the
weights as inputs somehow and then check to see that they're honored.

What's appropriate for end-users is a very different thing from what
might be appropriate for me. I won't have trouble fiddling with the
code, so please do design around what the best interface for end-users
might be.


-- wli


Re: [patch] CFS scheduler, v3

2007-04-21 Thread Peter Williams

Peter Williams wrote:

Ingo Molnar wrote:

* Peter Williams <[EMAIL PROTECTED]> wrote:

I retract this suggestion as it's a very bad idea.  It introduces the 
possibility of starvation via the poor sods at the bottom of the 
queue having their "on CPU" forever postponed and we all know that 
even the smallest possibility of starvation will eventually cause 
problems.


I think there should be a rule: Once a task is on the queue its "on 
CPU" time is immutable.


Yeah, fully agreed. Currently i'm using the simple method of 
p->nice_offset, which plainly just moves the per nice level areas of 
the tree far enough apart (by a constant offset) so that lower nice 
levels rarely interact with higher nice levels. Lower nice levels 
never truly starve because rq->fair_clock increases deterministically 
and currently the fair_key values are indeed 'immutable' as you suggest.


In practice they can starve a bit when one renices thousands of tasks, 
so i was thinking about the following special-case: to at least make 
them easily killable: if a nice 0 task sends a SIGKILL to a nice 19 
task then we could 'share' its p->wait_runtime with that nice 19 task 
and copy the signal sender's nice_offset. This would in essence pass 
the right to execute over to the killed task, so that it can tear 
itself down.


This cannot be used to gain an 'unfair advantage' because the signal 
sender spends its own 'right to execute on the CPU', and because the 
target task cannot execute any user code anymore when it gets a SIGKILL.


In any case, it is clear that rq->raw_cpu_load should be used instead 
of rq->nr_running, when calculating the fair clock, but i begin to 
like the nice_offset solution too in addition of this: it's effective 
in practice and starvation-free in theory, and most importantly, it's 
very simple. We could even make the nice offset granularity tunable, 
just in case anyone wants to weaken (or strengthen) the effectivity of 
nice levels. What do you think, can you see any obvious (or less 
obvious) showstoppers with this approach?


I haven't had a close look at it but from the above description it 
sounds an order of magnitude more complex than I thought it would be. 
The idea of different nice levels sounds like a recipe for starvation to 
me (if it works the way it sounds like it works).


I guess I'll have to spend more time reading the code because I don't 
seem to be able to make sense of the above description in any way that 
doesn't say "starvation here we come".


I'm finding it hard to figure out the underlying principle behind the 
way you're queuing things by reading the code (that's the trouble with 
reading the code: it just tells you what's being done, not why -- and 
sometimes it's even hard to figure out what's being done when there's a 
lot of indirection).  sched-design-CFS.txt isn't much help in this area 
either.


Any chance of a brief description of how it's supposed to work?

Key questions are:

How do you decide the key value for a task's position in the queue?

Is it an absolute time or an offset from the current time?

How do you decide when to boot the current task off the queue?  Both at 
wake-up of another task and in general play?


Peter
PS I think that you're trying to do too much in one patch.
--
Peter Williams   [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce
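
[Editor's sketch, offered as a partial answer to the questions above and
based only on how this thread describes CFS v3: runnable tasks are keyed by
fair_key in an rbtree and the leftmost task runs next. The type and function
names are illustrative, not the actual scheduler code.]

#include <linux/rbtree.h>
#include <linux/types.h>

struct demo_task {
        u64             fair_key;       /* smaller key => runs earlier */
        struct rb_node  run_node;
};

/* Insert a task into the queue, ordered by its (immutable) fair_key. */
static void demo_enqueue(struct rb_root *root, struct demo_task *p)
{
        struct rb_node **link = &root->rb_node, *parent = NULL;

        while (*link) {
                struct demo_task *entry;

                parent = *link;
                entry = rb_entry(parent, struct demo_task, run_node);
                if (p->fair_key < entry->fair_key)
                        link = &parent->rb_left;
                else
                        link = &parent->rb_right;
        }
        rb_link_node(&p->run_node, parent, link);
        rb_insert_color(&p->run_node, root);
}

/* The next task to run is simply the leftmost node of the tree. */
static struct demo_task *demo_pick_next(struct rb_root *root)
{
        struct rb_node *left = rb_first(root);

        return left ? rb_entry(left, struct demo_task, run_node) : NULL;
}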


Re: [patch] CFS scheduler, v3

2007-04-21 Thread Peter Williams

Ingo Molnar wrote:

* Peter Williams <[EMAIL PROTECTED]> wrote:

I retract this suggestion as it's a very bad idea.  It introduces the 
possibility of starvation via the poor sods at the bottom of the queue 
having their "on CPU" forever postponed and we all know that even the 
smallest possibility of starvation will eventually cause problems.


I think there should be a rule: Once a task is on the queue its "on 
CPU" time is immutable.


Yeah, fully agreed. Currently i'm using the simple method of 
p->nice_offset, which plainly just moves the per nice level areas of the 
tree far enough apart (by a constant offset) so that lower nice levels 
rarely interact with higher nice levels. Lower nice levels never truly 
starve because rq->fair_clock increases deterministically and currently 
the fair_key values are indeed 'immutable' as you suggest.


In practice they can starve a bit when one renices thousands of tasks, 
so i was thinking about the following special-case: to at least make 
them easily killable: if a nice 0 task sends a SIGKILL to a nice 19 task 
then we could 'share' its p->wait_runtime with that nice 19 task and 
copy the signal sender's nice_offset. This would in essence pass the 
right to execute over to the killed task, so that it can tear itself 
down.


This cannot be used to gain an 'unfair advantage' because the signal 
sender spends its own 'right to execute on the CPU', and because the 
target task cannot execute any user code anymore when it gets a SIGKILL.


In any case, it is clear that rq->raw_cpu_load should be used instead of 
rq->nr_running, when calculating the fair clock, but i begin to like the 
nice_offset solution too in addition of this: it's effective in practice 
and starvation-free in theory, and most importantly, it's very simple. 
We could even make the nice offset granularity tunable, just in case 
anyone wants to weaken (or strengthen) the effectivity of nice levels. 
What do you think, can you see any obvious (or less obvious) 
showstoppers with this approach?


I haven't had a close look at it but from the above description it 
sounds an order of magnitude more complex than I thought it would be. 
The idea of different nice levels sounds like a recipe for starvation to 
me (if it works the way it sounds like it works).


I guess I'll have to spend more time reading the code because I don't 
seem to be able to make sense of the above description in any way that 
doesn't say "starvation here we come".


Peter
--
Peter Williams   [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce


Re: [patch] CFS scheduler, v3

2007-04-21 Thread Ingo Molnar

* William Lee Irwin III <[EMAIL PROTECTED]> wrote:

> I suppose this is a special case of the dreaded priority inversion. 
> What of, say, nice 19 tasks holding fs semaphores and/or mutexes that 
> nice -19 tasks are waiting to acquire? Perhaps rt_mutex should be the 
> default mutex implementation.

while i agree that it could be an issue, lock inversion is nothing 
really new, so i'd not go _that_ drastic to convert all mutexes to 
rtmutexes. (i've taken my -rt/PREEMPT_RT hat off)

For example reiser3 based systems get pretty laggy on significant 
reniced load (even with the vanilla scheduler) if CONFIG_PREEMPT_BKL is 
enabled: reiser3 holds the BKL for extended periods of time so a "make 
-j50" workload can starve it significantly and the tty layer's BKL use 
makes any sort of keyboard (even over ssh) input laggy.

Other locks though are not held this frequently and the mutex 
implementation is pretty fair for waiters anyway. (the semaphore 
implementation is not nearly as fair, and the Big Kernel Semaphore 
is still struct semaphore based) So i'd really wait for specific 
workloads to trigger problems, and _maybe_ convert certain mutexes to 
rtmutexes, on an as-needed basis.

> > In any case, it is clear that rq->raw_cpu_load should be used instead of 
> > rq->nr_running, when calculating the fair clock, but i begin to like the 
> > nice_offset solution too in addition of this: it's effective in practice 
> > and starvation-free in theory, and most importantly, it's very simple. 
> > We could even make the nice offset granularity tunable, just in case 
> > anyone wants to weaken (or strengthen) the effectivity of nice levels. 
> > What do you think, can you see any obvious (or less obvious) 
> > showstoppers with this approach?
> 
> ->nice_offset's semantics are not meaningful to the end user, 
> regardless of whether it's effective. [...]

yeah, agreed. That's one reason why i didn't make it tunable; it's pretty 
meaningless to the user.

> [...] If there is something to be tuned, it should be relative shares 
> of CPU bandwidth (load_weight) corresponding to each nice level or 
> something else directly observable. The implementation could be 
> ->nice_offset, if it suffices.
> 
> Suppose a table of nice weights like the following is tuned via 
> /proc/:
> 
> -20   21   0  1
>  -1   2   19  0.0476

> Essentially 1/(n+1) when n >= 0 and 1-n when n < 0.

ok, thanks for thinking about it. I have changed the nice weight in 
CFSv5-to-be so that it defaults to something pretty close to your 
suggestion: the ratio between a nice 0 loop and a nice 19 loop is now 
set to about 2%. (This is something that users have requested for some 
time; the default ~5% is a tad high when running reniced SETI jobs, etc.)

the actual percentage scales almost directly with the nice offset 
granularity value, but if this should be exposed to users at all, i 
agree that it would be better to directly expose this as some sort of 
'ratio between nice 0 and nice 19 tasks', right? Or some other, more 
fine-grained metric. Percentile is too coarse i think, and using 0.1% 
units isn't intuitive enough i think. The sysctl handler would then 
transform that 'human readable' sysctl value into the appropriate 
internal nice-offset-granularity value (or whatever mechanism the 
implementation ends up using).

I'd not do this as a per-nice-level thing but as a single value that 
rescales the whole nice level range at once. That's a lot less easy to 
misconfigure and we've got enough nice levels for users to pick from 
almost arbitrarily, as long as they have the ability to influence the 
max.

does this sound mostly OK to you?

Ingo
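
[Editor's sketch of the conversion Ingo describes: a single human-readable
ratio value mapped to the internal nice-offset granularity. The names, the
0.1% units, and the linear mapping are assumptions, grounded only in the
remark above that the resulting percentage scales almost directly with the
granularity value.]

#include <stdint.h>

#define DEFAULT_RATIO_TENTHS    20      /* ~2%, the CFSv5-to-be default */
#define DEFAULT_GRANULARITY     ((uint64_t)10 * 1000 * 1000)   /* assumed unit */

/* Convert a "nice 0 vs nice 19 ratio" given in 0.1% steps into the
 * internal nice-offset granularity, assuming direct proportionality. */
static uint64_t granularity_from_ratio(unsigned int ratio_tenths)
{
        if (ratio_tenths == 0)
                ratio_tenths = 1;       /* never collapse the granularity to zero */
        return DEFAULT_GRANULARITY * ratio_tenths / DEFAULT_RATIO_TENTHS;
}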


Re: [patch] CFS scheduler, v3

2007-04-21 Thread William Lee Irwin III
On Sat, Apr 21, 2007 at 09:54:01AM +0200, Ingo Molnar wrote:
> In practice they can starve a bit when one renices thousands of tasks, 
> so i was thinking about the following special-case: to at least make 
> them easily killable: if a nice 0 task sends a SIGKILL to a nice 19 task 
> then we could 'share' its p->wait_runtime with that nice 19 task and 
> copy the signal sender's nice_offset. This would in essence pass the 
> right to execute over to the killed task, so that it can tear itself 
> down.
> This cannot be used to gain an 'unfair advantage' because the signal 
> sender spends its own 'right to execute on the CPU', and because the 
> target task cannot execute any user code anymore when it gets a SIGKILL.

I suppose this is a special case of the dreaded priority inversion. What
of, say, nice 19 tasks holding fs semaphores and/or mutexes that nice
-19 tasks are waiting to acquire? Perhaps rt_mutex should be the default
mutex implementation.


On Sat, Apr 21, 2007 at 09:54:01AM +0200, Ingo Molnar wrote:
> In any case, it is clear that rq->raw_cpu_load should be used instead of 
> rq->nr_running, when calculating the fair clock, but i begin to like the 
> nice_offset solution too in addition of this: it's effective in practice 
> and starvation-free in theory, and most importantly, it's very simple. 
> We could even make the nice offset granularity tunable, just in case 
> anyone wants to weaken (or strengthen) the effectivity of nice levels. 
> What do you think, can you see any obvious (or less obvious) 
> showstoppers with this approach?

->nice_offset's semantics are not meaningful to the end user, regardless
of whether it's effective. If there is something to be tuned, it should
be relative shares of CPU bandwidth (load_weight) corresponding to each
nice level or something else directly observable. The implementation
could be ->nice_offset, if it suffices.

Suppose a table of nice weights like the following is tuned via /proc/:

-20 21   0  1
-19 20   1  0.5
-18 19   2  0.3333
-17 18   3  0.25
-16 17   4  0.2
-15 16   5  0.1666
-14 15   6  0.1429
-13 14   7  0.125
-12 13   8  0.1111
-11 12   9  0.1
-10 11  10  0.0909
 -9 10  11  0.0833
 -8 9   12  0.0769
 -7 8   13  0.0714
 -6 7   14  0.0666
 -5 6   15  0.0625
 -4 5   16  0.0588
 -3 4   17  0.0555
 -2 3   18  0.0526
 -1 2   19  0.0476

Essentially 1/(n+1) when n >= 0 and 1-n when n < 0.

Now suppose two CPU hogs, one being nice 1 and the other nice -1,
compete against each other. The nice 1 hog should receive 0.5/(2+0.5)
= 20% of the CPU, and the nice -1 hog should receive 80% of the CPU.
A nice -5 hog vs. a nice 0 hog should have the nice 0 hog receive
1/(6+1) = 14.29% of the CPU and the nice -5 hog receive 85.71% of the
CPU. Three hogs, one nice -1, one nice 0, and one nice 1 should have
the nice -1 hog receive 2/(2+1+0.5) = 57.14% of the CPU, the nice 0
hog receive 1/(2+1+0.5) = 28.57% of the CPU, and the nice 1 hog
receive 0.5/(2+1+0.5) = 14.29% of the CPU. And so on. (Note that these
are very strong nice semantics, probably too strong for real use.) For
five hogs with nice numbers ranging from -2 to 2, we'd get:
-2  43.90%
-1  29.27%
 0  14.63%
 1   7.32%
 2   4.88%
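
[Editor's check of the figures above: a small standalone C program applying
the stated weight formula (1/(n+1) for n >= 0, 1-n for n < 0) reproduces the
five-hog shares.]

#include <stdio.h>

static double nice_weight(int n)
{
        return n >= 0 ? 1.0 / (n + 1) : 1.0 - n;
}

int main(void)
{
        int nices[] = { -2, -1, 0, 1, 2 };      /* five competing CPU hogs */
        double sum = 0.0;
        int i;

        for (i = 0; i < 5; i++)
                sum += nice_weight(nices[i]);
        for (i = 0; i < 5; i++)                 /* share = weight / total weight */
                printf("nice %2d: %6.2f%%\n", nices[i],
                       100.0 * nice_weight(nices[i]) / sum);
        return 0;       /* prints 43.90, 29.27, 14.63, 7.32, 4.88 */
}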

However this is implemented doesn't matter so long as the semantics
are honored. I wouldn't specify too much beyond competing tasks with
equivalent sleep/wakeup behavior; one may wish to apply bonuses and
penalties to differing sleep/wakeup behavior in the future, even if
such is not now done.

Basically, the effect of nice numbers should be predictable somehow,
and frankly, making it specifiable by the user resolves the issue
of conflicting requirements.


-- wli


Re: [patch] CFS scheduler, v3

2007-04-21 Thread Peter Williams

Peter Williams wrote:

Peter Williams wrote:

Ingo Molnar wrote:
your suggestion concentrates on the following scenario: if a task 
happens to schedule in an 'unlucky' way and happens to hit a busy 
period while there are many idle periods. Unless i misunderstood your 
suggestion, that is the main intention behind it, correct?


You misunderstand (that's one of my other schedulers :-)).  This one's 
based on the premise that if everything happens as the task expects it 
will get the amount of CPU bandwidth (over this short period) that 
it's entitled to.  In reality, sometimes it will get more and 
sometimes less but on average it should get what it deserves. E.g. If 
you had two tasks with equal nice and both had demands of 90% of a CPU 
you'd expect them each to get about half of the CPU bandwidth.  Now 
suppose that one of them uses 5ms of CPU each time it got onto the CPU 
and the other uses 10ms.  If these two tasks just round robin with 
each other the likely outcome is that the one with the 10ms bursts 
will get twice as much CPU as the other but my proposed method should 
prevent this and cause them to get roughly the same amount of CPU.  (I 
believe this was a scenario that caused problems with O(1) and 
required a fix at some stage?)


Another advantage of this mechanism is that, all else being equal, it 
will tend to run tasks that use short bursts of CPU ahead of those that 
use long bursts and this tends to reduce the overall time spent waiting 
for CPU by all tasks on the system which is good for throughput.  I.e. 
in general, a task that tends to use short bursts of CPU will make other 
tasks wait less time than will one that tends to use long bursts.


So this means that you were right and it is good in the scenario that 
you suggested even though that wasn't the motivation behind the design.


This means that this scheduler should be good for improving latency on 
servers that aren't fully loaded as well as providing good fairness and 
responsiveness when the system is fully loaded.  (Fingers crossed.)


Peter
--
Peter Williams   [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce


Re: [patch] CFS scheduler, v3

2007-04-21 Thread Ingo Molnar

* Peter Williams <[EMAIL PROTECTED]> wrote:

> I retract this suggestion as it's a very bad idea.  It introduces the 
> possibility of starvation via the poor sods at the bottom of the queue 
> having their "on CPU" forever postponed and we all know that even the 
> smallest possibility of starvation will eventually cause problems.
> 
> I think there should be a rule: Once a task is on the queue its "on 
> CPU" time is immutable.

Yeah, fully agreed. Currently i'm using the simple method of 
p->nice_offset, which plainly just moves the per nice level areas of the 
tree far enough apart (by a constant offset) so that lower nice levels 
rarely interact with higher nice levels. Lower nice levels never truly 
starve because rq->fair_clock increases deterministically and currently 
the fair_key values are indeed 'immutable' as you suggest.

In practice they can starve a bit when one renices thousands of tasks, 
so i was thinking about the following special-case: to at least make 
them easily killable: if a nice 0 task sends a SIGKILL to a nice 19 task 
then we could 'share' its p->wait_runtime with that nice 19 task and 
copy the signal sender's nice_offset. This would in essence pass the 
right to execute over to the killed task, so that it can tear itself 
down.

This cannot be used to gain an 'unfair advantage' because the signal 
sender spends its own 'right to execute on the CPU', and because the 
target task cannot execute any user code anymore when it gets a SIGKILL.

In any case, it is clear that rq->raw_cpu_load should be used instead of 
rq->nr_running, when calculating the fair clock, but i begin to like the 
nice_offset solution too in addition of this: it's effective in practice 
and starvation-free in theory, and most importantly, it's very simple. 
We could even make the nice offset granularity tunable, just in case 
anyone wants to weaken (or strengthen) the effectivity of nice levels. 
What do you think, can you see any obvious (or less obvious) 
showstoppers with this approach?

Ingo
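
[Editor's sketch of the nice_offset idea described above, in plain C; the
helper name and the granularity constant are assumptions, not the CFS v3
code. Each nice level contributes a constant offset to the task's fair_key,
so different nice levels occupy well-separated regions of the tree while
rq->fair_clock still advances deterministically for everyone.]

#include <stdint.h>

#define NICE_OFFSET_GRANULARITY ((uint64_t)10 * 1000 * 1000)   /* assumed unit */

/* nice ranges over -20..19; shift it to 0..39 and spread the levels apart. */
static uint64_t fair_key_for(uint64_t fair_clock, int nice)
{
        return fair_clock + (uint64_t)(nice + 20) * NICE_OFFSET_GRANULARITY;
}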


Re: [patch] CFS scheduler, v3

2007-04-21 Thread Peter Williams

Peter Williams wrote:

William Lee Irwin III wrote:

William Lee Irwin III wrote:
On Sat, Apr 21, 2007 at 10:23:07AM +1000, Peter Williams wrote:
If some form of precise timer was used (instead) to trigger 
pre-emption then, where there is more than one task with the same 
expected "on CPU" time, only the last would get any CPU (and that's 
only the case if some infinite loop doesn't get created).


I think you can check for this and adjust their on-cpu times in various
ways and choose what order to run them in. I think it'd end up something
of a missed deadline policy.


Arguably the latest one should go first as its estimate is more likely 
to be correct, being based on more recent info, i.e. the others' estimates 
would be based on fewer runnable tasks and be optimistic. This would 
involve pushing them down the queue, and pushing items already on the 
queue downwards is likely to be expensive.  Which is why I prefer the 
less elegant solution of the periodic interrupt.


Of course, one solution might be to just add the adjustment to the key 
time on all tasks on the queue?


I retract this suggestion as it's a very bad idea.  It introduces the 
possibility of starvation via the poor sods at the bottom of the queue 
having their "on CPU" forever postponed and we all know that even the 
smallest possibility of starvation will eventually cause problems.


I think there should be a rule: Once a task is on the queue its "on CPU" 
time is immutable.


Peter
--
Peter Williams   [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce


Re: [patch] CFS scheduler, v3

2007-04-21 Thread Peter Williams

Peter Williams wrote:

William Lee Irwin III wrote:

William Lee Irwin III wrote:
On Sat, Apr 21, 2007 at 10:23:07AM +1000, Peter Williams wrote:
If some form of precise timer was used (instead) to trigger 
pre-emption then, where there is more than one task with the same 
expected on CPU time, only the last would get any CPU (and that's 
only the case if some infinite loop doesn't get created).


I think you can check for this and adjust their on-cpu times in various
ways and choose what order to run them in. I think it'd end up something
of a missed deadline policy.


Arguably the latest one should go first as its estimate is more likely 
to be correct being based on more recent info i.e. the other's estimates 
would be based on less runnable tasks and be optimistic. This would 
involve pushing them down the queue and pushing items already on the 
queue downwards is likely to be expensive.  Which is why I prefer the 
less elegant solution of the periodical interrupt.


Of course, one solution might be to just add the adjustment to the key 
time on all tasks on the queue?


I retract this suggestion as it's a very bad idea.  It introduces the 
possibility of starvation via the poor sods at the bottom of the queue 
having their on CPU forever postponed and we all know that even the 
smallest possibility of starvation will eventually cause problems.


I think there should be a rule: Once a task is on the queue its on CPU 
time is immutable.


Peter
--
Peter Williams   [EMAIL PROTECTED]

Learning, n. The kind of ignorance distinguishing the studious.
 -- Ambrose Bierce
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] CFS scheduler, v3

2007-04-21 Thread Ingo Molnar

* Peter Williams [EMAIL PROTECTED] wrote:

 I retract this suggestion as it's a very bad idea.  It introduces the 
 possibility of starvation via the poor sods at the bottom of the queue 
 having their on CPU forever postponed and we all know that even the 
 smallest possibility of starvation will eventually cause problems.
 
 I think there should be a rule: Once a task is on the queue its on 
 CPU time is immutable.

Yeah, fully agreed. Currently i'm using the simple method of 
p-nice_offset, which plainly just moves the per nice level areas of the 
tree far enough apart (by a constant offset) so that lower nice levels 
rarely interact with higher nice levels. Lower nice levels never truly 
starve because rq-fair_clock increases deterministically and currently 
the fair_key values are indeed 'immutable' as you suggest.

In practice they can starve a bit when one renices thousands of tasks, 
so i was thinking about the following special-case: to at least make 
them easily killable: if a nice 0 task sends a SIGKILL to a nice 19 task 
then we could 'share' its p-wait_runtime with that nice 19 task and 
copy the signal sender's nice_offset. This would in essence pass the 
right to execute over to the killed task, so that it can tear itself 
down.

This cannot be used to gain an 'unfair advantage' because the signal 
sender spends its own 'right to execute on the CPU', and because the 
target task cannot execute any user code anymore when it gets a SIGKILL.

In any case, it is clear that rq-raw_cpu_load should be used instead of 
rq-nr_running, when calculating the fair clock, but i begin to like the 
nice_offset solution too in addition of this: it's effective in practice 
and starvation-free in theory, and most importantly, it's very simple. 
We could even make the nice offset granularity tunable, just in case 
anyone wants to weaken (or strengthen) the effectivity of nice levels. 
What do you think, can you see any obvious (or less obvious) 
showstoppers with this approach?

Ingo
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] CFS scheduler, v3

2007-04-21 Thread Peter Williams

Peter Williams wrote:

Peter Williams wrote:

Ingo Molnar wrote:
your suggestion concentrates on the following scenario: if a task 
happens to schedule in an 'unlucky' way and happens to hit a busy 
period while there are many idle periods. Unless i misunderstood your 
suggestion, that is the main intention behind it, correct?


You misunderstand (that's one of my other schedulers :-)).  This one's 
based on the premise that if everything happens as the task expects it 
will get the amount of CPU bandwidth (over this short period) that 
it's entitled to.  In reality, sometimes it will get more and 
sometimes less but on average it should get what it deserves. E.g. If 
you had two tasks with equal nice and both had demands of 90% of a CPU 
you'd expect them each to get about half of the CPU bandwidth.  Now 
suppose that one of them uses 5ms of CPU each time it got onto the CPU 
and the other uses 10ms.  If these two tasks just round robin with 
each other the likely outcome is that the one with the 10ms bursts 
will get twice as much CPU as the other but my proposed method should 
prevent and cause them to get roughly the same amount of CPU.  (I 
believe this was a scenario that caused problems with O(1) and 
required a fix at some stage?)


Another advantage of this mechanism is that, all else being equal, it 
will tend to run tasks that use short bursts of CPU ahead of those that 
use long bursts and this tends to reduce the overall time spent waiting 
for CPU by all tasks on the system which is good for throughput.  I.e. 
in general, a task that tends to use short bursts of CPU will make other 
tasks wait less time than will one that tends to use long bursts.


So this means that you were right and it is good in the scenario that 
you suggested even though that wasn't the motivation behind the design.


This means that this scheduler should be good for improving latency on 
servers that aren't fully loaded as well as providing good fairness and 
responsiveness when the system is fully loaded.  (Fingers crossed.)


Peter
--
Peter Williams   [EMAIL PROTECTED]

Learning, n. The kind of ignorance distinguishing the studious.
 -- Ambrose Bierce
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] CFS scheduler, v3

2007-04-21 Thread William Lee Irwin III
On Sat, Apr 21, 2007 at 09:54:01AM +0200, Ingo Molnar wrote:
 In practice they can starve a bit when one renices thousands of tasks, 
 so i was thinking about the following special-case: to at least make 
 them easily killable: if a nice 0 task sends a SIGKILL to a nice 19 task 
 then we could 'share' its p-wait_runtime with that nice 19 task and 
 copy the signal sender's nice_offset. This would in essence pass the 
 right to execute over to the killed task, so that it can tear itself 
 down.
 This cannot be used to gain an 'unfair advantage' because the signal 
 sender spends its own 'right to execute on the CPU', and because the 
 target task cannot execute any user code anymore when it gets a SIGKILL.

I suppose this is a special case of the dreaded priority inversion. What
of, say, nice 19 tasks holding fs semaphores and/or mutexes that nice
-19 tasks are waiting to acquire? Perhaps rt_mutex should be the default
mutex implementation.


On Sat, Apr 21, 2007 at 09:54:01AM +0200, Ingo Molnar wrote:
 In any case, it is clear that rq-raw_cpu_load should be used instead of 
 rq-nr_running, when calculating the fair clock, but i begin to like the 
 nice_offset solution too in addition of this: it's effective in practice 
 and starvation-free in theory, and most importantly, it's very simple. 
 We could even make the nice offset granularity tunable, just in case 
 anyone wants to weaken (or strengthen) the effectivity of nice levels. 
 What do you think, can you see any obvious (or less obvious) 
 showstoppers with this approach?

-nice_offset's semantics are not meaningful to the end user, regardless
of whether it's effective. If there is something to be tuned, it should
be relative shares of CPU bandwidth (load_weight) corresponding to each
nice level or something else directly observable. The implementation
could be -nice_offset, if it suffices.

Suppose a table of nice weights like the following is tuned via /proc/:

-20 21   0  1
-19 20   1  0.5
-18 19   2  0.
-17 18   3  0.25
-16 17   4  0.2
-15 16   5  0.1666
-14 15   6  0.1429
-13 14   7  0.125
-12 13   8  0.
-11 12   9  0.1
-10 11  10  0.0909
 -9 10  11  0.0833
 -8 9   12  0.0769
 -7 8   13  0.0714
 -6 7   14  0.0666
 -5 6   15  0.0625
 -4 5   16  0.0588
 -3 4   17  0.0555
 -2 3   18  0.0526
 -1 2   19  0.0476

Essentially 1/(n+1) when n = 0 and 1-n when n  0.

Now suppose two CPU hogs, one being nice 1 and the other nice -1,
compete against each other. The nice 1 hog should receive 0.5/(2+0.5)
= 20% of the CPU, and the nice -1 hog should receive 80% of the CPU.
A nice -5 hog vs. a nice 0 hog should have the nice 0 hog receive
1/(6+1) = 14.29% of the CPU and the nice -5 hog receive 85.71% of the
CPU. Three hogs, once nice -1, one nice 0, and one nice 1 should have
the nice -1 hog receive 2/(2+1+0.5) = 57.14% of the CPU, the nice 0
hog receive 1/(2+1+0.5) = 28.57% of the CPU, and the nice 1 hog
receive 0.5/(2+1+0.5) = 14.29% of the CPU. And so on. (Note that these
are very strong nice semantics, probably too strong for real use.) For
five hogs with nice numbers ranging from -2 to 2, we'd get:
-2  43.90%
-1  29.27%
 0  14.63%
 1   7.32%
 2   4.88%

However this is implemented doesn't matter so long as the semantics
are honored. I wouldn't specify too much beyond competing tasks with
equivalent sleep/wakeup behavior; one may wish to apply bonuses and
penalties to differing sleep/wakeup behavior in the future, even if
such is not now done.

Basically, the effect of nice numbers should be predictable somehow,
and frankly, making it specifiable by the user resolves the issue
of conflicting requirements.


-- wli
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] CFS scheduler, v3

2007-04-21 Thread Ingo Molnar

* William Lee Irwin III [EMAIL PROTECTED] wrote:

 I suppose this is a special case of the dreaded priority inversion. 
 What of, say, nice 19 tasks holding fs semaphores and/or mutexes that 
 nice -19 tasks are waiting to acquire? Perhaps rt_mutex should be the 
 default mutex implementation.

while i agree that it could be an issue, lock inversion is nothing 
really new, so i'd not go _that_ drastic to convert all mutexes to 
rtmutexes. (i've taken my -rt/PREEMPT_RT hat off)

For example reiser3 based systems get pretty laggy on significant 
reniced load (even with the vanilla scheduler) if CONFIG_PREEMPT_BKL is 
enabled: reiser3 holds the BKL for extended periods of time so a make 
-j50 workload can starve it significantly and the tty layer's BKL use 
makes any sort of keyboard (even over ssh) input laggy.

Other locks though are not held this frequently and the mutex 
implementation is pretty fair for waiters anyway. (the semaphore 
implementation is not nearly as much fair, and the Big Kernel Semaphore 
is still struct semaphore based) So i'd really wait for specific 
workloads to trigger problems, and _maybe_ convert certain mutexes to 
rtmutexes, on an as-needed basis.

  In any case, it is clear that rq-raw_cpu_load should be used instead of 
  rq-nr_running, when calculating the fair clock, but i begin to like the 
  nice_offset solution too in addition of this: it's effective in practice 
  and starvation-free in theory, and most importantly, it's very simple. 
  We could even make the nice offset granularity tunable, just in case 
  anyone wants to weaken (or strengthen) the effectivity of nice levels. 
  What do you think, can you see any obvious (or less obvious) 
  showstoppers with this approach?
 
 -nice_offset's semantics are not meaningful to the end user, 
 regardless of whether it's effective. [...]

yeah, agreed. That's one reason why i didnt make it tunable, it's pretty 
meaningless to the user.

 [...] If there is something to be tuned, it should be relative shares 
 of CPU bandwidth (load_weight) corresponding to each nice level or 
 something else directly observable. The implementation could be 
 -nice_offset, if it suffices.
 
 Suppose a table of nice weights like the following is tuned via 
 /proc/:
 
 -20   21   0  1
  -1   2   19  0.0476

 Essentially 1/(n+1) when n = 0 and 1-n when n  0.

ok, thanks for thinking about it. I have changed the nice weight in 
CVSv5-to-be so that it defaults to something pretty close to your 
suggestion: the ratio between a nice 0 loop and a nice 19 loop is now 
set to about 2%. (This something that users requested for some time, the 
default ~5% is a tad high when running reniced SETI jobs, etc.)

the actual percentage scales almost directly with the nice offset 
granularity value, but if this should be exposed to users at all, i 
agree that it would be better to directly expose this as some sort of 
'ratio between nice 0 and nice 19 tasks', right? Or some other, more 
finegrained metric. Percentile is too coarse i think, and using 0.1% 
units isnt intuitive enough i think. The sysctl handler would then 
transform that 'human readable' sysctl value into the appropriate 
internal nice-offset-granularity value (or whatever mechanism the 
implementation ends up using).

I'd not do this as a per-nice-level thing but as a single value that 
rescales the whole nice level range at once. That's alot less easy to 
misconfigure and we've got enough nice levels for users to pick from 
almost arbitrarily, as long as they have the ability to influence the 
max.

does this sound mostly OK to you?

Ingo
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] CFS scheduler, v3

2007-04-21 Thread Peter Williams

Ingo Molnar wrote:

* Peter Williams [EMAIL PROTECTED] wrote:

I retract this suggestion as it's a very bad idea.  It introduces the 
possibility of starvation via the poor sods at the bottom of the queue 
having their on CPU forever postponed and we all know that even the 
smallest possibility of starvation will eventually cause problems.


I think there should be a rule: Once a task is on the queue its on 
CPU time is immutable.


Yeah, fully agreed. Currently i'm using the simple method of 
p-nice_offset, which plainly just moves the per nice level areas of the 
tree far enough apart (by a constant offset) so that lower nice levels 
rarely interact with higher nice levels. Lower nice levels never truly 
starve because rq-fair_clock increases deterministically and currently 
the fair_key values are indeed 'immutable' as you suggest.


In practice they can starve a bit when one renices thousands of tasks, 
so i was thinking about the following special-case: to at least make 
them easily killable: if a nice 0 task sends a SIGKILL to a nice 19 task 
then we could 'share' its p-wait_runtime with that nice 19 task and 
copy the signal sender's nice_offset. This would in essence pass the 
right to execute over to the killed task, so that it can tear itself 
down.


This cannot be used to gain an 'unfair advantage' because the signal 
sender spends its own 'right to execute on the CPU', and because the 
target task cannot execute any user code anymore when it gets a SIGKILL.
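
(A minimal user-space sketch of this "pass the right to execute" idea,
using the wait_runtime and nice_offset names from the thread; the 50/50
split and everything else about the structure is an assumption, not the
actual implementation.)

#include <stdio.h>

struct task {
        long long wait_runtime;   /* accumulated right to execute, in ns */
        long long nice_offset;    /* per-nice-level key offset */
};

/* sender SIGKILLs victim: share wait_runtime and copy the sender's
 * nice_offset so the victim can get the CPU and tear itself down */
static void donate_on_sigkill(struct task *sender, struct task *victim)
{
        long long half = sender->wait_runtime / 2;

        sender->wait_runtime -= half;   /* the sender spends its own right */
        victim->wait_runtime += half;
        victim->nice_offset   = sender->nice_offset;
}

int main(void)
{
        struct task killer = { .wait_runtime = 4000000, .nice_offset = 0 };
        struct task victim = { .wait_runtime = 0, .nice_offset = 190000000 };

        donate_on_sigkill(&killer, &victim);
        printf("victim: wait_runtime=%lld nice_offset=%lld\n",
               victim.wait_runtime, victim.nice_offset);
        return 0;
}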


In any case, it is clear that rq->raw_cpu_load should be used instead of 
rq->nr_running, when calculating the fair clock, but i begin to like the 
nice_offset solution too in addition of this: it's effective in practice 
and starvation-free in theory, and most importantly, it's very simple. 
We could even make the nice offset granularity tunable, just in case 
anyone wants to weaken (or strengthen) the effectivity of nice levels. 
What do you think, can you see any obvious (or less obvious) 
showstoppers with this approach?
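
(For concreteness, a minimal sketch of the load-scaled fair clock being
discussed: advance rq->fair_clock by elapsed time divided by the weighted
load rather than by nr_running.  Only the field names follow the thread;
the structure around them is invented.)

#include <stdio.h>

struct runqueue {
        unsigned long long fair_clock;    /* fair ns per unit of load weight */
        unsigned long long raw_cpu_load;  /* sum of load_weight of runnable tasks */
};

static void update_fair_clock(struct runqueue *rq, unsigned long long delta_ns)
{
        if (rq->raw_cpu_load)
                rq->fair_clock += delta_ns / rq->raw_cpu_load;
}

int main(void)
{
        struct runqueue rq = { .fair_clock = 0, .raw_cpu_load = 2048 };

        update_fair_clock(&rq, 10000000);       /* 10ms of wall time */
        printf("fair_clock advanced to %llu\n", rq.fair_clock);
        return 0;
}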


I haven't had a close look at it but from the above description it 
sounds an order of magnitude more complex than I thought it would be. 
The idea of different nice levels sounds like a recipe for starvation to 
me (if it works the way it sounds like it works).


I guess I'll have to spend more time reading the code because I don't 
seem to be able to make sense of the above description in any way that 
doesn't say starvation here we come.


Peter
--
Peter Williams   [EMAIL PROTECTED]

Learning, n. The kind of ignorance distinguishing the studious.
 -- Ambrose Bierce


Re: [patch] CFS scheduler, v3

2007-04-21 Thread Peter Williams

Peter Williams wrote:

Ingo Molnar wrote:

* Peter Williams [EMAIL PROTECTED] wrote:

I retract this suggestion as it's a very bad idea.  It introduces the 
possibility of starvation via the poor sods at the bottom of the 
queue having their on CPU forever postponed and we all know that 
even the smallest possibility of starvation will eventually cause 
problems.


I think there should be a rule: Once a task is on the queue its on 
CPU time is immutable.


Yeah, fully agreed. Currently i'm using the simple method of 
p->nice_offset, which plainly just moves the per nice level areas of 
the tree far enough apart (by a constant offset) so that lower nice 
levels rarely interact with higher nice levels. Lower nice levels 
never truly starve because rq->fair_clock increases deterministically 
and currently the fair_key values are indeed 'immutable' as you suggest.


In practice they can starve a bit when one renices thousands of tasks, 
so i was thinking about the following special-case: to at least make 
them easily killable: if a nice 0 task sends a SIGKILL to a nice 19 
task then we could 'share' its p->wait_runtime with that nice 19 task 
and copy the signal sender's nice_offset. This would in essence pass 
the right to execute over to the killed task, so that it can tear 
itself down.


This cannot be used to gain an 'unfair advantage' because the signal 
sender spends its own 'right to execute on the CPU', and because the 
target task cannot execute any user code anymore when it gets a SIGKILL.


In any case, it is clear that rq->raw_cpu_load should be used instead 
of rq->nr_running, when calculating the fair clock, but i begin to 
like the nice_offset solution too in addition of this: it's effective 
in practice and starvation-free in theory, and most importantly, it's 
very simple. We could even make the nice offset granularity tunable, 
just in case anyone wants to weaken (or strengthen) the effectivity of 
nice levels. What do you think, can you see any obvious (or less 
obvious) showstoppers with this approach?


I haven't had a close look at it but from the above description it 
sounds an order of magnitude more complex than I thought it would be. 
The idea of different nice levels sounds like a recipe for starvation to 
me (if it works the way it sounds like it works).


I guess I'll have to spend more time reading the code because I don't 
seem to be able to make sense of the above description in any way that 
doesn't say starvation here we come.


I'm finding it hard to figure out what the underlying principle for the 
way you're queuing things by reading the code (that's the trouble with 
reading the code it just tells you what's being done not why -- and 
sometimes it's even hard to figure out what's being done when there's a 
lot of indirection).  sched-design-CFS.txt isn't much help in this area 
either.


Any chance of a brief description of how it's supposed to work?

Key questions are:

How do you decide the key value for a task's position in the queue?

Is it an absolute time or an offset from the current time?

How do you decide when to boot the current task off the CPU?  Both at 
wake up of another task and in general play?


Peter
PS I think that you're trying to do too much in one patch.
--
Peter Williams   [EMAIL PROTECTED]

Learning, n. The kind of ignorance distinguishing the studious.
 -- Ambrose Bierce


Re: [patch] CFS scheduler, v3

2007-04-21 Thread William Lee Irwin III
* William Lee Irwin III [EMAIL PROTECTED] wrote:
 Suppose a table of nice weights like the following is tuned via 
 /proc/:
   nice   weight
    -20   21
     -1   2
      0   1
     19   0.0476
  Essentially 1/(n+1) when n >= 0 and 1-n when n < 0.
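
(A throwaway check of the rule quoted above.  Note that 1/(n+1) gives 0.05
for nice 19, so the 0.0476 in the table corresponds to 1/21 and the closed
form is only approximate at that end.  Plain user-space C, illustration
only.)

#include <stdio.h>

static double nice_weight(int n)
{
        return n >= 0 ? 1.0 / (n + 1) : 1.0 - n;
}

int main(void)
{
        int n;

        for (n = -20; n <= 19; n++)
                printf("nice %3d -> weight %.4f\n", n, nice_weight(n));
        return 0;
}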

On Sat, Apr 21, 2007 at 10:57:29AM +0200, Ingo Molnar wrote:
 ok, thanks for thinking about it. I have changed the nice weight in 
 CVSv5-to-be so that it defaults to something pretty close to your 
 suggestion: the ratio between a nice 0 loop and a nice 19 loop is now 
set to about 2%. (This is something that users requested for some time, the 
 default ~5% is a tad high when running reniced SETI jobs, etc.)

Okay. Maybe what I suggested is too weak vs. too strong. I didn't
actually have it in mind as a proposal for general use, but maybe it is
good for such. I had more in mind tunability in general, but it's all
good. I'd think some curve gentler in intermediate nice levels and
stronger at the tails might be better.


On Sat, Apr 21, 2007 at 10:57:29AM +0200, Ingo Molnar wrote:
 the actual percentage scales almost directly with the nice offset 
 granularity value, but if this should be exposed to users at all, i 
 agree that it would be better to directly expose this as some sort of 
 'ratio between nice 0 and nice 19 tasks', right? Or some other, more 
 finegrained metric. Percentile is too coarse i think, and using 0.1% 
 units isnt intuitive enough i think. The sysctl handler would then 
 transform that 'human readable' sysctl value into the appropriate 
 internal nice-offset-granularity value (or whatever mechanism the 
 implementation ends up using).

I vaguely liked specifying the full table, but maybe it's too much
for a real user interface.

4-digit or 5-digit fixed point decimal sounds reasonable.


On Sat, Apr 21, 2007 at 10:57:29AM +0200, Ingo Molnar wrote:
 I'd not do this as a per-nice-level thing but as a single value that 
 rescales the whole nice level range at once. That's alot less easy to 
 misconfigure and we've got enough nice levels for users to pick from 
 almost arbitrarily, as long as they have the ability to influence the 
 max.
 does this sound mostly OK to you?

For the most part, yes. I've been mostly looking at how effectively
the prioritization algorithms work. I'll be wrapping up writing a
testcase to measure all this soon. The basic idea is to take the
weights as inputs somehow and then check to see that they're honored.
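
(A rough user-space sketch of the kind of testcase described here: run two
CPU-bound children at different nice levels, then compare their CPU time.
Everything about the structure is guesswork about what such a test might
look like; pin the test to a single CPU, e.g. with taskset, for the ratio
to mean anything on SMP.)

#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <signal.h>
#include <sys/wait.h>
#include <sys/resource.h>

static pid_t spawn_burner(int nice_level)
{
        pid_t pid = fork();

        if (pid == 0) {
                nice(nice_level);
                for (;;)
                        ;               /* burn CPU until killed */
        }
        return pid;
}

static double reap_cpu_seconds(pid_t pid)
{
        struct rusage ru;
        int status;

        wait4(pid, &status, 0, &ru);    /* per-child rusage */
        return ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6;
}

int main(void)
{
        pid_t a = spawn_burner(0);      /* nice  0 */
        pid_t b = spawn_burner(19);     /* nice 19 */
        double ca, cb;

        sleep(10);                      /* let them compete */
        kill(a, SIGKILL);
        kill(b, SIGKILL);

        ca = reap_cpu_seconds(a);
        cb = reap_cpu_seconds(b);
        printf("nice 0: %.2fs  nice 19: %.2fs  nice19/nice0 = %.3f\n",
               ca, cb, cb / ca);
        return 0;
}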

What's appropriate for end-users is a very different thing from what
might be appropriate for me. I won't have trouble fiddling with the
code, so please do design around what the best interface for end-users
might be.


-- wli


Re: [patch] CFS scheduler, v3

2007-04-20 Thread Peter Williams

William Lee Irwin III wrote:

William Lee Irwin III wrote:

This essentially doesn't look correct because while you want to enforce
the CPU bandwidth allocation, this doesn't have much to do with that
apart from the CPU bandwidth appearing as a term. It's more properly
a rate of service as opposed to a time at which anything should happen
or a number useful for predicting such. When service should begin more
properly depends on the other tasks in the system and a number of other
decisions that are part of the scheduling policy.


On Sat, Apr 21, 2007 at 10:23:07AM +1000, Peter Williams wrote:
This model takes all of those into consideration.  The idea is not just 
to predict but to use the calculated time to decide when to boot the 
current process off the CPU (if it doesn't leave voluntarily) and put 
this one on.  This more or less removes the need to give each task a 
predetermined chunk of CPU when they go on to the CPU.  This should, in 
general, reduce the number of context switches as tasks get to run until 
they've finished what they're doing or another task becomes higher 
priority rather than being booted off after an arbitrary time interval. 
 (If this ever gets tried it will be interesting to see if this 
prediction comes true.)
BTW Even if Ingo doesn't choose to try this model, I'll probably make a 
patch (way in the future after Ingo's changes are settled) to try it out 
myself.


I think I smoked out what you were doing.


William Lee Irwin III wrote:

If you want to choose a "quasi-inter-arrival time" to achieve the
specified CPU bandwidth allocation, this would be it, but to use that
to actually enforce the CPU bandwidth allocation, you would need to
take into account the genuine inter-arrival time to choose an actual
time for service to begin. In other words, this should be a quota for
the task to have waited. If it's not waited long enough, then it should
be delayed by the difference to achieve the inter-arrival time you're
trying to enforce. If it's waited longer, it should be executed
sooner modulo other constraints, and perhaps even credited for future
scheduling cycles.


On Sat, Apr 21, 2007 at 10:23:07AM +1000, Peter Williams wrote:
The idea isn't to enforce the bandwidth entitlement to the extent of 
throttling tasks if they exceed their entitlement and there's no other 
tasks ready to use the CPU.  This is mainly because the bandwidth 
entitlement isn't fixed -- it's changing constantly as the number and 
type of runnable tasks changes.


Well, a little hysteresis will end up throttling in such a manner
anyway as a side-effect,


Think of this as a calming influence :-)


or you'll get anomalies. Say two tasks with
equal entitlements compete, where one sleeps for 1/3 of the time and
the other is fully CPU-bound. If only the times when they're in direct
competition are split 50/50, then the CPU-bound task gets 2/3 and the
sleeper 1/3, which is not the intended effect. I don't believe this
model will be very vulnerable to it, though.


Nor me.




William Lee Irwin III wrote:

In order to partially avoid underallocating CPU bandwidth to p, one
should track the time last spent sleeping and do the following:


On Sat, Apr 21, 2007 at 10:23:07AM +1000, Peter Williams wrote:
Yes I made a mistake in omitting to take into account sleep interval. 
See another e-mail to Ingo correcting this problem.


I took it to be less trivial of an error than it was. No big deal.


No, you were right it was definitely a NON trivial error.




William Lee Irwin III wrote:

In order to do better, longer-term history is required,


On Sat, Apr 21, 2007 at 10:23:07AM +1000, Peter Williams wrote:
The half life of the Kalman filter (roughly equivalent to a running 
average) used to calculate the averages determines how much history is 
taken into account.  It could be made configurable (at least, until 
enough real life experience was available to decide on the best value to 
use).


A Kalman filter would do better than a running average. I'm all for it.


As a long time user of Kalman filters I tend to think of them as the 
same thing.  I use the term running average when talking about the idea 
behind a scheduler because I think that more people will understand what 
the general idea is.  When it comes to implementation I always replace 
the idea of "running average" with a roughly equivalent Kalman filter.





William Lee Irwin III wrote:

To attempt to maintain an infinite history of
bandwidth underutilization to be credited too far in the future would
enable potentially long-term overutilization when such credits are
cashed en masse for a sustained period of time. At some point you have
to say "use it or lose it;" over a shorter period of time some smoothing
is still admissible and even desirable.


On Sat, Apr 21, 2007 at 10:23:07AM +1000, Peter Williams wrote:
Yes, that's why I suggest a running average over the last few scheduling 
cycles for the task.  But thinking about it some more I'm now not so 

Re: [patch] CFS scheduler, v3

2007-04-20 Thread William Lee Irwin III
William Lee Irwin III wrote:
>> This essentially doesn't look correct because while you want to enforce
>> the CPU bandwidth allocation, this doesn't have much to do with that
>> apart from the CPU bandwidth appearing as a term. It's more properly
>> a rate of service as opposed to a time at which anything should happen
>> or a number useful for predicting such. When service should begin more
>> properly depends on the other tasks in the system and a number of other
>> decisions that are part of the scheduling policy.

On Sat, Apr 21, 2007 at 10:23:07AM +1000, Peter Williams wrote:
> This model takes all of those into consideration.  The idea is not just 
> to predict but to use the calculated time to decide when to boot the 
> current process off the CPU (if it doesn't leave voluntarily) and put 
> this one on.  This more or less removes the need to give each task a 
> predetermined chunk of CPU when they go on to the CPU.  This should, in 
> general, reduce the number of context switches as tasks get to run until 
> they've finished what they're doing or another task becomes higher 
> priority rather than being booted off after an arbitrary time interval. 
>  (If this ever gets tried it will be interesting to see if this 
> prediction comes true.)
> BTW Even if Ingo doesn't choose to try this model, I'll probably make a 
> patch (way in the future after Ingo's changes are settled) to try it out 
> myself.

I think I smoked out what you were doing.


William Lee Irwin III wrote:
>> If you want to choose a "quasi-inter-arrival time" to achieve the
>> specified CPU bandwidth allocation, this would be it, but to use that
>> to actually enforce the CPU bandwidth allocation, you would need to
>> take into account the genuine inter-arrival time to choose an actual
>> time for service to begin. In other words, this should be a quota for
>> the task to have waited. If it's not waited long enough, then it should
>> be delayed by the difference to achieve the inter-arrival time you're
>> trying to enforce. If it's waited longer, it should be executed
>> sooner modulo other constraints, and perhaps even credited for future
>> scheduling cycles.

On Sat, Apr 21, 2007 at 10:23:07AM +1000, Peter Williams wrote:
> The idea isn't to enforce the bandwidth entitlement to the extent of 
> throttling tasks if they exceed their entitlement and there's no other 
> tasks ready to use the CPU.  This is mainly because the bandwidth 
> entitlement isn't fixed -- it's changing constantly as the number and 
> type of runnable tasks changes.

Well, a little hysteresis will end up throttling in such a manner
anyway as a side-effect, or you'll get anomalies. Say two tasks with
equal entitlements compete, where one sleeps for 1/3 of the time and
the other is fully CPU-bound. If only the times when they're in direct
competition are split 50/50, then the CPU-bound task gets 2/3 and the
sleeper 1/3, which is not the intended effect. I don't believe this
model will be very vulnerable to it, though.


William Lee Irwin III wrote:
>> In order to partially avoid underallocating CPU bandwidth to p, one
>> should track the time last spent sleeping and do the following:

On Sat, Apr 21, 2007 at 10:23:07AM +1000, Peter Williams wrote:
> Yes I made a mistake in omitting to take into account sleep interval. 
> See another e-mail to Ingo correcting this problem.

I took it to be less trivial of an error than it was. No big deal.


William Lee Irwin III wrote:
>> In order to do better, longer-term history is required,

On Sat, Apr 21, 2007 at 10:23:07AM +1000, Peter Williams wrote:
> The half life of the Kalman filter (roughly equivalent to a running 
> average) used to calculate the averages determines how much history is 
> taken into account.  It could be made configurable (at least, until 
> enough real life experience was available to decide on the best value to 
> use).

A Kalman filter would do better than a running average. I'm all for it.


William Lee Irwin III wrote:
>> To attempt to maintain an infinite history of
>> bandwidth underutilization to be credited too far in the future would
>> enable potentially long-term overutilization when such credits are
>> cashed en masse for a sustained period of time. At some point you have
>> to say "use it or lose it;" over a shorter period of time some smoothing
>> is still admissible and even desirable.

On Sat, Apr 21, 2007 at 10:23:07AM +1000, Peter Williams wrote:
> Yes, that's why I suggest a running average over the last few scheduling 
> cycles for the task.  But thinking about it some more I'm now not so 
> sure.  The lack of apparent "smoothness" when I've done this sort of 
> thing with raw rather than smooth data (in modifications to the current 
> dynamic priority based scheduler model) is usually noticed by running 
> top and seeing wildly fluctuating dynamic priorities.  I'm not sure that 
> the actual responsiveness of the system reflects this.  So I'm now 
> willing to reserve my 

Re: [patch] CFS scheduler, v3

2007-04-20 Thread Peter Williams

William Lee Irwin III wrote:

On Fri, Apr 20, 2007 at 10:10:45AM +1000, Peter Williams wrote:
I have a suggestion I'd like to make that addresses both nice and 
fairness at the same time.  As I understand the basic principle behind 
this scheduler is to work out a time by which a task should make it onto 
the CPU and then place it into an ordered list (based on this value) of 
tasks waiting for the CPU.  I think that this is a great idea and my 
suggestion is with regard to a method for working out this time that 
takes into account both fairness and nice.


Hmm. Let me take a look...


On Fri, Apr 20, 2007 at 10:10:45AM +1000, Peter Williams wrote:
First suppose we have the following metrics available in addition to 
what's already provided.

rq->avg_weight_load /* a running average of the weighted load on the CPU */
p->avg_cpu_per_cycle /* the average time in nsecs that p spends on the 
CPU each scheduling cycle */


I'm suspicious of mean service times not paired with mean inter-arrival
times.


On Fri, Apr 20, 2007 at 10:10:45AM +1000, Peter Williams wrote:
where a scheduling cycle for a task starts when it is placed on the 
queue after waking or being preempted and ends when it is taken off the 
CPU either voluntarily or after being preempted.  So 
p->avg_cpu_per_cycle is just the average amount of time p spends on the 
CPU each time it gets on to the CPU.  Sorry for the long explanation 
here but I just wanted to make sure there was no chance that "scheduling 
cycle" would be construed as some mechanism being imposed on the scheduler.)


I went and looked up priority queueing queueing theory garbage and
re-derived various things I needed. The basics check out. Probably no
one cares that I checked.


On Fri, Apr 20, 2007 at 10:10:45AM +1000, Peter Williams wrote:

We can then define:
effective_weighted_load = max(rq->raw_weighted_load, rq->avg_weighted_load)
If p is just waking (i.e. it's not on the queue and its load_weight is 
not included in rq->raw_weighted_load) and we need to queue it, we say 
that the maximum time (in all fairness) that p should have to wait to 
get onto the CPU is:
expected_wait = p->avg_cpu_per_cycle * effective_weighted_load / 
p->load_weight


You're right.  The time that the task spent sleeping before being woken 
should be subtracted from this value.  If the answer is less than or 
equal to zero pre-emption should occur.




This doesn't look right, probably because the scaling factor of
p->avg_cpu_per_cycle is the reciprocal of its additive contribution to
the ->avg_weight_load as opposed to a direct estimate of its initial
delay or waiting time before completing its current requested service.

p->load_weight/effective_weighted_load more properly represents an
entitlement to CPU bandwidth.


Yes.  But expected_wait isn't entitlement, it's its inverse.


p->avg_cpu_per_cycle/(p->load_weight/effective_weighted_load)
would be more like the expected time spent on the runqueue


When I went to school that would be just another way of expressing the 
equation that I expressed.



(whether
waiting to run or actually running) for a time interval spent in a
runnable state and the expected time runnable and waiting to run in such
an interval would be
p->avg_cpu_per_cycle*(1-effective_weighted_load/p->load_weight),

Neither represents the initial delay between entering the runqueue and
first acquiring the CPU, but that's a bit hard to figure out without
deciding the scheduling policy up-front anyway.

This essentially doesn't look correct because while you want to enforce
the CPU bandwidth allocation, this doesn't have much to do with that
apart from the CPU bandwidth appearing as a term. It's more properly
a rate of service as opposed to a time at which anything should happen
or a number useful for predicting such. When service should begin more
properly depends on the other tasks in the system and a number of other
decisions that are part of the scheduling policy.


This model takes all of those into consideration.  The idea is not just 
to predict but to use the calculated time to decide when to boot the 
current process off the CPU (if it doesn't leave voluntarily) and put 
this one on.  This more or less removes the need to give each task a 
predetermined chunk of CPU when they go on to the CPU.  This should, in 
general, reduce the number of context switches as tasks get to run until 
they've finished what they're doing or another task becomes higher 
priority rather than being booted off after an arbitrary time interval. 
 (If this ever gets tried it will be interesting to see if this 
prediction comes true.)


BTW Even if Ingo doesn't choose to try this model, I'll probably make a 
patch (way in the future after Ingo's changes are settled) to try it out 
myself.




If you want to choose a "quasi-inter-arrival time" to achieve the
specified CPU bandwidth allocation, this would be it, but to use that
to actually enforce the CPU bandwidth allocation, you would 

Re: [patch] CFS scheduler, v3

2007-04-20 Thread William Lee Irwin III
On Fri, 20 Apr 2007, William Lee Irwin III wrote:
>> I'm not really convinced it's all that worthwhile of an optimization,
>> essentially for the same reasons as you, but presumably there's a
>> benchmark result somewhere that says it matters. I've just not seen it.

On Fri, Apr 20, 2007 at 12:44:55PM -0700, Christoph Lameter wrote:
> If it is true that we frequently remotely write the per cpu runqueue 
> data then we may have a NUMA scalability issue.

From the discussion on Suresh's thread, it appears to have sped up a
database benchmark 0.5%.

Last I checked it was workload-dependent, but there were things that
hammer it. I mostly know of the remote wakeup issue, but there could
be other things besides wakeups that do it, too.


-- wli


Re: [patch] CFS scheduler, v3

2007-04-20 Thread Christoph Lameter
On Fri, 20 Apr 2007, William Lee Irwin III wrote:

> I'm not really convinced it's all that worthwhile of an optimization,
> essentially for the same reasons as you, but presumably there's a
> benchmark result somewhere that says it matters. I've just not seen it.

If it is true that we frequently remotely write the per cpu runqueue 
data then we may have a NUMA scalability issue.


Re: [patch] CFS scheduler, v3

2007-04-20 Thread William Lee Irwin III
Fri, Apr 20, 2007 at 12:24:17PM -0700, Christoph Lameter wrote:
>>> False sharing for a per cpu data structure? Are we updating that 
>>> structure from other processors?

On Fri, 20 Apr 2007, William Lee Irwin III wrote:
>> Primarily in the load balancer, but also in wakeups.

On Fri, Apr 20, 2007 at 12:33:13PM -0700, Christoph Lameter wrote:
> That is fairly rare I think. What other variables that are also written 
> frequently would cause false sharing?

I wrote that backward, sorry. Cross-CPU wakeups' frequency depends
heavily on the workload. Probably the only other case I can think of
is io_schedule() but that's not really significant.

I'm not really convinced it's all that worthwhile of an optimization,
essentially for the same reasons as you, but presumably there's a
benchmark result somewhere that says it matters. I've just not seen it.


-- wli


Re: [patch] CFS scheduler, v3

2007-04-20 Thread Christoph Lameter
On Fri, 20 Apr 2007, William Lee Irwin III wrote:

> On Wed, 18 Apr 2007, William Lee Irwin III wrote:
> >> 
> >> Mark the runqueues cacheline_aligned_in_smp to avoid false sharing.
> 
> On Fri, Apr 20, 2007 at 12:24:17PM -0700, Christoph Lameter wrote:
> > False sharing for a per cpu data structure? Are we updating that 
> > structure from other processors?
> 
> Primarily in the load balancer, but also in wakeups.

That is fairly rare I think. What other variables that are also written 
frequently would cause false sharing?




Re: [patch] CFS scheduler, v3

2007-04-20 Thread William Lee Irwin III
On Wed, 18 Apr 2007, William Lee Irwin III wrote:
>> 
>> Mark the runqueues cacheline_aligned_in_smp to avoid false sharing.

On Fri, Apr 20, 2007 at 12:24:17PM -0700, Christoph Lameter wrote:
> False sharing for a per cpu data structure? Are we updating that 
> structure from other processors?

Primarily in the load balancer, but also in wakeups.


-- wli


Re: [patch] CFS scheduler, v3

2007-04-20 Thread Christoph Lameter
On Wed, 18 Apr 2007, William Lee Irwin III wrote:

> 
> Mark the runqueues cacheline_aligned_in_smp to avoid false sharing.

False sharing for a per cpu data structure? Are we updating that 
structure from other processors?


Re: [patch] CFS scheduler, v3

2007-04-20 Thread Ingo Molnar

* Peter Williams <[EMAIL PROTECTED]> wrote:

> BTW Given that I'm right and dynamic priorities have been dispensed 
> with what do you intend exporting (in their place) to user space for 
> display by top and similar?

well i thought of only displaying static ones, i.e. like the current 
patch does.

Ingo


Re: [patch] CFS scheduler, v3

2007-04-20 Thread Peter Williams

Peter Williams wrote:

Ingo Molnar wrote:


 - bugfix: use constant offset factor for nice levels instead of
sched_granularity_ns. Thus nice levels work even if someone sets
sched_granularity_ns to 0. NOTE: nice support is still naive, i'll
address the many nice level related suggestions in -v4.


I have a suggestion I'd like to make that addresses both nice and 
fairness at the same time.  As I understand the basic principle behind 
this scheduler is to work out a time by which a task should make it onto 
the CPU and then place it into an ordered list (based on this value) of 
tasks waiting for the CPU.  I think that this is a great idea and my 
suggestion is with regard to a method for working out this time that 
takes into account both fairness and nice.


First suppose we have the following metrics available in addition to 
what's already provided.


rq->avg_weight_load /* a running average of the weighted load on the CPU */
p->avg_cpu_per_cycle /* the average time in nsecs that p spends on the 
CPU each scheduling cycle */


where a scheduling cycle for a task starts when it is placed on the 
queue after waking or being preempted and ends when it is taken off the 
CPU either voluntarily or after being preempted.  So 
p->avg_cpu_per_cycle is just the average amount of time p spends on the 
CPU each time it gets on to the CPU.  Sorry for the long explanation 
here but I just wanted to make sure there was no chance that "scheduling 
cycle" would be construed as some mechanism being imposed on the 
scheduler.)


We can then define:

effective_weighted_load = max(rq->raw_weighted_load, rq->avg_weighted_load)

If p is just waking (i.e. it's not on the queue and its load_weight is 
not included in rq->raw_weighted_load) and we need to queue it, we say 
that the maximum time (in all fairness) that p should have to wait to 
get onto the CPU is:


expected_wait = p->avg_cpu_per_cycle * effective_weighted_load / 
p->load_weight


I just realized that this is wrong for the case where p is being woken. 
 In that case the length of the last sleep should be subtracted from 
the above value and if the result is negative pre-empt straight away. 
So expected_wait becomes:


expected_wait = p->avg_cpu_per_cycle * effective_weighted_load / 
p->load_weight - p->length_of_last_sleep


As the length of the last sleep will be being calculated during the wake 
process there's no real need for it to be a task field and a local 
variable could be used instead.


For a task being requeued during pre-emption the equation would be:

expected_wait = time_just_spent_on_the_cpu * effective_weighted_load / 
p->load_weight


as there was zero sleep since last time on CPU.  Using the actual time 
on the CPU so far this time (instead of the average) will compensate the 
task for being pre-empted.
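
(Pulling the pieces above together, a condensed user-space sketch of the
proposed expected-wait arithmetic; the field names are from this thread,
while the surrounding structure and the sample numbers are invented.)

#include <stdio.h>

struct rq_stats {
        unsigned long long raw_weighted_load;
        unsigned long long avg_weighted_load;   /* running average */
};

struct task_stats {
        unsigned long long avg_cpu_per_cycle;   /* mean on-CPU ns per cycle */
        unsigned long long load_weight;
};

static unsigned long long eff_load(const struct rq_stats *rq)
{
        return rq->raw_weighted_load > rq->avg_weighted_load ?
               rq->raw_weighted_load : rq->avg_weighted_load;
}

/* waking task: fair wait minus the sleep it just finished (0 => preempt now) */
static unsigned long long expected_wait_wakeup(const struct task_stats *p,
                                               const struct rq_stats *rq,
                                               unsigned long long last_sleep)
{
        unsigned long long w = p->avg_cpu_per_cycle * eff_load(rq) / p->load_weight;

        return w > last_sleep ? w - last_sleep : 0;
}

/* task requeued after preemption: use the CPU time it actually just got */
static unsigned long long expected_wait_requeue(const struct task_stats *p,
                                                const struct rq_stats *rq,
                                                unsigned long long just_ran)
{
        return just_ran * eff_load(rq) / p->load_weight;
}

int main(void)
{
        struct rq_stats rq = { .raw_weighted_load = 3072, .avg_weighted_load = 2048 };
        struct task_stats p = { .avg_cpu_per_cycle = 5000000, .load_weight = 1024 };

        printf("wakeup wait:  %llu ns\n", expected_wait_wakeup(&p, &rq, 2000000));
        printf("requeue wait: %llu ns\n", expected_wait_requeue(&p, &rq, 4000000));
        return 0;
}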




Calculating p->avg_cpu_per_cycle costs one add, one multiply and one 
shift right per scheduling cycle of the task.  An additional cost is 
that you need a shift right to get the nanosecond value from value 
stored in the task struct. (i.e. the above code is simplified to give 
the general idea).  The average would be number of cycles based rather 
than time based and (happily) this simplifies the calculations.
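
(One plausible reading of that cheap update is an exponentially weighted
average kept with one multiply, one add and one shift right per cycle; the
shift count and fixed-point layout below are assumptions, since the exact
scaling isn't spelled out here.)

#include <stdio.h>

#define AVG_SHIFT 3     /* each cycle keeps (1 - 2^-AVG_SHIFT) of the old average */

static unsigned long long
update_avg_cpu_per_cycle(unsigned long long avg_ns, unsigned long long sample_ns)
{
        /* avg := avg * (1 - 2^-S) + sample * 2^-S */
        return (avg_ns * ((1ULL << AVG_SHIFT) - 1) + sample_ns) >> AVG_SHIFT;
}

int main(void)
{
        unsigned long long avg = 0;
        unsigned long long bursts[] = { 5000000, 5000000, 10000000, 2000000 };
        unsigned int i;

        for (i = 0; i < sizeof(bursts) / sizeof(bursts[0]); i++) {
                avg = update_avg_cpu_per_cycle(avg, bursts[i]);
                printf("cycle %u: avg_cpu_per_cycle = %llu ns\n", i, avg);
        }
        return 0;
}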


If you don't like using the average CPU time per cycle, you could just 
use the length of time on the CPU for the last time the task was on the 
CPU.  This would simplify things a lot.


I've been thinking about the "jerkiness" (or lack of "smoothness") that 
I said would occur if smoothed averages weren't used and I've realised 
that what I was talking about was observed jerkiness in the dynamic 
priorities of the tasks.  As this scheduler has dispensed with dynamic 
priorities (as far as I can see) the jerkiness probably won't be apparent.


BTW Given that I'm right and dynamic priorities have been dispensed with 
what do you intend exporting (in their place) to user space for display 
by top and similar?


Peter
--
Peter Williams   [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce


Re: [patch] CFS scheduler, v3

2007-04-20 Thread William Lee Irwin III
On Fri, Apr 20, 2007 at 10:10:45AM +1000, Peter Williams wrote:
> I have a suggestion I'd like to make that addresses both nice and 
> fairness at the same time.  As I understand the basic principle behind 
> this scheduler is to work out a time by which a task should make it onto 
> the CPU and then place it into an ordered list (based on this value) of 
> tasks waiting for the CPU.  I think that this is a great idea and my 
> suggestion is with regard to a method for working out this time that 
> takes into account both fairness and nice.

Hmm. Let me take a look...


On Fri, Apr 20, 2007 at 10:10:45AM +1000, Peter Williams wrote:
> First suppose we have the following metrics available in addition to 
> what's already provided.
> rq->avg_weight_load /* a running average of the weighted load on the CPU */
> p->avg_cpu_per_cycle /* the average time in nsecs that p spends on the 
> CPU each scheduling cycle */

I'm suspicious of mean service times not paired with mean inter-arrival
times.


On Fri, Apr 20, 2007 at 10:10:45AM +1000, Peter Williams wrote:
> where a scheduling cycle for a task starts when it is placed on the 
> queue after waking or being preempted and ends when it is taken off the 
> CPU either voluntarily or after being preempted.  So 
> p->avg_cpu_per_cycle is just the average amount of time p spends on the 
> CPU each time it gets on to the CPU.  Sorry for the long explanation 
> here but I just wanted to make sure there was no chance that "scheduling 
> cycle" would be construed as some mechanism being imposed on the scheduler.)

I went and looked up priority queueing queueing theory garbage and
re-derived various things I needed. The basics check out. Probably no
one cares that I checked.


On Fri, Apr 20, 2007 at 10:10:45AM +1000, Peter Williams wrote:
> We can then define:
> effective_weighted_load = max(rq->raw_weighted_load, rq->avg_weighted_load)
> If p is just waking (i.e. it's not on the queue and its load_weight is 
> not included in rq->raw_weighted_load) and we need to queue it, we say 
> that the maximum time (in all fairness) that p should have to wait to 
> get onto the CPU is:
> expected_wait = p->avg_cpu_per_cycle * effective_weighted_load / 
> p->load_weight

This doesn't look right, probably because the scaling factor of
p->avg_cpu_per_cycle is the reciprocal of its additive contribution to
the ->avg_weight_load as opposed to a direct estimate of its initial
delay or waiting time before completing its current requested service.

p->load_weight/effective_weighted_load more properly represents an
entitlement to CPU bandwidth.
p->avg_cpu_per_cycle/(p->load_weight/effective_weighted_load)
would be more like the expected time spent on the runqueue (whether
waiting to run or actually running) for a time interval spent in a
runnable state and the expected time runnable and waiting to run in such
an interval would be
p->avg_cpu_per_cycle*(1-effective_weighted_load/p->load_weight),

Neither represents the initial delay between entering the runqueue and
first acquiring the CPU, but that's a bit hard to figure out without
deciding the scheduling policy up-front anyway.

This essentially doesn't look correct because while you want to enforce
the CPU bandwidth allocation, this doesn't have much to do with that
apart from the CPU bandwidth appearing as a term. It's more properly
a rate of service as opposed to a time at which anything should happen
or a number useful for predicting such. When service should begin more
properly depends on the other tasks in the system and a number of other
decisions that are part of the scheduling policy.

If you want to choose a "quasi-inter-arrival time" to achieve the
specified CPU bandwidth allocation, this would be it, but to use that
to actually enforce the CPU bandwidth allocation, you would need to
take into account the genuine inter-arrival time to choose an actual
time for service to begin. In other words, this should be a quota for
the task to have waited. If it's not waited long enough, then it should
be delayed by the difference to achieve the inter-arrival time you're
trying to enforce. If it's waited longer, it should be executed
sooner modulo other constraints, and perhaps even credited for future
scheduling cycles.

In order to partially avoid underallocating CPU bandwidth to p, one
should track the time last spent sleeping and do the following:
last_sleep = now - p->last_went_to_sleep;
wait = p->avg_cpu_per_cycle*effective_weighted_load/p->load_weight;
wait = wait > last_sleep ? wait - last_sleep : 0;

By and large the scheduler can deterministically choose waits on the
runqueue but it doesn't actually do that. Some additional corrections
for delays beyond those decided up-front while on the runqueue or
getting scheduled early on the runqueue may also help, though I'm not
as interested in them as I am the one for sleep:
last_wait = now - p->last_taken_off_the_cpu;
wait = 

Re: [patch] CFS scheduler, v3

2007-04-20 Thread Peter Williams

Peter Williams wrote:

Ingo Molnar wrote:

* Peter Williams <[EMAIL PROTECTED]> wrote:


- bugfix: use constant offset factor for nice levels instead of
  sched_granularity_ns. Thus nice levels work even if someone sets   
sched_granularity_ns to 0. NOTE: nice support is still naive, i'll   
address the many nice level related suggestions in -v4.
I have a suggestion I'd like to make that addresses both nice and 
fairness at the same time.  As I understand the basic principle 
behind this scheduler is to work out a time by which a task should 
make it onto the CPU and then place it into an ordered list (based on 
this value) of tasks waiting for the CPU. I think that this is a 
great idea [...]


yes, that's exactly the main idea behind CFS, and thanks for the 
compliment :)


Under this concept the scheduler never really has to guess: every 
scheduler decision derives straight from the relatively simple 
one-sentence (!) scheduling concept outlined above. Everything that 
tasks 'get' is something they 'earned' before and all the scheduler 
does are micro-decisions based on math with the nanosec-granularity 
values. Both the rbtree and nanosec accounting are a straight 
consequence of this too: they are the tools that allow the 
implementation of this concept in the highest-quality way. It's 
certainly a very exciting experiment to me and the feedback 'from the 
field' is very promising so far.


[...] and my suggestion is with regard to a method for working out 
this time that takes into account both fairness and nice.


First suppose we have the following metrics available in addition to 
what's already provided.


rq->avg_weight_load /* a running average of the weighted load on the 
CPU */ p->avg_cpu_per_cycle /* the average time in nsecs that p 
spends on the CPU each scheduling cycle */


yes. rq->nr_running is really just a first-level approximation of 
rq->raw_weighted_load. I concentrated on the 'nice 0' case initially.


I appreciate that the notion of basing the expected wait on the 
task's average cpu use per scheduling cycle is counter intuitive but 
I believe that (if you think about it) you'll see that it actually 
makes sense.


hm. So far i tried to not do any statistical approach anywhere: the 
p->wait_runtime metric (which drives the task ordering) is in essence 
an absolutely precise 'integral' of the 'expected runtimes' that the 
task observes and hence is a precise "load-average as observed by the 
task"


To me this is statistics :-)

in itself. Every time we base some metric on an average value we 
introduce noise into the system.


i definitely agree with your suggestion that CFS should use a 
nice-scaled metric for 'load' instead of the current rq->nr_running, 
but regarding the basic calculations i'd rather lean towards using 
rq->raw_weighted_load. Hm?


This can result in jerkiness (in my experience) but using the smoothed 
version is certainly something that can be tried later rather than 
sooner.  Perhaps just something to bear in mind as a solution to 
"jerkiness" if it manifests.




your suggestion concentrates on the following scenario: if a task 
happens to schedule in an 'unlucky' way and happens to hit a busy 
period while there are many idle periods. Unless i misunderstood your 
suggestion, that is the main intention behind it, correct?


You misunderstand (that's one of my other schedulers :-)).  This one's 
based on the premise that if everything happens as the task expects it 
will get the amount of CPU bandwidth (over this short period) that it's 
entitled to.  In reality, sometimes it will get more and sometimes less 
but on average it should get what it deserves. E.g. If you had two tasks 
with equal nice and both had demands of 90% of a CPU you'd expect them 
each to get about half of the CPU bandwidth.  Now suppose that one of 
them uses 5ms of CPU each time it got onto the CPU and the other uses 
10ms.  If these two tasks just round robin with each other the likely 
outcome is that the one with the 10ms bursts will get twice as much CPU 
as the other, but my proposed method should prevent this and cause them to get 
roughly the same amount of CPU.  (I believe this was a scenario that 
caused problems with O(1) and required a fix at some stage?)
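
(The arithmetic behind that 5ms/10ms example, just to make the 1:2 split
explicit: under plain alternation each 15ms round gives one task 5ms and
the other 10ms.)

#include <stdio.h>

int main(void)
{
        const double burst_a = 5.0, burst_b = 10.0;     /* ms on CPU per turn */
        const double cycle = burst_a + burst_b;

        printf("task A share: %.1f%%\n", 100.0 * burst_a / cycle);  /* 33.3% */
        printf("task B share: %.1f%%\n", 100.0 * burst_b / cycle);  /* 66.7% */
        return 0;
}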


BTW this has the advantage that the decay rate used in calculating the 
task's statistics can be used to control how quickly the scheduler 
reacts to changes in the task's behaviour.


I think that, with this model, if the current task hasn't surrendered 
the CPU when the next task on the queue's "on CPU" time arrives then the 
current task should be pre-empted in favour of that task.  I'm not sure 
what would be the best way to implement this.


Peter
--
Peter Williams   [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce

Re: [patch] CFS scheduler, v3

2007-04-20 Thread Peter Williams

Ingo Molnar wrote:

* Peter Williams <[EMAIL PROTECTED]> wrote:


- bugfix: use constant offset factor for nice levels instead of
  sched_granularity_ns. Thus nice levels work even if someone sets 
  sched_granularity_ns to 0. NOTE: nice support is still naive, i'll 
  address the many nice level related suggestions in -v4.
I have a suggestion I'd like to make that addresses both nice and 
fairness at the same time.  As I understand the basic principle behind 
this scheduler is to work out a time by which a task should make it 
onto the CPU and then place it into an ordered list (based on this 
value) of tasks waiting for the CPU. I think that this is a great idea 
[...]


yes, that's exactly the main idea behind CFS, and thanks for the 
compliment :)


Under this concept the scheduler never really has to guess: every 
scheduler decision derives straight from the relatively simple 
one-sentence (!) scheduling concept outlined above. Everything that 
tasks 'get' is something they 'earned' before and all the scheduler does 
are micro-decisions based on math with the nanosec-granularity values. 
Both the rbtree and nanosec accounting are a straight consequence of 
this too: they are the tools that allow the implementation of this 
concept in the highest-quality way. It's certainly a very exciting 
experiment to me and the feedback 'from the field' is very promising so 
far.


[...] and my suggestion is with regard to a method for working out 
this time that takes into account both fairness and nice.


First suppose we have the following metrics available in addition to 
what's already provided.


rq->avg_weight_load /* a running average of the weighted load on the 
CPU */ p->avg_cpu_per_cycle /* the average time in nsecs that p spends 
on the CPU each scheduling cycle */


yes. rq->nr_running is really just a first-level approximation of 
rq->raw_weighted_load. I concentrated on the 'nice 0' case initially.


I appreciate that the notion of basing the expected wait on the task's 
average cpu use per scheduling cycle is counter intuitive but I 
believe that (if you think about it) you'll see that it actually makes 
sense.


hm. So far i tried to not do any statistical approach anywhere: the 
p->wait_runtime metric (which drives the task ordering) is in essence an 
absolutely precise 'integral' of the 'expected runtimes' that the task 
observes and hence is a precise "load-average as observed by the task"


To me this is statistics :-)

in itself. Every time we base some metric on an average value we 
introduce noise into the system.


i definitely agree with your suggestion that CFS should use a 
nice-scaled metric for 'load' instead of the current rq->nr_running, but 
regarding the basic calculations i'd rather lean towards using 
rq->raw_weighted_load. Hm?


This can result in jerkiness (in my experience) but using the smoothed 
version is certainly something that can be tried later rather than 
sooner.  Perhaps just something to bear in mind as a solution to 
"jerkiness" if it manifests.




your suggestion concentrates on the following scenario: if a task 
happens to schedule in an 'unlucky' way and happens to hit a busy period 
while there are many idle periods. Unless i misunderstood your 
suggestion, that is the main intention behind it, correct?


You misunderstand (that's one of my other schedulers :-)).  This one's 
based on the premise that if everything happens as the task expects it 
will get the amount of CPU bandwidth (over this short period) that it's 
entitled to.  In reality, sometimes it will get more and sometimes less 
but on average it should get what it deserves. E.g. If you had two tasks 
with equal nice and both had demands of 90% of a CPU you'd expect them 
each to get about half of the CPU bandwidth.  Now suppose that one of 
them uses 5ms of CPU each time it got onto the CPU and the other uses 
10ms.  If these two tasks just round robin with each other the likely 
outcome is that the one with the 10ms bursts will get twice as much CPU 
as the other, but my proposed method should prevent this and cause them to get 
roughly the same amount of CPU.  (I believe this was a scenario that 
caused problems with O(1) and required a fix at some stage?)


BTW this has the advantage that the decay rate used in calculating the 
task's statistics can be used to control how quickly the scheduler 
reacts to changes in the task's behaviour.


Peter
--
Peter Williams   [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce


Re: [patch] CFS scheduler, v3

2007-04-20 Thread Willy Tarreau
On Fri, Apr 20, 2007 at 04:02:41PM +1000, Peter Williams wrote:
> Willy Tarreau wrote:
> >On Fri, Apr 20, 2007 at 10:10:45AM +1000, Peter Williams wrote:
> >>Ingo Molnar wrote:
> >>>- bugfix: use constant offset factor for nice levels instead of 
> >>>  sched_granularity_ns. Thus nice levels work even if someone sets 
> >>>  sched_granularity_ns to 0. NOTE: nice support is still naive, i'll 
> >>>  address the many nice level related suggestions in -v4.
> >>I have a suggestion I'd like to make that addresses both nice and 
> >>fairness at the same time.  As I understand the basic principle behind 
> >>this scheduler is to work out a time by which a task should make it onto 
> >>the CPU and then place it into an ordered list (based on this value) of 
> >>tasks waiting for the CPU.  I think that this is a great idea and my 
> >>suggestion is with regard to a method for working out this time that 
> >>takes into account both fairness and nice.
> >>
> >>First suppose we have the following metrics available in addition to 
> >>what's already provided.
> >>
> >>rq->avg_weight_load /* a running average of the weighted load on the CPU 
> >>*/
> >>p->avg_cpu_per_cycle /* the average time in nsecs that p spends on the 
> >>CPU each scheduling cycle */
> >>
> >>where a scheduling cycle for a task starts when it is placed on the 
> >>queue after waking or being preempted and ends when it is taken off the 
> >>CPU either voluntarily or after being preempted.  So 
> >>p->avg_cpu_per_cycle is just the average amount of time p spends on the 
> >>CPU each time it gets on to the CPU.  Sorry for the long explanation 
> >>here but I just wanted to make sure there was no chance that "scheduling 
> >>cycle" would be construed as some mechanism being imposed on the 
> >>scheduler.)
> >>
> >>We can then define:
> >>
> >>effective_weighted_load = max(rq->raw_weighted_load, 
> >>rq->avg_weighted_load)
> >>
> >>If p is just waking (i.e. it's not on the queue and its load_weight is 
> >>not included in rq->raw_weighted_load) and we need to queue it, we say 
> >>that the maximum time (in all fairness) that p should have to wait to 
> >>get onto the CPU is:
> >>
> >>expected_wait = p->avg_cpu_per_cycle * effective_weighted_load / 
> >>p->load_weight
> >>
> >>Calculating p->avg_cpu_per_cycle costs one add, one multiply and one 
> >>shift right per scheduling cycle of the task.  An additional cost is 
> >>that you need a shift right to get the nanosecond value from value 
> >>stored in the task struct. (i.e. the above code is simplified to give 
> >>the general idea).  The average would be number of cycles based rather 
> >>than time based and (happily) this simplifies the calculations.
> >>
> >>If the expected time to get onto the CPU (i.e. expected_wait plus the 
> >>current time) for p is earlier than the equivalent time for the 
> >>currently running task then preemption of that task would be justified.
> >
> >I 100% agree on this method because I came to nearly the same conclusion on
> >paper about 1 year ago. What I'd like to add is that the expected wake up 
> >time
> >is not the most precise criterion for fairness.
> 
> It's not the expected wake up time being computed.  It's the expected 
> time that the task is selected to run after being put on the queue 
> (either as a result of waking or being pre-empted).

Sorry, maybe I've not chosen the appropriate words. When I said "wakeup time",
I meant "the date at which the task must reach the CPU". They are the same
when the task is alone, but different when other tasks are run before.

But my goal definitely was to avoid using this alone and use the expected
completion time instead, from which the "on CPU" date can be computed.

The on-CPU time should be reduced if the task ran too long on previous
iterations, and enlarged if the task did not consume all of its last
timeslice, hence the idea behind the credit. I think that it is important
for streaming processes (eg: mp3 players) which will not always consume
the same amount of CPU but are very close to an average across a
few timeslices.

[...]

> In practice, it might be prudent to use the idea of maximum expected "on 
> CPU" time instead of the average (especially if you're going to use this 
> for triggering pre-emption).

yes

> This would be the average plus two or 
> three time the standard deviation and has the down side that you have to 
> calculate the standard deviation and that means a square root (of course 
> we could use some less mathematically rigorous metric to approximate the 
> standard deviation).

You can just skip the square root and compare the squared values
instead.
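
(A small sketch of that trick: testing "more than k standard deviations
above the mean" never needs the square root, since it is equivalent to
comparing the squared deviation against k^2 times the variance.  The
running-statistics update below is only an illustrative choice.)

#include <stdio.h>

struct stats {
        double mean;
        double var;     /* running estimate of the variance */
};

/* exponentially weighted update of mean and variance; alpha is arbitrary */
static void stats_update(struct stats *s, double sample, double alpha)
{
        double d = sample - s->mean;

        s->mean += alpha * d;
        s->var   = (1.0 - alpha) * (s->var + alpha * d * d);
}

/* is 'sample' above the mean by more than k standard deviations?  no sqrt */
static int exceeds_k_stddev(const struct stats *s, double sample, double k)
{
        double d = sample - s->mean;

        return d > 0.0 && d * d > k * k * s->var;
}

int main(void)
{
        struct stats s = { .mean = 5.0, .var = 0.25 };  /* ms */
        int i;

        for (i = 0; i < 20; i++)
                stats_update(&s, 5.0 + (i % 3) * 0.1, 0.125);

        printf("9.0ms beyond 3 sigma: %d\n", exceeds_k_stddev(&s, 9.0, 3.0));
        printf("5.2ms beyond 3 sigma: %d\n", exceeds_k_stddev(&s, 5.2, 3.0));
        return 0;
}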

> The advantage would be that streamer type programs 
> such as audio/video applications would have very small standard 
> deviations and would hence get earlier expected "on CPU" times than 
> other tasks with similar averages but more random distribution.

I like this idea.

Regards,
Willy


Re: [patch] CFS scheduler, v3

2007-04-20 Thread Ingo Molnar

* Peter Williams <[EMAIL PROTECTED]> wrote:

> > - bugfix: use constant offset factor for nice levels instead of
> >   sched_granularity_ns. Thus nice levels work even if someone sets 
> >   sched_granularity_ns to 0. NOTE: nice support is still naive, i'll 
> >   address the many nice level related suggestions in -v4.
> 
> I have a suggestion I'd like to make that addresses both nice and 
> fairness at the same time.  As I understand the basic principle behind 
> this scheduler is to work out a time by which a task should make it 
> onto the CPU and then place it into an ordered list (based on this 
> value) of tasks waiting for the CPU. I think that this is a great idea 
> [...]

yes, that's exactly the main idea behind CFS, and thanks for the 
compliment :)

Under this concept the scheduler never really has to guess: every 
scheduler decision derives straight from the relatively simple 
one-sentence (!) scheduling concept outlined above. Everything that 
tasks 'get' is something they 'earned' before and all the scheduler does 
are micro-decisions based on math with the nanosec-granularity values. 
Both the rbtree and nanosec accounting are a straight consequence of 
this too: they are the tools that allow the implementation of this 
concept in the highest-quality way. It's certainly a very exciting 
experiment to me and the feedback 'from the field' is very promising so 
far.

> [...] and my suggestion is with regard to a method for working out 
> this time that takes into account both fairness and nice.
> 
> First suppose we have the following metrics available in addition to 
> what's already provided.
> 
> rq->avg_weight_load /* a running average of the weighted load on the 
> CPU */ p->avg_cpu_per_cycle /* the average time in nsecs that p spends 
> on the CPU each scheduling cycle */

yes. rq->nr_running is really just a first-level approximation of 
rq->raw_weighted_load. I concentrated on the 'nice 0' case initially.

> I appreciate that the notion of basing the expected wait on the task's 
> average cpu use per scheduling cycle is counter intuitive but I 
> believe that (if you think about it) you'll see that it actually makes 
> sense.

hm. So far i tried to not do any statistical approach anywhere: the 
p->wait_runtime metric (which drives the task ordering) is in essence an 
absolutely precise 'integral' of the 'expected runtimes' that the task 
observes and hence is a precise "load-average as observed by the task" 
in itself. Every time we base some metric on an average value we 
introduce noise into the system.

i definitely agree with your suggestion that CFS should use a 
nice-scaled metric for 'load' instead of the current rq->nr_running, but 
regarding the basic calculations i'd rather lean towards using 
rq->raw_weighted_load. Hm?

your suggestion concentrates on the following scenario: if a task 
happens to schedule in an 'unlucky' way and happens to hit a busy period 
while there are many idle periods. Unless i misunderstood your 
suggestion, that is the main intention behind it, correct?

Ingo


Re: [patch] CFS scheduler, v3

2007-04-20 Thread Peter Williams

Peter Williams wrote:

Willy Tarreau wrote:

On Fri, Apr 20, 2007 at 10:10:45AM +1000, Peter Williams wrote:

Ingo Molnar wrote:
- bugfix: use constant offset factor for nice levels instead of   
sched_granularity_ns. Thus nice levels work even if someone sets   
sched_granularity_ns to 0. NOTE: nice support is still naive, i'll   
address the many nice level related suggestions in -v4.
I have a suggestion I'd like to make that addresses both nice and 
fairness at the same time.  As I understand the basic principle 
behind this scheduler is to work out a time by which a task should 
make it onto the CPU and then place it into an ordered list (based on 
this value) of tasks waiting for the CPU.  I think that this is a 
great idea and my suggestion is with regard to a method for working 
out this time that takes into account both fairness and nice.


First suppose we have the following metrics available in addition to 
what's already provided.


rq->avg_weight_load /* a running average of the weighted load on the 
CPU */
p->avg_cpu_per_cycle /* the average time in nsecs that p spends on 
the CPU each scheduling cycle */


where a scheduling cycle for a task starts when it is placed on the 
queue after waking or being preempted and ends when it is taken off 
the CPU either voluntarily or after being preempted.  So 
p->avg_cpu_per_cycle is just the average amount of time p spends on 
the CPU each time it gets on to the CPU.  Sorry for the long 
explanation here but I just wanted to make sure there was no chance 
that "scheduling cycle" would be construed as some mechanism being 
imposed on the scheduler.)


We can then define:

effective_weighted_load = max(rq->raw_weighted_load, 
rq->avg_weighted_load)


If p is just waking (i.e. it's not on the queue and its load_weight 
is not included in rq->raw_weighted_load) and we need to queue it, we 
say that the maximum time (in all fairness) that p should have to 
wait to get onto the CPU is:


expected_wait = p->avg_cpu_per_cycle * effective_weighted_load / 
p->load_weight


Calculating p->avg_cpu_per_cycle costs one add, one multiply and one 
shift right per scheduling cycle of the task.  An additional cost is 
that you need a shift right to get the nanosecond value from value 
stored in the task struct. (i.e. the above code is simplified to give 
the general idea).  The average would be number of cycles based 
rather than time based and (happily) this simplifies the calculations.


If the expected time to get onto the CPU (i.e. expected_wait plus the 
current time) for p is earlier than the equivalent time for the 
currently running task then preemption of that task would be justified.


I 100% agree on this method because I came to nearly the same conclusion on
paper about 1 year ago. What I'd like to add is that the expected wake up time
is not the most precise criterion for fairness.


It's not the expected wake up time being computed.  It's the expected 
time that the task is selected to run after being put on the queue 
(either as a result of waking or being pre-empted).


I think that your comments below are mainly invalid because of this 
understanding.



The expected completion
time is better. When you have one task t1 which is expected to run for T1
nanosecs and another task t2 which is expected to run for T2, what is
important for the user for fairness is when the task completes its work. If
t1 should wake up at time W1 and t2 at W2, then the list should be ordered
by comparing W1+T1 and W2+T2.


This is a point where misunderstanding results in a wrong conclusion. 
However, the notion of expected completion time is useful.  I'd 
calculate the expected completion time for the task as it goes on to the 
CPU and if while it is running a new task is queued whose expected "on 
CPU" time is before the current task's expected end time you can think 
about preempting when the new task's "on CPU" time arrives -- how 
hard/easy this would be to implement is moot.




What I like with this method is that it remains fair with nice tasks because
in order to renice a task tN, you just have to change TN, and if it
has to run shorter, it can be executed before CPU hogs and stay there for a
very short time.

Also, I found that if we want to respect interactivity, we must conserve a
credit for each task.


This is one point where the misunderstanding results in an invalid 
conclusion.  There is no need for credits as the use of the average time 
on CPU each cycle makes the necessary corrections that credits would be 
used to address.



It is a bounded amount of CPU time left to be used. When
the task t3 has the right to use T3 nsecs, and wakes up at W3, if it does not
spend T3 nsec on the CPU, but only N3 < T3, then we have a credit C3 computed
like this :

  C3 = MIN(MAX_CREDIT, C3 + T3 - N3)

And if a CPU hog uses more than its assigned time slice due to scheduler
resolution, then C3 can become negative (and bounded too) :

  C3 = MAX(MIN_CREDIT, C3 + T3 - N3)
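
Read as code, the credit update being quoted above is just a clamped
accumulator (a sketch only; the bounds are arbitrary placeholder values):

#define MAX_CREDIT  ( 20000000LL)       /* e.g. +20 ms, an assumed bound */
#define MIN_CREDIT  (-20000000LL)       /* e.g. -20 ms, an assumed bound */

/* T is the CPU time the task was entitled to this cycle, N the time it
 * actually used; the credit stays within [MIN_CREDIT, MAX_CREDIT] */
static long long update_credit(long long credit, long long T, long long N)
{
        credit += T - N;
        if (credit > MAX_CREDIT)
                credit = MAX_CREDIT;
        if (credit < MIN_CREDIT)
                credit = MIN_CREDIT;
        return credit;
}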

Now how is 

Re: [patch] CFS scheduler, v3

2007-04-20 Thread Peter Williams

Willy Tarreau wrote:

On Fri, Apr 20, 2007 at 10:10:45AM +1000, Peter Williams wrote:

Ingo Molnar wrote:
- bugfix: use constant offset factor for nice levels instead of 
  sched_granularity_ns. Thus nice levels work even if someone sets 
  sched_granularity_ns to 0. NOTE: nice support is still naive, i'll 
  address the many nice level related suggestions in -v4.
I have a suggestion I'd like to make that addresses both nice and 
fairness at the same time.  As I understand the basic principle behind 
this scheduler is to work out a time by which a task should make it onto 
the CPU and then place it into an ordered list (based on this value) of 
tasks waiting for the CPU.  I think that this is a great idea and my 
suggestion is with regard to a method for working out this time that 
takes into account both fairness and nice.


First suppose we have the following metrics available in addition to 
what's already provided.


rq->avg_weight_load /* a running average of the weighted load on the CPU */
p->avg_cpu_per_cycle /* the average time in nsecs that p spends on the 
CPU each scheduling cycle */


where a scheduling cycle for a task starts when it is placed on the 
queue after waking or being preempted and ends when it is taken off the 
CPU either voluntarily or after being preempted.  So 
p->avg_cpu_per_cycle is just the average amount of time p spends on the 
CPU each time it gets on to the CPU.  Sorry for the long explanation 
here but I just wanted to make sure there was no chance that "scheduling 
cycle" would be construed as some mechanism being imposed on the scheduler.)


We can then define:

effective_weighted_load = max(rq->raw_weighted_load, rq->avg_weighted_load)

If p is just waking (i.e. it's not on the queue and its load_weight is 
not included in rq->raw_weighted_load) and we need to queue it, we say 
that the maximum time (in all fairness) that p should have to wait to 
get onto the CPU is:


expected_wait = p->avg_cpu_per_cycle * effective_weighted_load / 
p->load_weight


Calculating p->avg_cpu_per_cycle costs one add, one multiply and one 
shift right per scheduling cycle of the task.  An additional cost is 
that you need a shift right to get the nanosecond value from value 
stored in the task struct. (i.e. the above code is simplified to give 
the general idea).  The average would be number of cycles based rather 
than time based and (happily) this simplifies the calculations.
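
For what it's worth, one way to read that arithmetic (a sketch with assumed
names and an assumed decay factor of 1/8 per cycle, not code from any patch)
is:

#define AVG_SHIFT 3     /* assumed decay of 1/8 per scheduling cycle */

struct task_avg {
        unsigned long long avg_cpu_scaled;  /* avg_cpu_per_cycle << AVG_SHIFT */
        unsigned long load_weight;
};

/* one multiply, one shift right and one add per scheduling cycle */
static void end_of_cycle(struct task_avg *p, unsigned long long ran_ns)
{
        p->avg_cpu_scaled = ((p->avg_cpu_scaled * ((1 << AVG_SHIFT) - 1))
                             >> AVG_SHIFT) + ran_ns;
}

static unsigned long long expected_wait(struct task_avg *p,
                                        unsigned long effective_weighted_load)
{
        /* the extra shift right recovers the nanosecond value */
        unsigned long long avg_ns = p->avg_cpu_scaled >> AVG_SHIFT;

        return avg_ns * effective_weighted_load / p->load_weight;
}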


If the expected time to get onto the CPU (i.e. expected_wait plus the 
current time) for p is earlier than the equivalent time for the 
currently running task then preemption of that task would be justified.


I 100% agree on this method because I came to nearly the same conclusion on
paper about 1 year ago. What I'd like to add is that the expected wake up time
is not the most precise criterion for fairness.


It's not the expected wake up time being computed.  It's the expected 
time that the task is selected to run after being put on the queue 
(either as a result of waking or being pre-empted).


I think that your comments below are mainly invalid because of this 
understanding.



The expected completion
time is better. When you have one task t1 which is expected to run for T1
nanosecs and another task t2 which is expected to run for T2, what is
important for the user for fairness is when the task completes its work. If
t1 should wake up at time W1 and t2 at W2, then the list should be ordered
by comparing W1+T1 and W2+T2.
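
As a minimal sketch of that ordering (with W the time a task should reach
the CPU and T its expected run time, both assumed to be tracked somewhere):

/* order two queued tasks by expected completion time W + T */
struct sketch_entry {
        unsigned long long W;   /* time the task should reach the CPU */
        unsigned long long T;   /* time it is expected to run for */
};

static int completes_earlier(const struct sketch_entry *a,
                             const struct sketch_entry *b)
{
        return a->W + a->T < b->W + b->T;
}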


This is a point where misunderstanding results in a wrong conclusion. 
However, the notion of expected completion time is useful.  I'd 
calculate the expected completion time for the task as it goes on to the 
CPU and if while it is running a new task is queued whose expected "on 
CPU" time is before the current task's expected end time you can think 
about preempting when the new task's "on CPU" time arrives -- how 
hard/easy this would be to implement is moot.




What I like with this method is that it remains fair with nice tasks because
in order to renice a task tN, you just have to change TN, and if it
has to run shorter, it can be executed before CPU hogs and stay there for a
very short time.

Also, I found that if we want to respect interactivity, we must conserve a
credit for each task.


This is one point where the misunderstanding results in an invalid 
conclusion.  There is no need for credits as the use of the average time 
on CPU each cycle makes the necessary corrections that credits would be 
used to address.



It is a bounded amount of CPU time left to be used. When
the task t3 has the right to use T3 nsecs, and wakes up at W3, if it does not
spend T3 nsec on the CPU, but only N3 < T3 [...]

I think this is overcomplicating matters.

However, some compensating credit might be in order for tasks that get 
pre-empted, e.g. base their next expected "on CPU" time on the difference 
between their average on CPU time and the amount they got to use before 
they were pre-empted.

Re: [patch] CFS scheduler, v3

2007-04-20 Thread Willy Tarreau
On Fri, Apr 20, 2007 at 04:02:41PM +1000, Peter Williams wrote:
 Willy Tarreau wrote:
 On Fri, Apr 20, 2007 at 10:10:45AM +1000, Peter Williams wrote:
 Ingo Molnar wrote:
 - bugfix: use constant offset factor for nice levels instead of 
   sched_granularity_ns. Thus nice levels work even if someone sets 
   sched_granularity_ns to 0. NOTE: nice support is still naive, i'll 
   address the many nice level related suggestions in -v4.
 I have a suggestion I'd like to make that addresses both nice and 
 fairness at the same time.  As I understand the basic principle behind 
 this scheduler is to work out a time by which a task should make it onto 
 the CPU and then place it into an ordered list (based on this value) of 
 tasks waiting for the CPU.  I think that this is a great idea and my 
 suggestion is with regard to a method for working out this time that 
 takes into account both fairness and nice.
 
 First suppose we have the following metrics available in addition to 
 what's already provided.
 
 rq->avg_weight_load /* a running average of the weighted load on the CPU 
 */
 p->avg_cpu_per_cycle /* the average time in nsecs that p spends on the 
 CPU each scheduling cycle */
 
 where a scheduling cycle for a task starts when it is placed on the 
 queue after waking or being preempted and ends when it is taken off the 
 CPU either voluntarily or after being preempted.  So 
 p->avg_cpu_per_cycle is just the average amount of time p spends on the 
 CPU each time it gets on to the CPU.  Sorry for the long explanation 
 here but I just wanted to make sure there was no chance that scheduling 
 cycle would be construed as some mechanism being imposed on the 
 scheduler.)
 
 We can then define:
 
 effective_weighted_load = max(rq->raw_weighted_load, 
 rq->avg_weighted_load)
 
 If p is just waking (i.e. it's not on the queue and its load_weight is 
 not included in rq->raw_weighted_load) and we need to queue it, we say 
 that the maximum time (in all fairness) that p should have to wait to 
 get onto the CPU is:
 
 expected_wait = p->avg_cpu_per_cycle * effective_weighted_load / 
 p->load_weight
 
 Calculating p->avg_cpu_per_cycle costs one add, one multiply and one 
 shift right per scheduling cycle of the task.  An additional cost is 
 that you need a shift right to get the nanosecond value from value 
 stored in the task struct. (i.e. the above code is simplified to give 
 the general idea).  The average would be number of cycles based rather 
 than time based and (happily) this simplifies the calculations.
 
 If the expected time to get onto the CPU (i.e. expected_wait plus the 
 current time) for p is earlier than the equivalent time for the 
 currently running task then preemption of that task would be justified.
 
 I 100% agree on this method because I came to nearly the same conclusion on
 paper about 1 year ago. What I'd like to add is that the expected wake up 
 time
 is not the most precise criterion for fairness.
 
 It's not the expected wake up time being computed.  It's the expected 
 time that the task is selected to run after being put on the queue 
 (either as a result of waking or being pre-empted).

Sorry, maybe I've not chosen the appropriate words. When I said "wakeup time",
I meant the time at which the task must reach the CPU. They are the same
when the task is alone, but different when other tasks are run before it.

But my goal definitely was to avoid using this alone and to use the expected
completion time instead, from which the "on CPU" time can be computed.

The on-CPU time should be reduced if the task ran too long on previous
iterations, and enlarged if the task did not consume all of its last
timeslice, hence the idea behind the credit. I think that it is important
for streaming processes (e.g. mp3 players) which will not always consume
the same amount of CPU but stay very close to an average across a
few timeslices.

[...]

 In practice, it might be prudent to use the idea of maximum expected on 
 CPU time instead of the average (especially if you're going to use this 
 for triggering pre-emption).

yes

 This would be the average plus two or 
 three times the standard deviation and has the downside that you have to 
 calculate the standard deviation and that means a square root (of course 
 we could use some less mathematically rigorous metric to approximate the 
 standard deviation).

You can simply skip the square root and compare the squared values
instead.
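
For example (a sketch, assuming each task keeps a running mean and a running
mean of squares of its on-CPU time), the smoothness comparison never needs a
square root at all:

struct burst_stats {
        long long mean;         /* running mean of on-CPU time, in ns */
        long long mean_sq;      /* running mean of its square */
};

static long long variance(const struct burst_stats *s)
{
        return s->mean_sq - s->mean * s->mean;
}

static int smoother(const struct burst_stats *a, const struct burst_stats *b)
{
        /* sqrt() is monotonic, so comparing variances orders the tasks the
         * same way as comparing standard deviations would */
        return variance(a) < variance(b);
}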

 The advantage would be that streamer type programs 
 such as audio/video applications would have very small standard 
 deviations and would hence get earlier expected on CPU times than 
 other tasks with similar averages but more random distribution.

I like this idea.

Regards,
Willy

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] CFS scheduler, v3

2007-04-20 Thread Peter Williams

Ingo Molnar wrote:

* Peter Williams [EMAIL PROTECTED] wrote:


- bugfix: use constant offset factor for nice levels instead of
  sched_granularity_ns. Thus nice levels work even if someone sets 
  sched_granularity_ns to 0. NOTE: nice support is still naive, i'll 
  address the many nice level related suggestions in -v4.
I have a suggestion I'd like to make that addresses both nice and 
fairness at the same time.  As I understand the basic principle behind 
this scheduler is to work out a time by which a task should make it 
onto the CPU and then place it into an ordered list (based on this 
value) of tasks waiting for the CPU. I think that this is a great idea 
[...]


yes, that's exactly the main idea behind CFS, and thanks for the 
compliment :)


Under this concept the scheduler never really has to guess: every 
scheduler decision derives straight from the relatively simple 
one-sentence (!) scheduling concept outlined above. Everything that 
tasks 'get' is something they 'earned' before and all the scheduler does 
are micro-decisions based on math with the nanosec-granularity values. 
Both the rbtree and nanosec accounting are a straight consequence of 
this too: they are the tools that allow the implementation of this 
concept in the highest-quality way. It's certainly a very exciting 
experiment to me and the feedback 'from the field' is very promising so 
far.


[...] and my suggestion is with regard to a method for working out 
this time that takes into account both fairness and nice.


First suppose we have the following metrics available in addition to 
what's already provided.


rq->avg_weight_load /* a running average of the weighted load on the 
CPU */ p->avg_cpu_per_cycle /* the average time in nsecs that p spends 
on the CPU each scheduling cycle */


yes. rq->nr_running is really just a first-level approximation of 
rq->raw_weighted_load. I concentrated on the 'nice 0' case initially.


I appreciate that the notion of basing the expected wait on the task's 
average cpu use per scheduling cycle is counter intuitive but I 
believe that (if you think about it) you'll see that it actually makes 
sense.


hm. So far i tried to not do any statistical approach anywhere: the 
p->wait_runtime metric (which drives the task ordering) is in essence an 
absolutely precise 'integral' of the 'expected runtimes' that the task 
observes and hence is a precise "load-average as observed by the task"


To me this is statistics :-)

in itself. Every time we base some metric on an average value we 
introduce noise into the system.


i definitely agree with your suggestion that CFS should use a 
nice-scaled metric for 'load' instead of the current rq->nr_running, but 
regarding the basic calculations i'd rather lean towards using 
rq->raw_weighted_load. Hm?


This can result in jerkiness (in my experience) but using the smoothed 
version is certainly something that can be tried later rather than 
sooner.  Perhaps just something to bear in mind as a solution to 
jerkiness if it manifests.




your suggestion concentrates on the following scenario: if a task 
happens to schedule in an 'unlucky' way and happens to hit a busy period 
while there are many idle periods. Unless i misunderstood your 
suggestion, that is the main intention behind it, correct?


You misunderstand (that's one of my other schedulers :-)).  This one's 
based on the premise that if everything happens as the task expects it 
will get the amount of CPU bandwidth (over this short period) that it's 
entitled to.  In reality, sometimes it will get more and sometimes less 
but on average it should get what it deserves. E.g. If you had two tasks 
with equal nice and both had demands of 90% of a CPU you'd expect them 
each to get about half of the CPU bandwidth.  Now suppose that one of 
them uses 5ms of CPU each time it got onto the CPU and the other uses 
10ms.  If these two tasks just round robin with each other the likely 
outcome is that the one with the 10ms bursts will get twice as much CPU 
as the other, but my proposed method should prevent this and cause them to get 
roughly the same amount of CPU.  (I believe this was a scenario that 
caused problems with O(1) and required a fix at some stage?)
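
To put rough numbers on that (assuming the two tasks are the only runnable
ones and have equal nice, so effective_weighted_load / load_weight comes out
at about 2): the 5ms task would be queued with expected_wait = 5ms * 2 = 10ms
and the 10ms task with 10ms * 2 = 20ms.  Ordering by those times interleaves
them roughly as 5ms, 5ms, 10ms, so over every 20ms of CPU each gets about
10ms, whereas plain round robin interleaves them 5ms, 10ms and gives the
second task twice as much.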


BTW this has the advantage that the decay rate used in calculating the 
task's statistics can be used to control how quickly the scheduler 
reacts to changes in the task's behaviour.


Peter
--
Peter Williams   [EMAIL PROTECTED]

Learning, n. The kind of ignorance distinguishing the studious.
 -- Ambrose Bierce
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] CFS scheduler, v3

2007-04-20 Thread Peter Williams

Peter Williams wrote:

Ingo Molnar wrote:

* Peter Williams [EMAIL PROTECTED] wrote:


- bugfix: use constant offset factor for nice levels instead of
  sched_granularity_ns. Thus nice levels work even if someone sets   
sched_granularity_ns to 0. NOTE: nice support is still naive, i'll   
address the many nice level related suggestions in -v4.
I have a suggestion I'd like to make that addresses both nice and 
fairness at the same time.  As I understand the basic principle 
behind this scheduler is to work out a time by which a task should 
make it onto the CPU and then place it into an ordered list (based on 
this value) of tasks waiting for the CPU. I think that this is a 
great idea [...]


yes, that's exactly the main idea behind CFS, and thanks for the 
compliment :)


Under this concept the scheduler never really has to guess: every 
scheduler decision derives straight from the relatively simple 
one-sentence (!) scheduling concept outlined above. Everything that 
tasks 'get' is something they 'earned' before and all the scheduler 
does are micro-decisions based on math with the nanosec-granularity 
values. Both the rbtree and nanosec accounting are a straight 
consequence of this too: they are the tools that allow the 
implementation of this concept in the highest-quality way. It's 
certainly a very exciting experiment to me and the feedback 'from the 
field' is very promising so far.


[...] and my suggestion is with regard to a method for working out 
this time that takes into account both fairness and nice.


First suppose we have the following metrics available in addition to 
what's already provided.


rq->avg_weight_load /* a running average of the weighted load on the 
CPU */ p->avg_cpu_per_cycle /* the average time in nsecs that p 
spends on the CPU each scheduling cycle */


yes. rq->nr_running is really just a first-level approximation of 
rq->raw_weighted_load. I concentrated on the 'nice 0' case initially.


I appreciate that the notion of basing the expected wait on the 
task's average cpu use per scheduling cycle is counter intuitive but 
I believe that (if you think about it) you'll see that it actually 
makes sense.


hm. So far i tried to not do any statistical approach anywhere: the 
p->wait_runtime metric (which drives the task ordering) is in essence 
an absolutely precise 'integral' of the 'expected runtimes' that the 
task observes and hence is a precise "load-average as observed by the 
task"


To me this is statistics :-)

in itself. Every time we base some metric on an average value we 
introduce noise into the system.


i definitely agree with your suggestion that CFS should use a 
nice-scaled metric for 'load' instead of the current rq->nr_running, 
but regarding the basic calculations i'd rather lean towards using 
rq->raw_weighted_load. Hm?


This can result in jerkiness (in my experience) but using the smoothed 
version is certainly something that can be tried later rather than 
sooner.  Perhaps just something to bear in mind as a solution to 
jerkiness if it manifests.




your suggestion concentrates on the following scenario: if a task 
happens to schedule in an 'unlucky' way and happens to hit a busy 
period while there are many idle periods. Unless i misunderstood your 
suggestion, that is the main intention behind it, correct?


You misunderstand (that's one of my other schedulers :-)).  This one's 
based on the premise that if everything happens as the task expects it 
will get the amount of CPU bandwidth (over this short period) that it's 
entitled to.  In reality, sometimes it will get more and sometimes less 
but on average it should get what it deserves. E.g. If you had two tasks 
with equal nice and both had demands of 90% of a CPU you'd expect them 
each to get about half of the CPU bandwidth.  Now suppose that one of 
them uses 5ms of CPU each time it got onto the CPU and the other uses 
10ms.  If these two tasks just round robin with each other the likely 
outcome is that the one with the 10ms bursts will get twice as much CPU 
as the other, but my proposed method should prevent this and cause them to get 
roughly the same amount of CPU.  (I believe this was a scenario that 
caused problems with O(1) and required a fix at some stage?)


BTW this has the advantage that the decay rate used in calculating the 
task's statistics can be used to control how quickly the scheduler 
reacts to changes in the task's behaviour.


I think that, with this model, if the current task hasn't surrendered 
the CPU when the next task on the queue's "on CPU" time arrives, the 
current task should be pre-empted in favour of that task.  I'm not sure 
what would be the best way to implement this.
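
One simple way to do it (a sketch only, assuming the expected "on CPU" time
of the leftmost queued task is tracked somewhere) would be a check from the
periodic tick:

/* preempt the running task once the earliest queued task's expected
 * "on CPU" time has been reached */
static int should_preempt(unsigned long long now_ns,
                          unsigned long long next_expected_oncpu_ns,
                          int queue_nonempty)
{
        return queue_nonempty && now_ns >= next_expected_oncpu_ns;
}

A finer-grained alternative would be to arm a one-shot timer for that
expected time instead of waiting for the next tick, at the cost of extra
timer traffic.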


Peter
--
Peter Williams   [EMAIL PROTECTED]

Learning, n. The kind of ignorance distinguishing the studious.
 -- Ambrose Bierce
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo 

Re: [patch] CFS scheduler, v3

2007-04-20 Thread Ingo Molnar

* Peter Williams [EMAIL PROTECTED] wrote:

 BTW Given that I'm right and dynamic priorities have been dispensed 
 with what do you intend exporting (in their place) to user space for 
 display by top and similar?

well i thought of only displaying static ones, i.e. like the current 
patch does.

Ingo
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] CFS scheduler, v3

2007-04-20 Thread Peter Williams

Peter Williams wrote:

Ingo Molnar wrote:


 - bugfix: use constant offset factor for nice levels instead of
sched_granularity_ns. Thus nice levels work even if someone sets
sched_granularity_ns to 0. NOTE: nice support is still naive, i'll
address the many nice level related suggestions in -v4.


I have a suggestion I'd like to make that addresses both nice and 
fairness at the same time.  As I understand the basic principle behind 
this scheduler is to work out a time by which a task should make it onto 
the CPU and then place it into an ordered list (based on this value) of 
tasks waiting for the CPU.  I think that this is a great idea and my 
suggestion is with regard to a method for working out this time that 
takes into account both fairness and nice.


First suppose we have the following metrics available in addition to 
what's already provided.


rq->avg_weight_load /* a running average of the weighted load on the CPU */
p->avg_cpu_per_cycle /* the average time in nsecs that p spends on the 
CPU each scheduling cycle */


where a scheduling cycle for a task starts when it is placed on the 
queue after waking or being preempted and ends when it is taken off the 
CPU either voluntarily or after being preempted.  So 
p->avg_cpu_per_cycle is just the average amount of time p spends on the 
CPU each time it gets on to the CPU.  Sorry for the long explanation 
here but I just wanted to make sure there was no chance that scheduling 
cycle would be construed as some mechanism being imposed on the 
scheduler.)


We can then define:

effective_weighted_load = max(rq->raw_weighted_load, rq->avg_weighted_load)

If p is just waking (i.e. it's not on the queue and its load_weight is 
not included in rq->raw_weighted_load) and we need to queue it, we say 
that the maximum time (in all fairness) that p should have to wait to 
get onto the CPU is:


expected_wait = p->avg_cpu_per_cycle * effective_weighted_load / 
p->load_weight


I just realized that this is wrong for the case where p is being woken. 
 In that case the length of the last sleep should be subtracted from 
the above value and if the result is negative pre-empt straight away. 
So expected_wait becomes:


expected_wait = p->avg_cpu_per_cycle * effective_weighted_load / 
p->load_weight - p->length_of_last_sleep


As the length of the last sleep will be calculated during the wake 
process anyway, there's no real need for it to be a task field and a local 
variable could be used instead.


For a task being requeued during pre-emption the equation would be:

expected_wait = time_just_spent_on_the_cpu * effective_weighted_load / 
p->load_weight


as there was zero sleep since last time on CPU.  Using the actual time 
on the CPU so far this time (instead of the average) will compensate the 
task for being pre-empted.
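
Putting the waking and requeue cases together (a sketch only, with assumed
names rather than anything from the patch):

/* deadline by which the task should be given the CPU, measured from now */
static unsigned long long queue_deadline(unsigned long long avg_cpu_ns,
                                         unsigned long long used_this_time_ns,
                                         unsigned long long slept_ns,
                                         unsigned long effective_weighted_load,
                                         unsigned long load_weight,
                                         int was_preempted)
{
        unsigned long long wait;

        if (was_preempted)
                /* requeue after preemption: no sleep to credit, use the CPU
                 * time actually consumed so far this time on the CPU */
                return used_this_time_ns * effective_weighted_load / load_weight;

        /* waking: the fair wait minus the time already spent asleep;
         * zero (i.e. preempt straight away) if it slept longer than that */
        wait = avg_cpu_ns * effective_weighted_load / load_weight;
        return wait > slept_ns ? wait - slept_ns : 0;
}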




Calculating p->avg_cpu_per_cycle costs one add, one multiply and one 
shift right per scheduling cycle of the task.  An additional cost is 
that you need a shift right to get the nanosecond value from value 
stored in the task struct. (i.e. the above code is simplified to give 
the general idea).  The average would be number of cycles based rather 
than time based and (happily) this simplifies the calculations.


If you don't like using the average CPU time per cycle, you could just 
use the length of time on the CPU for the last time the task was on the 
CPU.  This would simplify things a lot.


I've been thinking about the jerkiness (or lack of smoothness) that 
I said would occur if smoothed averages weren't used and I've realised 
that what I was talking about was observed jerkiness in the dynamic 
priorities of the tasks.  As this scheduler has dispensed with dynamic 
priorities (as far as I can see) the jerkiness probably won't be apparent.


BTW Given that I'm right and dynamic priorities have been dispensed with 
what do you intend exporting (in their place) to user space for display 
by top and similar?


Peter
--
Peter Williams   [EMAIL PROTECTED]

Learning, n. The kind of ignorance distinguishing the studious.
 -- Ambrose Bierce
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] CFS scheduler, v3

2007-04-20 Thread William Lee Irwin III
On Fri, Apr 20, 2007 at 10:10:45AM +1000, Peter Williams wrote:
 I have a suggestion I'd like to make that addresses both nice and 
 fairness at the same time.  As I understand the basic principle behind 
 this scheduler is to work out a time by which a task should make it onto 
 the CPU and then place it into an ordered list (based on this value) of 
 tasks waiting for the CPU.  I think that this is a great idea and my 
 suggestion is with regard to a method for working out this time that 
 takes into account both fairness and nice.

Hmm. Let me take a look...


On Fri, Apr 20, 2007 at 10:10:45AM +1000, Peter Williams wrote:
 First suppose we have the following metrics available in addition to 
 what's already provided.
 rq->avg_weight_load /* a running average of the weighted load on the CPU */
 p->avg_cpu_per_cycle /* the average time in nsecs that p spends on the 
 CPU each scheduling cycle */

I'm suspicious of mean service times not paired with mean inter-arrival
times.


On Fri, Apr 20, 2007 at 10:10:45AM +1000, Peter Williams wrote:
 where a scheduling cycle for a task starts when it is placed on the 
 queue after waking or being preempted and ends when it is taken off the 
 CPU either voluntarily or after being preempted.  So 
 p->avg_cpu_per_cycle is just the average amount of time p spends on the 
 CPU each time it gets on to the CPU.  Sorry for the long explanation 
 here but I just wanted to make sure there was no chance that scheduling 
 cycle would be construed as some mechanism being imposed on the scheduler.)

I went and looked up priority queueing queueing theory garbage and
re-derived various things I needed. The basics check out. Probably no
one cares that I checked.


On Fri, Apr 20, 2007 at 10:10:45AM +1000, Peter Williams wrote:
 We can then define:
 effective_weighted_load = max(rq->raw_weighted_load, rq->avg_weighted_load)
 If p is just waking (i.e. it's not on the queue and its load_weight is 
 not included in rq->raw_weighted_load) and we need to queue it, we say 
 that the maximum time (in all fairness) that p should have to wait to 
 get onto the CPU is:
 expected_wait = p->avg_cpu_per_cycle * effective_weighted_load / 
 p->load_weight

This doesn't look right, probably because the scaling factor of
p->avg_cpu_per_cycle is the reciprocal of its additive contribution to
the ->avg_weight_load as opposed to a direct estimate of its initial
delay or waiting time before completing its current requested service.

p->load_weight/effective_weighted_load more properly represents an
entitlement to CPU bandwidth.
p->avg_cpu_per_cycle/(p->load_weight/effective_weighted_load)
would be more like the expected time spent on the runqueue (whether
waiting to run or actually running) for a time interval spent in a
runnable state and the expected time runnable and waiting to run in such
an interval would be
p->avg_cpu_per_cycle*(1-effective_weighted_load/p->load_weight),

Neither represents the initial delay between entering the runqueue and
first acquiring the CPU, but that's a bit hard to figure out without
deciding the scheduling policy up-front anyway.

This essentially doesn't look correct because while you want to enforce
the CPU bandwidth allocation, this doesn't have much to do with that
apart from the CPU bandwidth appearing as a term. It's more properly
a rate of service as opposed to a time at which anything should happen
or a number useful for predicting such. When service should begin more
properly depends on the other tasks in the system and a number of other
decisions that are part of the scheduling policy.

If you want to choose a quasi-inter-arrival time to achieve the
specified CPU bandwidth allocation, this would be it, but to use that
to actually enforce the CPU bandwidth allocation, you would need to
take into account the genuine inter-arrival time to choose an actual
time for service to begin. In other words, this should be a quota for
the task to have waited. If it's not waited long enough, then it should
be delayed by the difference to achieve the inter-arrival time you're
trying to enforce. If it's waited longer, it should be executed
sooner modulo other constraints, and perhaps even credited for future
scheduling cycles.
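
Written out as a sketch (hypothetical names; quota_ns is the
quasi-inter-arrival time discussed above):

/* enforce the chosen inter-arrival time against the time the task has
 * actually waited, banking any surplus wait as credit for future cycles */
static unsigned long long delay_before_service(unsigned long long quota_ns,
                                               unsigned long long waited_ns,
                                               unsigned long long *credit_ns)
{
        unsigned long long shortfall;

        if (waited_ns >= quota_ns) {
                *credit_ns += waited_ns - quota_ns;     /* waited too long */
                return 0;                               /* run immediately */
        }

        shortfall = quota_ns - waited_ns;
        if (*credit_ns >= shortfall) {
                *credit_ns -= shortfall;                /* spend old credit */
                return 0;
        }

        shortfall -= *credit_ns;
        *credit_ns = 0;
        return shortfall;                               /* delay this long */
}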

In order to partially avoid underallocating CPU bandwidth to p, one
should track the time last spent sleeping and do the following:
last_sleep = now - p->last_went_to_sleep;
wait = p->avg_cpu_per_cycle*effective_weighted_load/p->load_weight;
wait = wait > last_sleep ? wait - last_sleep : 0;

By and large the scheduler can deterministically choose waits on the
runqueue but it doesn't actually do that. Some additional corrections
for delays beyond those decided up-front while on the runqueue or
getting scheduled early on the runqueue may also help, though I'm not
as interested in them as I am the one for sleep:
last_wait = now - p->last_taken_off_the_cpu;
wait = 

Re: [patch] CFS scheduler, v3

2007-04-20 Thread Christoph Lameter
On Wed, 18 Apr 2007, William Lee Irwin III wrote:

 
 Mark the runqueues cacheline_aligned_in_smp to avoid false sharing.

False sharing for a per cpu data structure? Are we updating that 
structure from other processors?
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] CFS scheduler, v3

2007-04-20 Thread William Lee Irwin III
On Wed, 18 Apr 2007, William Lee Irwin III wrote:
 
 Mark the runqueues cacheline_aligned_in_smp to avoid false sharing.

On Fri, Apr 20, 2007 at 12:24:17PM -0700, Christoph Lameter wrote:
 False sharing for a per cpu data structure? Are we updating that 
 structure from other processors?

Primarily in the load balancer, but also in wakeups.


-- wli
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] CFS scheduler, v3

2007-04-20 Thread Christoph Lameter
On Fri, 20 Apr 2007, William Lee Irwin III wrote:

 On Wed, 18 Apr 2007, William Lee Irwin III wrote:
  
  Mark the runqueues cacheline_aligned_in_smp to avoid false sharing.
 
 On Fri, Apr 20, 2007 at 12:24:17PM -0700, Christoph Lameter wrote:
  False sharing for a per cpu data structure? Are we updating that 
  structure from other processors?
 
 Primarily in the load balancer, but also in wakeups.

That is fairly rare I think. What other variables that are also written 
frequently would cause false sharing?


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] CFS scheduler, v3

2007-04-20 Thread William Lee Irwin III
On Fri, Apr 20, 2007 at 12:24:17PM -0700, Christoph Lameter wrote:
 False sharing for a per cpu data structure? Are we updating that 
 structure from other processors?

On Fri, 20 Apr 2007, William Lee Irwin III wrote:
 Primarily in the load balancer, but also in wakeups.

On Fri, Apr 20, 2007 at 12:33:13PM -0700, Christoph Lameter wrote:
 That is fairly rare I think. What other variables that are also written 
 frequently would cause false sharing?

I wrote that backward, sorry. Cross-CPU wakeups' frequency depends
heavily on the workload. Probably the only other case I can think of
is io_schedule() but that's not really significant.

I'm not really convinced it's all that worthwhile of an optimization,
essentially for the same reasons as you, but presumably there's a
benchmark result somewhere that says it matters. I've just not seen it.


-- wli
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] CFS scheduler, v3

2007-04-20 Thread Christoph Lameter
On Fri, 20 Apr 2007, William Lee Irwin III wrote:

 I'm not really convinced it's all that worthwhile of an optimization,
 essentially for the same reasons as you, but presumably there's a
 benchmark result somewhere that says it matters. I've just not seen it.

If it is true that we frequently remotely write the per cpu runqueue 
data then we may have a NUMA scalability issue.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] CFS scheduler, v3

2007-04-20 Thread William Lee Irwin III
On Fri, 20 Apr 2007, William Lee Irwin III wrote:
 I'm not really convinced it's all that worthwhile of an optimization,
 essentially for the same reasons as you, but presumably there's a
 benchmark result somewhere that says it matters. I've just not seen it.

On Fri, Apr 20, 2007 at 12:44:55PM -0700, Christoph Lameter wrote:
 If it is true that we frequently remotely write the per cpu runqueue 
 data then we may have a NUMA scalability issue.

From the discussion on Suresh's thread, it appears to have sped up a
database benchmark 0.5%.

Last I checked it was workload-dependent, but there were things that
hammer it. I mostly know of the remote wakeup issue, but there could
be other things besides wakeups that do it, too.


-- wli
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] CFS scheduler, v3

2007-04-20 Thread Peter Williams

William Lee Irwin III wrote:

On Fri, Apr 20, 2007 at 10:10:45AM +1000, Peter Williams wrote:
I have a suggestion I'd like to make that addresses both nice and 
fairness at the same time.  As I understand the basic principle behind 
this scheduler is to work out a time by which a task should make it onto 
the CPU and then place it into an ordered list (based on this value) of 
tasks waiting for the CPU.  I think that this is a great idea and my 
suggestion is with regard to a method for working out this time that 
takes into account both fairness and nice.


Hmm. Let me take a look...


On Fri, Apr 20, 2007 at 10:10:45AM +1000, Peter Williams wrote:
First suppose we have the following metrics available in addition to 
what's already provided.

rq->avg_weight_load /* a running average of the weighted load on the CPU */
p->avg_cpu_per_cycle /* the average time in nsecs that p spends on the 
CPU each scheduling cycle */


I'm suspicious of mean service times not paired with mean inter-arrival
times.


On Fri, Apr 20, 2007 at 10:10:45AM +1000, Peter Williams wrote:
where a scheduling cycle for a task starts when it is placed on the 
queue after waking or being preempted and ends when it is taken off the 
CPU either voluntarily or after being preempted.  So 
p->avg_cpu_per_cycle is just the average amount of time p spends on the 
CPU each time it gets on to the CPU.  Sorry for the long explanation 
here but I just wanted to make sure there was no chance that scheduling 
cycle would be construed as some mechanism being imposed on the scheduler.)


I went and looked up priority queueing queueing theory garbage and
re-derived various things I needed. The basics check out. Probably no
one cares that I checked.


On Fri, Apr 20, 2007 at 10:10:45AM +1000, Peter Williams wrote:

We can then define:
effective_weighted_load = max(rq->raw_weighted_load, rq->avg_weighted_load)
If p is just waking (i.e. it's not on the queue and its load_weight is 
not included in rq->raw_weighted_load) and we need to queue it, we say 
that the maximum time (in all fairness) that p should have to wait to 
get onto the CPU is:
expected_wait = p->avg_cpu_per_cycle * effective_weighted_load / 
p->load_weight


You're right.  The time that the task spent sleeping before being woken 
should be subtracted from this value.  If the answer is less than or 
equal to zero pre-emption should occur.




This doesn't look right, probably because the scaling factor of
p->avg_cpu_per_cycle is the reciprocal of its additive contribution to
the ->avg_weight_load as opposed to a direct estimate of its initial
delay or waiting time before completing its current requested service.

p->load_weight/effective_weighted_load more properly represents an
entitlement to CPU bandwidth.


Yes.  But expected_wait isn't entitlement, it's its inverse.


p->avg_cpu_per_cycle/(p->load_weight/effective_weighted_load)
would be more like the expected time spent on the runqueue


When I went to school that would be just another way of expressing the 
equation that I expressed.



(whether
waiting to run or actually running) for a time interval spent in a
runnable state and the expected time runnable and waiting to run in such
an interval would be
p->avg_cpu_per_cycle*(1-effective_weighted_load/p->load_weight),

Neither represents the initial delay between entering the runqueue and
first acquiring the CPU, but that's a bit hard to figure out without
deciding the scheduling policy up-front anyway.

This essentially doesn't look correct because while you want to enforce
the CPU bandwidth allocation, this doesn't have much to do with that
apart from the CPU bandwidth appearing as a term. It's more properly
a rate of service as opposed to a time at which anything should happen
or a number useful for predicting such. When service should begin more
properly depends on the other tasks in the system and a number of other
decisions that are part of the scheduling policy.


This model takes all of those into consideration.  The idea is not just 
to predict but to use the calculated time to decide when to boot the 
current process off the CPU (if it doesn't leave voluntarily) and put 
this one on.  This more or less removes the need to give each task a 
predetermined chunk of CPU when they go on to the CPU.  This should, in 
general, reduce the number of context switches as tasks get to run until 
they've finished what they're doing or another task becomes higher 
priority rather than being booted off after an arbitrary time interval. 
 (If this ever gets tried it will be interesting to see if this 
prediction comes true.)


BTW Even if Ingo doesn't choose to try this model, I'll probably make a 
patch (way in the future after Ingo's changes are settled) to try it out 
myself.
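
For illustration only, the wake-up bookkeeping sketched in this exchange might
look roughly like the following in plain C; every field and function name here
is invented, and none of it comes from a posted patch:

#include <stdint.h>

/* Invented types standing in for the rq/task_struct fields discussed above. */
struct sketch_rq {
	uint64_t raw_weighted_load;
	uint64_t avg_weighted_load;	/* running average of the above */
};

struct sketch_task {
	uint64_t avg_cpu_per_cycle;	/* ns on the CPU per scheduling cycle */
	uint64_t load_weight;		/* assumed non-zero */
	uint64_t expected_on_cpu;	/* ns, absolute: when p should get the CPU */
};

/* Called when p wakes after sleeping for slept_ns, just before queueing it. */
static void sketch_enqueue_waking(struct sketch_rq *rq, struct sketch_task *p,
				  uint64_t now_ns, uint64_t slept_ns)
{
	uint64_t load = rq->raw_weighted_load > rq->avg_weighted_load ?
			rq->raw_weighted_load : rq->avg_weighted_load;
	uint64_t wait = p->avg_cpu_per_cycle * load / p->load_weight;

	/* Subtract the time already spent sleeping; if nothing is left,
	 * the waker is due immediately. */
	wait = wait > slept_ns ? wait - slept_ns : 0;
	p->expected_on_cpu = now_ns + wait;
}

/* Preemption is justified if the queued task is due before the running one. */
static int sketch_should_preempt(const struct sketch_task *queued,
				 const struct sketch_task *running)
{
	return queued->expected_on_cpu < running->expected_on_cpu;
}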




If you want to choose a quasi-inter-arrival time to achieve the
specified CPU bandwidth allocation, this would be it, but to use that
to actually enforce the CPU bandwidth allocation, you would need to
take into 

Re: [patch] CFS scheduler, v3

2007-04-20 Thread William Lee Irwin III
William Lee Irwin III wrote:
 This essentially doesn't look correct because while you want to enforce
 the CPU bandwidth allocation, this doesn't have much to do with that
 apart from the CPU bandwidth appearing as a term. It's more properly
 a rate of service as opposed to a time at which anything should happen
 or a number useful for predicting such. When service should begin more
 properly depends on the other tasks in the system and a number of other
 decisions that are part of the scheduling policy.

On Sat, Apr 21, 2007 at 10:23:07AM +1000, Peter Williams wrote:
 This model takes all of those into consideration.  The idea is not just 
 to predict but to use the calculated time to decide when to boot the 
 current process off the CPU (if it doesn't leave voluntarily) and put 
 this one on.  This more or less removes the need to give each task a 
 predetermined chunk of CPU when they go on to the CPU.  This should, in 
 general, reduce the number of context switches as tasks get to run until 
 they've finished what they're doing or another task becomes higher 
 priority rather than being booted off after an arbitrary time interval. 
  (If this ever gets tried it will be interesting to see if this 
 prediction comes true.)
 BTW Even if Ingo doesn't choose to try this model, I'll probably make a 
 patch (way in the future after Ingo's changes are settled) to try it out 
 myself.

I think I smoked out what you were doing.


William Lee Irwin III wrote:
 If you want to choose a quasi-inter-arrival time to achieve the
 specified CPU bandwidth allocation, this would be it, but to use that
 to actually enforce the CPU bandwidth allocation, you would need to
 take into account the genuine inter-arrival time to choose an actual
 time for service to begin. In other words, this should be a quota for
 the task to have waited. If it's not waited long enough, then it should
 be delayed by the difference to achieve the inter-arrival time you're
 trying to enforce. If it's waited longer, it should be executed
 sooner modulo other constraints, and perhaps even credited for future
 scheduling cycles.
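
Purely as an illustration of the quota idea quoted above (invented names, not
from any posted patch), a sketch of that decision might read:

#include <stdint.h>

/* Sketch of the "waiting quota" idea quoted above; all names are invented. */
struct quota_task {
	int64_t wait_quota;	/* ns the task should have waited this cycle */
	int64_t waited;		/* ns it has actually waited so far */
	int64_t credit;		/* excess waiting carried to later cycles */
};

/*
 * Returns how much longer the task should be delayed before service
 * begins; zero means it may run now, with any excess waiting time
 * carried forward as credit.
 */
static int64_t remaining_delay(struct quota_task *p)
{
	int64_t diff = p->wait_quota - p->waited;

	if (diff > 0)
		return diff;		/* hasn't waited long enough yet */
	p->credit += -diff;		/* waited longer than required */
	return 0;
}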

On Sat, Apr 21, 2007 at 10:23:07AM +1000, Peter Williams wrote:
 The idea isn't to enforce the bandwidth entitlement to the extent of 
 throttling tasks if they exceed their entitlement and there's no other 
 tasks ready to use the CPU.  This is mainly because the bandwidth 
 entitlement isn't fixed -- it's changing constantly as the number and 
 type of runnable tasks changes.

Well, a little hysteresis will end up throttling in such a manner
anyway as a side-effect, or you'll get anomalies. Say two tasks with
equal entitlements compete, where one sleeps for 1/3 of the time and
the other is fully CPU-bound. If only the times when they're in direct
competition are split 50/50, then the CPU-bound task gets 2/3 and the
sleeper 1/3, which is not the intended effect. I don't believe this
model will be very vulnerable to it, though.


William Lee Irwin III wrote:
 In order to partially avoid underallocating CPU bandwidth to p, one
 should track the time last spent sleeping and do the following:

On Sat, Apr 21, 2007 at 10:23:07AM +1000, Peter Williams wrote:
 Yes I made a mistake in omitting to take into account sleep interval. 
 See another e-mail to Ingo correcting this problem.

I took it to be less trivial of an error than it was. No big deal.


William Lee Irwin III wrote:
 In order to do better, longer-term history is required,

On Sat, Apr 21, 2007 at 10:23:07AM +1000, Peter Williams wrote:
 The half life of the Kalman filter (roughly equivalent to a running 
 average) used to calculate the averages determines how much history is 
 taken into account.  It could be made configurable (at least, until 
 enough real life experience was available to decide on the best value to 
 use).

A Kalman filter would do better than a running average. I'm all for it.


William Lee Irwin III wrote:
 To attempt to maintain an infinite history of
 bandwidth underutilization to be credited too far in the future would
 enable potentially long-term overutilization when such credits are
 cashed en masse for a sustained period of time. At some point you have
 to say use it or lose it; over a shorter period of time some smoothing
 is still admissible and even desirable.

On Sat, Apr 21, 2007 at 10:23:07AM +1000, Peter Williams wrote:
 Yes, that's why I suggest a running average over the last few scheduling 
 cycles for the task.  But thinking about it some more I'm now not so 
 sure.  The lack of apparent smoothness when I've done this sort of 
 thing with raw rather than smooth data (in modifications to the current 
 dynamic priority based scheduler model) is usually noticed by running 
 top and seeing wildly fluctuating dynamic priorities.  I'm not sure that 
 the actual responsiveness of the system reflects this.  So I'm now 
 willing to reserve my judgement on this issue.

I'm thinking smoothing should be over a relatively short period of time,

Re: [patch] CFS scheduler, v3

2007-04-20 Thread Peter Williams

William Lee Irwin III wrote:

William Lee Irwin III wrote:

This essentially doesn't look correct because while you want to enforce
the CPU bandwidth allocation, this doesn't have much to do with that
apart from the CPU bandwidth appearing as a term. It's more properly
a rate of service as opposed to a time at which anything should happen
or a number useful for predicting such. When service should begin more
properly depends on the other tasks in the system and a number of other
decisions that are part of the scheduling policy.


On Sat, Apr 21, 2007 at 10:23:07AM +1000, Peter Williams wrote:
This model takes all of those into consideration.  The idea is not just 
to predict but to use the calculated time to decide when to boot the 
current process off the CPU (if it doesn't leave voluntarily) and put 
this one on.  This more or less removes the need to give each task a 
predetermined chunk of CPU when they go on to the CPU.  This should, in 
general, reduce the number of context switches as tasks get to run until 
they've finished what they're doing or another task becomes higher 
priority rather than being booted off after an arbitrary time interval. 
 (If this ever gets tried it will be interesting to see if this 
prediction comes true.)
BTW Even if Ingo doesn't choose to try this model, I'll probably make a 
patch (way in the future after Ingo's changes are settled) to try it out 
myself.


I think I smoked out what you were doing.


William Lee Irwin III wrote:

If you want to choose a quasi-inter-arrival time to achieve the
specified CPU bandwidth allocation, this would be it, but to use that
to actually enforce the CPU bandwidth allocation, you would need to
take into account the genuine inter-arrival time to choose an actual
time for service to begin. In other words, this should be a quota for
the task to have waited. If it's not waited long enough, then it should
be delayed by the difference to achieve the inter-arrival time you're
trying to enforce. If it's waited longer, it should be executed
sooner modulo other constraints, and perhaps even credited for future
scheduling cycles.


On Sat, Apr 21, 2007 at 10:23:07AM +1000, Peter Williams wrote:
The idea isn't to enforce the bandwidth entitlement to the extent of 
throttling tasks if they exceed their entitlement and there's no other 
tasks ready to use the CPU.  This is mainly because the bandwidth 
entitlement isn't fixed -- it's changing constantly as the number and 
type of runnable tasks changes.


Well, a little hysteresis will end up throttling in such a manner
anyway as a side-effect,


Think of this as a calming influence :-)


or you'll get anomalies. Say two tasks with
equal entitlements compete, where one sleeps for 1/3 of the time and
the other is fully CPU-bound. If only the times when they're in direct
competition are split 50/50, then the CPU-bound task gets 2/3 and the
sleeper 1/3, which is not the intended effect. I don't believe this
model will be very vulnerable to it, though.


Nor me.




William Lee Irwin III wrote:

In order to partially avoid underallocating CPU bandwidth to p, one
should track the time last spent sleeping and do the following:


On Sat, Apr 21, 2007 at 10:23:07AM +1000, Peter Williams wrote:
Yes I made a mistake in omitting to take into account sleep interval. 
See another e-mail to Ingo correcting this problem.


I took it to be less trivial of an error than it was. No big deal.


No, you were right: it was definitely a NON trivial error.




William Lee Irwin III wrote:

In order to do better, longer-term history is required,


On Sat, Apr 21, 2007 at 10:23:07AM +1000, Peter Williams wrote:
The half life of the Kalman filter (roughly equivalent to a running 
average) used to calculate the averages determines how much history is 
taken into account.  It could be made configurable (at least, until 
enough real life experience was available to decide on the best value to 
use).


A Kalman filter would do better than a running average. I'm all for it.


As a long time user of Kalman filters I tend to think of them as the 
same thing.  I use the term running average when talking about the idea 
behind a scheduler because I think that more people will understand what 
the general idea is.  When it comes to implementation I always replace 
the idea of running average with a roughly equivalent Kalman filter.
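
As a concrete illustration (not taken from any posted code), a fixed-gain
filter of the sort being discussed, which is roughly what a Kalman filter
reduces to once its gain has settled to a constant, can be updated with a
couple of integer operations; the shift width below is an arbitrary choice:

#include <stdint.h>

/*
 * Sketch of a fixed-gain decaying average, standing in for the "running
 * average / roughly equivalent Kalman filter" being discussed.  AVG_SHIFT
 * sets the effective half-life; the value 3 is purely illustrative.
 */
#define AVG_SHIFT 3

static inline uint64_t decay_avg(uint64_t avg, uint64_t sample)
{
	/* avg += (sample - avg) / 2^AVG_SHIFT, done in integer arithmetic */
	return avg - (avg >> AVG_SHIFT) + (sample >> AVG_SHIFT);
}

/*
 * e.g. once per scheduling cycle:
 *	p->avg_cpu_per_cycle = decay_avg(p->avg_cpu_per_cycle, ns_this_cycle);
 * where p->avg_cpu_per_cycle is the hypothetical field from the proposal.
 */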





William Lee Irwin III wrote:

To attempt to maintain an infinite history of
bandwidth underutilization to be credited too far in the future would
enable potentially long-term overutilization when such credits are
cashed en masse for a sustained period of time. At some point you have
to say use it or lose it; over a shorter period of time some smoothing
is still admissible and even desirable.


On Sat, Apr 21, 2007 at 10:23:07AM +1000, Peter Williams wrote:
Yes, that's why I suggest a running average over the last few scheduling 
cycles for the task.  But thinking about it some more I'm now not so 
sure.  

Re: [patch] CFS scheduler, v3

2007-04-19 Thread Willy Tarreau
On Fri, Apr 20, 2007 at 10:10:45AM +1000, Peter Williams wrote:
> Ingo Molnar wrote:
> >
> > - bugfix: use constant offset factor for nice levels instead of 
> >   sched_granularity_ns. Thus nice levels work even if someone sets 
> >   sched_granularity_ns to 0. NOTE: nice support is still naive, i'll 
> >   address the many nice level related suggestions in -v4.
> 
> I have a suggestion I'd like to make that addresses both nice and 
> fairness at the same time.  As I understand the basic principle behind 
> this scheduler is to work out a time by which a task should make it onto 
> the CPU and then place it into an ordered list (based on this value) of 
> tasks waiting for the CPU.  I think that this is a great idea and my 
> suggestion is with regard to a method for working out this time that 
> takes into account both fairness and nice.
> 
> First suppose we have the following metrics available in addition to 
> what's already provided.
> 
> rq->avg_weight_load /* a running average of the weighted load on the CPU */
> p->avg_cpu_per_cycle /* the average time in nsecs that p spends on the 
> CPU each scheduling cycle */
> 
> where a scheduling cycle for a task starts when it is placed on the 
> queue after waking or being preempted and ends when it is taken off the 
> CPU either voluntarily or after being preempted.  So 
> p->avg_cpu_per_cycle is just the average amount of time p spends on the 
> CPU each time it gets on to the CPU.  Sorry for the long explanation 
> here but I just wanted to make sure there was no chance that "scheduling 
> cycle" would be construed as some mechanism being imposed on the scheduler.)
> 
> We can then define:
> 
> effective_weighted_load = max(rq->raw_weighted_load, rq->avg_weighted_load)
> 
> If p is just waking (i.e. it's not on the queue and its load_weight is 
> not included in rq->raw_weighted_load) and we need to queue it, we say 
> that the maximum time (in all fairness) that p should have to wait to 
> get onto the CPU is:
> 
> expected_wait = p->avg_cpu_per_cycle * effective_weighted_load / 
> p->load_weight
> 
> Calculating p->avg_cpu_per_cycle costs one add, one multiply and one 
> shift right per scheduling cycle of the task.  An additional cost is 
> that you need a shift right to get the nanosecond value from the value 
> stored in the task struct. (i.e. the above code is simplified to give 
> the general idea).  The average would be number of cycles based rather 
> than time based and (happily) this simplifies the calculations.
> 
> If the expected time to get onto the CPU (i.e. expected_wait plus the 
> current time) for p is earlier than the equivalent time for the 
> currently running task then preemption of that task would be justified.

I 100% agree on this method because I came to nearly the same conclusion on
paper about 1 year ago. What I'd like to add is that the expected wake up time
is not the most precise criterion for fairness. The expected completion
time is better. When you have one task t1 which is expected to run for T1
nanosecs and another task t2 which is expected to run for T2, what is
important for the user for fairness is when the task completes its work. If
t1 should wake up at time W1 and t2 at W2, then the list should be ordered
by comparing W1+T1 and W2+T2.

What I like with this method is that it remains fair with nice tasks because
in order to renice a task tN, you just have to change TN, and if it
has to run shorter, it can be executed before CPU hogs and stay there for a
very short time.

Also, I found that if we want to respect interactivity, we must conserve a
credit for each task. It is a bounded amount of CPU time left to be used. When
the task t3 has the right to use T3 nsecs, and wakes up at W3, if it does not
spend T3 nsec on the CPU, but only N3 < T3, then we have a credit C3 computed
like this :

  C3 = MIN(MAX_CREDIT, C3 + T3 - N3)

And if a CPU hog uses more than its assigned time slice due to scheduler
resolution, then C3 can become negative (and bounded too) :

  C3 = MAX(MIN_CREDIT, C3 + T3 - N3)

Now how is the credit used ? Simple: the credit is a part of a timeslice, so
it's systematically added to the computed timeslice when ordering the tasks.
So we indeed order tasks t1 and t2 by W1+T1+C1 and W2+T2+C2.

It means that an interactive task which has not eaten all of its timeslice will
accumulate time credit and have great chances of being able to run before others
if it wakes up again, and even use slightly more CPU time than others if it
has not used it before. Conversely, if a task eats too much CPU time, it will
be punished and wait longer than others, and run less, to compensate for the
advance it has taken.

Also, what I like with this method is that it can correctly handle the
fork/exit storms which quickly change the per-task allocated CPU time.
Upon fork() or exit(), it should not be too hard to readjust Tx and Cx
for each task and reorder them according to their new completion time.

I've not found a way to include a variable nr_running in the tree and
order the tasks 
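
For illustration only (this is not from any posted patch), the credit
bookkeeping and the W+T+C ordering key described above could be sketched in
plain C as follows; the bounds and all field names are invented:

#include <stdint.h>

/* Illustrative sketch of the credit scheme above; all names are invented. */
#define MAX_CREDIT	 20000000LL	/* +20 ms, arbitrary value for illustration */
#define MIN_CREDIT	-20000000LL	/* -20 ms, likewise arbitrary */

struct credit_task {
	int64_t credit;		/* C: bounded CPU-time credit, in ns */
	int64_t expected_run;	/* T: expected CPU time needed, in ns */
	int64_t wakeup_time;	/* W: wake-up time, in ns (absolute) */
};

/* Called when the task leaves the CPU having used "used" ns of its "T" ns. */
static void credit_update(struct credit_task *t, int64_t T, int64_t used)
{
	int64_t c = t->credit + T - used;

	if (c > MAX_CREDIT)		/* C = MIN(MAX_CREDIT, C + T - N) */
		c = MAX_CREDIT;
	if (c < MIN_CREDIT)		/* C = MAX(MIN_CREDIT, C + T - N) */
		c = MIN_CREDIT;
	t->credit = c;
}

/* Tasks would be ordered by expected completion time, i.e. W + T + C. */
static int64_t completion_key(const struct credit_task *t)
{
	return t->wakeup_time + t->expected_run + t->credit;
}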


Re: [patch] CFS scheduler, v3

2007-04-19 Thread Peter Williams

Ingo Molnar wrote:


 - bugfix: use constant offset factor for nice levels instead of 
   sched_granularity_ns. Thus nice levels work even if someone sets 
   sched_granularity_ns to 0. NOTE: nice support is still naive, i'll 
   address the many nice level related suggestions in -v4.


I have a suggestion I'd like to make that addresses both nice and 
fairness at the same time.  As I understand the basic principle behind 
this scheduler is to work out a time by which a task should make it onto 
the CPU and then place it into an ordered list (based on this value) of 
tasks waiting for the CPU.  I think that this is a great idea and my 
suggestion is with regard to a method for working out this time that 
takes into account both fairness and nice.


First suppose we have the following metrics available in addition to 
what's already provided.


rq->avg_weight_load /* a running average of the weighted load on the CPU */
p->avg_cpu_per_cycle /* the average time in nsecs that p spends on the 
CPU each scheduling cycle */


where a scheduling cycle for a task starts when it is placed on the 
queue after waking or being preempted and ends when it is taken off the 
CPU either voluntarily or after being preempted.  So 
p->avg_cpu_per_cycle is just the average amount of time p spends on the 
CPU each time it gets on to the CPU.  Sorry for the long explanation 
here but I just wanted to make sure there was no chance that "scheduling 
cycle" would be construed as some mechanism being imposed on the scheduler.)


We can then define:

effective_weighted_load = max(rq->raw_weighted_load, rq->avg_weighted_load)

If p is just waking (i.e. it's not on the queue and its load_weight is 
not included in rq->raw_weighted_load) and we need to queue it, we say 
that the maximum time (in all fairness) that p should have to wait to 
get onto the CPU is:


expected_wait = p->avg_cpu_per_cycle * effective_weighted_load / 
p->load_weight


Calculating p->avg_cpu_per_cycle costs one add, one multiply and one 
shift right per scheduling cycle of the task.  An additional cost is 
that you need a shift right to get the nanosecond value from the value 
stored in the task struct. (i.e. the above code is simplified to give 
the general idea).  The average would be number of cycles based rather 
than time based and (happily) this simplifies the calculations.


If the expected time to get onto the CPU (i.e. expected_wait plus the 
current time) for p is earlier than the equivalent time for the 
currently running task then preemption of that task would be justified.


I appreciate that the notion of basing the expected wait on the task's 
average cpu use per scheduling cycle is counter intuitive but I believe 
that (if you think about it) you'll see that it actually makes sense.


Peter
PS Some reordering of calculation order within the expressions might be 
in order to keep them within the range of 32 bit arithmetic and so avoid 
64 bit arithmetic on 32 bit machines.

--
Peter Williams   [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] CFS scheduler, v3

2007-04-18 Thread Ingo Molnar

* William Lee Irwin III <[EMAIL PROTECTED]> wrote:

> It appears to me that the following can be taken in for mainline (or 
> rejected for mainline) independently of the rest of the cfs patch.

yeah - it's a patch written by Suresh, and this should already be in the 
for-v2.6.22 -mm queue. See:

  Subject: [patch] sched: align rq to cacheline boundary

on lkml.

Ingo
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] CFS scheduler, v3

2007-04-18 Thread William Lee Irwin III
On Wed, Apr 18, 2007 at 07:50:17PM +0200, Ingo Molnar wrote:
> this is the third release of the CFS patchset (against v2.6.21-rc7), and 
> can be downloaded from:
>    http://redhat.com/~mingo/cfs-scheduler/
> this is a pure "fix reported regressions" release so there's much less 
> churn:
>    5 files changed, 71 insertions(+), 29 deletions(-)
> (the added lines are mostly debug related, not genuine increase in the 
> scheduler's size)

It appears to me that the following can be taken in for mainline
(or rejected for mainline) independently of the rest of the cfs patch.


-- wli

Mark the runqueues cacheline_aligned_in_smp to avoid false sharing.

Index: sched/kernel/sched.c
===================================================================
--- sched.orig/kernel/sched.c   2007-04-18 14:10:03.593207728 -0700
+++ sched/kernel/sched.c        2007-04-18 14:11:39.270660075 -0700
@@ -278,7 +278,7 @@
struct lock_class_key rq_lock_key;
 };
 
-static DEFINE_PER_CPU(struct rq, runqueues);
+static DEFINE_PER_CPU(struct rq, runqueues) cacheline_aligned_in_smp;
 
 static inline int cpu_of(struct rq *rq)
 {
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

