Re: [RFC] scheduler: improve SMP fairness in CFS

2007-07-28 Thread Chris Snook

Tong Li wrote:
Without the global locking, the global synchronization here is simply 
ping-ponging a cache line once in a while. This doesn't look expensive to 
me, but if benchmarking shows that it is, adjusting sysctl_base_round_slice 
can reduce the ping-pong frequency. There might also be a smarter 
implementation that can alleviate this problem.


Scaling it proportionally to migration cost and log2(cpus) should suffice.
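To make that concrete, here is a minimal sketch of what such scaling could look like. The helper name and the use of a measured migration cost as an input are assumptions for illustration, not part of any existing patch:

#include <linux/log2.h>
#include <linux/cpumask.h>

/*
 * Hypothetical helper: widen the tolerated unfairness (here, the base
 * round slice) in proportion to the migration cost and to log2 of the
 * number of online CPUs, so the error bound tracks the memory hierarchy
 * rather than the number of tasks.
 */
static unsigned long dwrr_scaled_round_slice(unsigned long base_slice_ns,
					     unsigned long migration_cost_ns)
{
	unsigned int cpus = num_online_cpus();
	unsigned int factor = cpus > 1 ? ilog2(cpus) : 1;

	return base_slice_ns + factor * migration_cost_ns;
}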

I don't understand why quantizing CPU time is a bad thing. Could you 
educate me on this?


It depends on how precisely you do it.  We save a lot of power going 
tickless.  If round expiration is re-introducing ticks on idle CPUs, we 
could waste a lot of power.  Hardware is getting even more aggressive 
about power saving, to the point of allowing individual cores to be 
completely powered off when idle.  We need to make sure the scheduler 
doesn't interfere with power management.


-- Chris


Re: [RFC] scheduler: improve SMP fairness in CFS

2007-07-28 Thread Chris Snook

Tong Li wrote:

On Fri, 27 Jul 2007, Chris Snook wrote:


Bill Huey (hui) wrote:
You have to consider the target for this kind of code. There are applications 
where you need something that falls within a constant error bound. According 
to the numbers, the current CFS rebalancing logic doesn't achieve that to 
any degree of rigor. So CFS is ok for SCHED_OTHER, but not for anything more 
strict than that.


I've said from the beginning that I think that anyone who desperately 
needs perfect fairness should be explicitly enforcing it with the aid 
of realtime priorities.  The problem is that configuring and tuning a 
realtime application is a pain, and people want to be able to 
approximate this behavior without doing a whole lot of dirty work 
themselves.  I believe that CFS can and should be enhanced to ensure 
SMP-fairness over potentially short, user-configurable intervals, even 
for SCHED_OTHER.  I do not, however, believe that we should take it to 
the extreme of wasting CPU cycles on migrations that will not improve 
performance for *any* task, just to avoid letting some tasks get ahead 
of others.  We should be as fair as possible but no fairer.  If we've 
already made it as fair as possible, we should account for the margin 
of error and correct for it the next time we rebalance.  We should not 
burn the surplus just to get rid of it.


Proportional-share scheduling actually has one of its roots in real-time 
and having a p-fair scheduler is essential for real-time apps (soft 
real-time).


Sounds like another scheduler class might be in order.  I find CFS to be 
fair enough for most purposes.  If the code that gives us near-perfect 
fairness at the expense of efficiency only runs when tasks have been 
given boosted priority by a privileged user, and only on the CPUs that 
have such tasks queued on them, the run time overhead and code 
complexity become much smaller concerns.




On a non-NUMA box with single-socket, non-SMT processors, a constant 
error bound is fine.  Once we add SMT, go multi-core, go NUMA, and add 
inter-chassis interconnects on top of that, we need to multiply this 
error bound at each stage in the hierarchy, or else we'll end up 
wasting CPU cycles on migrations that actually hurt the processes 
they're supposed to be helping, and hurt everyone else even more.  I 
believe we should enforce an error bound that is proportional to 
migration cost.




I think we are actually in agreement. When I say constant bound, it can 
certainly be a constant that's determined based on inputs from the 
memory hierarchy. The point is that it needs to be a constant 
independent of things like # of tasks.


Agreed.

But this patch is only relevant to SCHED_OTHER.  The realtime 
scheduler doesn't have a concept of fairness, just priorities.  That 
why each realtime priority level has its own separate runqueue.  
Realtime schedulers are supposed to be dumb as a post, so they cannot 
heuristically decide to do anything other than precisely what you 
configured them to do, and so they don't get in the way when you're 
context switching a million times a second.


Are you referring to hard real-time? As I said, an infrastructure that 
enables p-fair scheduling, EDF, or things alike is the foundation for 
real-time. I designed DWRR, however, with a target of non-RT apps, 
although I was hoping the research results might be applicable to RT.


I'm referring to the static priority SCHED_FIFO and SCHED_RR schedulers, 
which are (intentionally) dumb as a post, allowing userspace to manage 
CPU time explicitly.  Proportionally fair scheduling is a cool 
capability, but not a design goal of those schedulers.


-- Chris


Re: [RFC] scheduler: improve SMP fairness in CFS

2007-07-28 Thread Tong Li

On Fri, 27 Jul 2007, Chris Snook wrote:


Bill Huey (hui) wrote:
You have to consider the target for this kind of code. There are applications 
where you need something that falls within a constant error bound. According 
to the numbers, the current CFS rebalancing logic doesn't achieve that to 
any degree of rigor. So CFS is ok for SCHED_OTHER, but not for anything more 
strict than that.


I've said from the beginning that I think that anyone who desperately needs 
perfect fairness should be explicitly enforcing it with the aid of realtime 
priorities.  The problem is that configuring and tuning a realtime 
application is a pain, and people want to be able to approximate this 
behavior without doing a whole lot of dirty work themselves.  I believe that 
CFS can and should be enhanced to ensure SMP-fairness over potentially short, 
user-configurable intervals, even for SCHED_OTHER.  I do not, however, 
believe that we should take it to the extreme of wasting CPU cycles on 
migrations that will not improve performance for *any* task, just to avoid 
letting some tasks get ahead of others.  We should be as fair as possible but 
no fairer.  If we've already made it as fair as possible, we should account 
for the margin of error and correct for it the next time we rebalance.  We 
should not burn the surplus just to get rid of it.


Proportional-share scheduling actually has one of its roots in real-time 
and having a p-fair scheduler is essential for real-time apps (soft 
real-time).




On a non-NUMA box with single-socket, non-SMT processors, a constant error 
bound is fine.  Once we add SMT, go multi-core, go NUMA, and add 
inter-chassis interconnects on top of that, we need to multiply this error 
bound at each stage in the hierarchy, or else we'll end up wasting CPU cycles 
on migrations that actually hurt the processes they're supposed to be 
helping, and hurt everyone else even more.  I believe we should enforce an 
error bound that is proportional to migration cost.




I think we are actually in agreement. When I say constant bound, it can 
certainly be a constant that's determined based on inputs from the memory 
hierarchy. The point is that it needs to be a constant independent of 
things like # of tasks.


But this patch is only relevant to SCHED_OTHER.  The realtime scheduler 
doesn't have a concept of fairness, just priorities.  That's why each realtime 
priority level has its own separate runqueue.  Realtime schedulers are 
supposed to be dumb as a post, so they cannot heuristically decide to do 
anything other than precisely what you configured them to do, and so they 
don't get in the way when you're context switching a million times a second.


Are you referring to hard real-time? As I said, an infrastructure that 
enables p-fair scheduling, EDF, or things alike is the foundation for 
real-time. I designed DWRR, however, with a target of non-RT apps, 
although I was hoping the research results might be applicable to RT.


  tong


Re: [RFC] scheduler: improve SMP fairness in CFS

2007-07-28 Thread Tong Li

On Fri, 27 Jul 2007, Chris Snook wrote:

I don't think that achieving a constant error bound is always a good thing. 
We all know that fairness has overhead.  If I have 3 threads and 2 
processors, and I have a choice between fairly giving each thread 1.0 billion 
cycles during the next second, or unfairly giving two of them 1.1 billion 
cycles and giving the other 0.9 billion cycles, then we can have a useful 
discussion about where we want to draw the line on the fairness/performance 
tradeoff.  On the other hand, if we can give two of them 1.1 billion cycles 
and still give the other one 1.0 billion cycles, it's madness to waste those 
0.2 billion cycles just to avoid user jealousy.  The more complex the memory 
topology of a system, the more "free" cycles you'll get by tolerating 
short-term unfairness.  As a crude heuristic, scaling some fairly low 
tolerance by log2(NCPUS) seems appropriate, but eventually we should take the 
boot-time computed migration costs into consideration.


I think we are in agreement. To avoid confusion, I think we should be more 
precise on what fairness means. Lag (i.e., ideal fair time - actual 
service time) is the commonly used metric for fairness. The definition is 
that a scheduler is proportionally fair if for any task in any time 
interval, the task's lag is bounded by a constant (note it's in terms of 
absolute time). The knob here is this constant, which can help trade off 
performance and fairness. The reason for a constant bound is that we want 
consistent fairness properties regardless of the number of tasks. For 
example, we don't want the system to be much less fair as the number of 
tasks increases. With DWRR, the lag bound is the max weight of currently 
running tasks, multiplied by sysctl_base_round_slice. So if all tasks are 
of nice 0, i.e., weight 1, and sysctl_base_round_slice equals 30 ms, then 
we are guaranteed each task is at most 30ms off of the ideal case. This is 
a useful property. Just like what you mentioned about the migration cost, 
this property allows the scheduler or user to accurately reason about the 
tradeoffs. If we want to trade fairness for performance, we can increase 
sysctl_base_round_slice to, say, 100ms; doing so we also know accurately 
the worst impact it has on fairness.
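As an illustration of that bound, here is a small sketch (the time unit and helper name are assumptions, not the patch's actual code):

#include <linux/types.h>

/*
 * Sketch: the DWRR lag bound described above.  max_weight is the
 * largest weight among currently running tasks (1 for nice 0 in this
 * description) and base_round_slice is the per-round slice a
 * unit-weight task receives, e.g. 30 ms.
 */
static inline u64 dwrr_lag_bound(unsigned int max_weight, u64 base_round_slice)
{
	/* weight 1 and a 30 ms slice give a 30 ms worst-case lag */
	return (u64)max_weight * base_round_slice;
}

Doubling the round slice doubles this worst-case lag but halves how often the round boundary needs to be synchronized globally, which is exactly the tradeoff being described.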


Adding system calls, while great for research, is not something which is done 
lightly in the published kernel.  If we're going to implement a user 
interface beyond simply interpreting existing priorities more precisely, it 
would be nice if this was part of a framework with a broader vision, such as 
a scheduler economy.


Agreed. I've seen papers on scheduler economies, but I'm not familiar enough 
with them to comment.





Scheduling Algorithm:

The scheduler keeps a set of data structures, called Trio groups, to maintain 
the weight or reservation of each thread group (including one or more 
threads) and the local weight of each member thread. When scheduling a 
thread, it consults these data structures and computes (in constant time) a 
system-wide weight for the thread that represents an equivalent CPU share. 
Consequently, the scheduling algorithm, DWRR, operates solely based on the 
system-wide weight (or weight for short, hereafter) of each thread. Having 
a flat space of system-wide weights for individual threads avoids 
performing separate scheduling at each level of the group hierarchy and 
thus greatly simplifies the implementation for group scheduling.
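A rough sketch of that flattening step (the field names and the normalization are assumptions made for illustration; the actual Trio data structures are not shown in this excerpt):

/*
 * Sketch: collapse a two-level (group, thread) weight hierarchy into a
 * single system-wide weight so DWRR only ever sees flat weights.
 */
struct trio_group {
	unsigned int group_weight;	 /* group's share of the system     */
	unsigned int total_local_weight; /* sum of members' local weights   */
};

struct trio_thread {
	struct trio_group *group;
	unsigned int local_weight;	 /* thread's share within the group */
};

static unsigned int trio_system_weight(const struct trio_thread *t)
{
	const struct trio_group *g = t->group;

	/* constant time: group share scaled by the thread's share of it */
	return g->group_weight * t->local_weight / g->total_local_weight;
}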


Implementing a flat weight space efficiently is nontrivial.  I'm curious to 
see how you reworked the original patch without global locking.


I simply removed the locking and changed idle_balance() a little bit. 
The lock was there to prevent a thread from reading or writing the global 
highest round value while another thread is writing to it. For writes, 
it's simple to ensure without locking that only one write takes effect when 
multiple writes are concurrent. For the case where one write is going 
on and multiple threads read, without locking, the only problem is that a 
reader may read a stale value and thus think the current highest round is 
X while it's actually X + 1. The end effect is that a thread can be at 
most two rounds behind the highest round. This changes DWRR's lag bound to 
2 * (max weight of current tasks) * sysctl_base_round_slice, which is 
still constant.
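A minimal sketch of such a lockless update, assuming the highest round fits in an atomic_t and is advanced with cmpxchg (the names are illustrative, not the actual patch):

#include <linux/atomic.h>	/* <asm/atomic.h> in the 2.6.23 era */

static atomic_t dwrr_highest_round = ATOMIC_INIT(0);

/*
 * Sketch: advance the global highest round to at least 'round'.
 * Concurrent writers race benignly: one cmpxchg wins per step and a
 * losing writer re-reads the new value.  A reader that sees a stale
 * value is at most one round behind, which is where the "at most two
 * rounds behind" lag argument above comes from.
 */
static void dwrr_update_highest_round(int round)
{
	int cur = atomic_read(&dwrr_highest_round);

	while (round > cur) {
		int old = atomic_cmpxchg(&dwrr_highest_round, cur, round);
		if (old == cur)
			break;		/* our write took effect */
		cur = old;		/* someone advanced it; re-check */
	}
}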


I had a feeling this patch was originally designed for the O(1) scheduler, 
and this is why.  The old scheduler had expired arrays, so adding a 
round-expired array wasn't a radical departure from the design.  CFS does not 
have an expired rbtree, so adding one *is* a radical departure from the 
design.  I think we can implement DWRR or something very similar without 
using this implementation method.  Since we've already got a tree of queued 
tasks, it might be easiest to basically break off one subtree (usually just 
one task, but not necessarily) and migrate it to a less loaded tree whenever 
we can reduce the difference between the load on the two trees by at least 
half.  This would prevent both overcorrection and undercorrection.
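A sketch of that rule, assuming per-runqueue load is already tracked as a weight sum (the function and parameter names are made up for illustration):

/*
 * Sketch: moving 'move_load' worth of tasks from the busier tree to the
 * less loaded one is worthwhile only if it cuts the imbalance at least
 * in half without flipping it past that point -- i.e. it avoids both
 * undercorrection and overcorrection.
 */
static int worth_migrating(unsigned long busy_load, unsigned long idle_load,
			   unsigned long move_load)
{
	long diff_before = (long)busy_load - (long)idle_load;
	long diff_after  = diff_before - 2 * (long)move_load;

	if (diff_before <= 0)
		return 0;	/* nothing to correct */

	if (diff_after < 0)
		diff_after = -diff_after;

	return diff_after <= diff_before / 2;
}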

Re: [RFC] scheduler: improve SMP fairness in CFS

2007-07-27 Thread Chris Snook

Bill Huey (hui) wrote:

On Fri, Jul 27, 2007 at 07:36:17PM -0400, Chris Snook wrote:
I don't think that achieving a constant error bound is always a good thing. 
 We all know that fairness has overhead.  If I have 3 threads and 2 
processors, and I have a choice between fairly giving each thread 1.0 
billion cycles during the next second, or unfairly giving two of them 1.1 
billion cycles and giving the other 0.9 billion cycles, then we can have a 
useful discussion about where we want to draw the line on the 
fairness/performance tradeoff.  On the other hand, if we can give two of 
them 1.1 billion cycles and still give the other one 1.0 billion cycles, 
it's madness to waste those 0.2 billion cycles just to avoid user jealousy. 
 The more complex the memory topology of a system, the more "free" cycles 
you'll get by tolerating short-term unfairness.  As a crude heuristic, 
scaling some fairly low tolerance by log2(NCPUS) seems appropriate, but 
eventually we should take the boot-time computed migration costs into 
consideration.


You have to consider the target for this kind of code. There are applications
where you need something that falls within a constant error bound. According
to the numbers, the current CFS rebalancing logic doesn't achieve that to
any degree of rigor. So CFS is ok for SCHED_OTHER, but not for anything more
strict than that.


I've said from the beginning that I think that anyone who desperately needs 
perfect fairness should be explicitly enforcing it with the aid of realtime 
priorities.  The problem is that configuring and tuning a realtime application 
is a pain, and people want to be able to approximate this behavior without doing 
a whole lot of dirty work themselves.  I believe that CFS can and should be 
enhanced to ensure SMP-fairness over potentially short, user-configurable 
intervals, even for SCHED_OTHER.  I do not, however, believe that we should take 
it to the extreme of wasting CPU cycles on migrations that will not improve 
performance for *any* task, just to avoid letting some tasks get ahead of 
others.  We should be as fair as possible but no fairer.  If we've already made 
it as fair as possible, we should account for the margin of error and correct 
for it the next time we rebalance.  We should not burn the surplus just to get 
rid of it.


On a non-NUMA box with single-socket, non-SMT processors, a constant error bound 
is fine.  Once we add SMT, go multi-core, go NUMA, and add inter-chassis 
interconnects on top of that, we need to multiply this error bound at each stage 
in the hierarchy, or else we'll end up wasting CPU cycles on migrations that 
actually hurt the processes they're supposed to be helping, and hurt everyone 
else even more.  I believe we should enforce an error bound that is proportional 
to migration cost.



Even the rt overload code (from my memory) is subject to these limitations
as well until it's moved to use a single global queue while using CPU
binding to turn off that logic. It's the price you pay for accuracy.

If we allow a little short-term fairness (and I think we should) we can 
still account for this unfairness and compensate for it (again, with the 
same tolerance) at the next rebalancing.


Again, it's a function of *when* and depends on that application.

Adding system calls, while great for research, is not something which is 
done lightly in the published kernel.  If we're going to implement a user 
interface beyond simply interpreting existing priorities more precisely, it 
would be nice if this was part of a framework with a broader vision, such 
as a scheduler economy.


I'm not sure what you mean by scheduler economy, but CFS can and should
be extended to handle proportional scheduling which is outside of the
traditional Unix priority semantics. Having a new API to get at this is
unavoidable if you want it to eventually support -rt oriented applications
that have bandwidth semantics.


A scheduler economy is basically a credit scheduler, augmented to allow 
processes to exchange credits with each other.  If you want to get more 
sophisticated with fairness, you could price CPU time proportional to load on 
that CPU.
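As a toy illustration of that pricing idea (purely hypothetical, not an existing kernel interface): a slice on a busier CPU costs more credits, so tasks that insist on a contended CPU pay for the privilege.

/*
 * Toy sketch of load-proportional pricing in a scheduler economy.
 * 'credits' would be a per-task or per-user budget; the price of a
 * slice scales with how loaded the chosen CPU is relative to the
 * system average.
 */
struct econ_task {
	long credits;
};

static int econ_charge_slice(struct econ_task *t, unsigned long cpu_load,
			     unsigned long avg_load, long base_price)
{
	long price = base_price;

	if (avg_load)
		price = base_price * (long)cpu_load / (long)avg_load;

	if (t->credits < price)
		return 0;	/* can't afford this CPU; look elsewhere */

	t->credits -= price;
	return 1;
}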


I've been house-hunting lately, so I like to think of it in real estate terms. 
If you're comfortable with your standard of living and you have enough money, 
you can rent the apartment in the chic part of town, right next to the subway 
station.  If you want to be more frugal because you're saving for retirement, 
you can get a place out in the suburbs, but the commute will be more of a pain. 
 If you can't make up your mind and keep moving back and forth, you spend a lot 
on moving and all your stuff gets dented and scratched.



All deadline based schedulers have API mechanisms like this to support
extended semantics. This is no different.

I had a feeling this patch was originally designed for the O(1) scheduler, 
and this is why.  The old scheduler had expired arrays, so adding a 
round-expired array wasn't a radical departure from the design.

Re: [RFC] scheduler: improve SMP fairness in CFS

2007-07-27 Thread hui
On Fri, Jul 27, 2007 at 07:36:17PM -0400, Chris Snook wrote:
> I don't think that achieving a constant error bound is always a good thing. 
>  We all know that fairness has overhead.  If I have 3 threads and 2 
> processors, and I have a choice between fairly giving each thread 1.0 
> billion cycles during the next second, or unfairly giving two of them 1.1 
> billion cycles and giving the other 0.9 billion cycles, then we can have a 
> useful discussion about where we want to draw the line on the 
> fairness/performance tradeoff.  On the other hand, if we can give two of 
> them 1.1 billion cycles and still give the other one 1.0 billion cycles, 
> it's madness to waste those 0.2 billion cycles just to avoid user jealousy. 
>  The more complex the memory topology of a system, the more "free" cycles 
> you'll get by tolerating short-term unfairness.  As a crude heuristic, 
> scaling some fairly low tolerance by log2(NCPUS) seems appropriate, but 
> eventually we should take the boot-time computed migration costs into 
> consideration.

You have to consider the target for this kind of code. There are applications
where you need something that falls within a constant error bound. According
to the numbers, the current CFS rebalancing logic doesn't achieve that to
any degree of rigor. So CFS is ok for SCHED_OTHER, but not for anything more
strict than that.

Even the rt overload code (from my memory) is subject to these limitations
as well until it's moved to use a single global queue while using CPU
binding to turn off that logic. It's the price you pay for accuracy.

> If we allow a little short-term fairness (and I think we should) we can 
> still account for this unfairness and compensate for it (again, with the 
> same tolerance) at the next rebalancing.

Again, it's a function of *when* and depends on that application.

> Adding system calls, while great for research, is not something which is 
> done lightly in the published kernel.  If we're going to implement a user 
> interface beyond simply interpreting existing priorities more precisely, it 
> would be nice if this was part of a framework with a broader vision, such 
> as a scheduler economy.

I'm not sure what you mean by scheduler economy, but CFS can and should
be extended to handle proportional scheduling which is outside of the
traditional Unix priority semantics. Having a new API to get at this is
unavoidable if you want it to eventually support -rt oriented applications
that have bandwidth semantics.

All deadline based schedulers have API mechanisms like this to support
extended semantics. This is no different.

> I had a feeling this patch was originally designed for the O(1) scheduler, 
> and this is why.  The old scheduler had expired arrays, so adding a 
> round-expired array wasn't a radical departure from the design.  CFS does 
> not have an expired rbtree, so adding one *is* a radical departure from the 
> design.  I think we can implement DWRR or something very similar without 
> using this implementation method.  Since we've already got a tree of queued 
> tasks, it might be easiest to basically break off one subtree (usually just 
> one task, but not necessarily) and migrate it to a less loaded tree 
> whenever we can reduce the difference between the load on the two trees by 
> at least half.  This would prevent both overcorrection and undercorrection.

> The idea of rounds was another implementation detail that bothered me.  In 
> the old scheduler, quantizing CPU time was a necessary evil.  Now that we 
> can account for CPU time with nanosecond resolution, doing things on an 
> as-needed basis seems more appropriate, and should reduce the need for 
> global synchronization.

Well, there's nanosecond resolution with no mechanism that exploits it for
rebalancing. Rebalancing in general is a pain and the code for it is
generally orthogonal to the in-core scheduler data structures that are in
use, so I don't understand the objection to this argument and the choice
of methods. If it gets the job done, then these kinds of choices don't
have that much meaning.

> In summary, I think the accounting is sound, but the enforcement is 
> sub-optimal for the new scheduler.  A revision of the algorithm more 
> cognizant of the capabilities and design of the current scheduler would 
> seem to be in order.

That would be nice. But the amount of error in Tong's solution is much
less than the current CFS logic as was previously tested even without
consideration to high resolution clocks.

So you have to give some kind of credit for that approach and recognize
that current methods in CFS are technically a dead end if there's a need for
strict fairness in a more rigorous run category than SCHED_OTHER.

> I've referenced many times my desire to account for CPU/memory hierarchy in 
> these patches.  At present, I'm not sure we have sufficient infrastructure 
> in the kernel to automatically optimize for system topology, but I think 
> whatever de

Re: [RFC] scheduler: improve SMP fairness in CFS

2007-07-27 Thread Chris Snook

Tong Li wrote:

On Fri, 27 Jul 2007, Chris Snook wrote:


Tong Li wrote:
I'd like to clarify that I'm not trying to push this particular code 
to the kernel. I'm a researcher. My intent was to point out that we 
have a problem in the scheduler and my dwrr algorithm can potentially 
help fix it. The patch itself was merely a proof-of-concept. I'd be 
thrilled if the algorithm can be proven useful in the real world. I 
appreciate the people who have given me comments. Since then, I've 
revised my algorithm/code. Now it doesn't require global locking but 
retains strong fairness properties (which I was able to prove 
mathematically).


Thanks for doing this work.  Please don't take the implementation 
criticism as a lack of appreciation for the work.  I'd like to see 
dwrr in the scheduler, but I'm skeptical that re-introducing expired 
runqueues is the most efficient way to do it.


Given the inherently controversial nature of scheduler code, 
particularly that which attempts to enforce fairness, perhaps a 
concise design document would help us come to an agreement about what 
we think the scheduler should do and what tradeoffs we're willing to 
make to do those things.  Do you have a design document we could discuss?


-- Chris



Thanks for the interest. Attached is a design doc I wrote several months 
ago (with small modifications). It talks about the two pieces of my 
design: group scheduling and dwrr. The description was based on the 
original O(1) scheduler, but as my CFS patch showed, the algorithm is 
applicable to other underlying schedulers as well. It's interesting that 
I started working on this in January for the purpose of eventually 
writing a paper about it. So I knew reasonably well the related research 
work but was totally unaware that people in the Linux community were 
also working on similar things. This is good. If you are interested, I'd 
like to help with the algorithms and theory side of things.


  tong

---
Overview:

Trio extends the existing Linux scheduler with support for 
proportional-share scheduling. It uses a scheduling algorithm, called 
Distributed Weighted Round-Robin (DWRR), which retains the existing 
scheduler design as much as possible, and extends it to achieve 
proportional fairness with O(1) time complexity and a constant error 
bound, compared to the ideal fair scheduling algorithm. The goal of Trio 
is not to improve interactive performance; rather, it relies on the 
existing scheduler for interactivity and extends it to support MP 
proportional fairness.


Trio has two unique features: (1) it enables users to control shares of 
CPU time for any thread or group of threads (e.g., a process, an 
application, etc.), and (2) it enables fair sharing of CPU time across 
multiple CPUs. For example, with ten tasks running on eight CPUs, Trio 
allows each task to take an equal fraction of the total CPU time. These 
features enable Trio to complement the existing Linux scheduler to 
enable greater user flexibility and stronger fairness.


Background:

Over the years, there has been a lot of criticism that conventional Unix 
priorities and the nice interface provide insufficient support for users 
to accurately control CPU shares of different threads or applications. 
Many have studied scheduling algorithms that achieve proportional 
fairness. Assuming that each thread has a weight that expresses its 
desired CPU share, informally, a scheduler is proportionally fair if (1) 
it is work-conserving, and (2) it allocates CPU time to threads in exact 
proportion to their weights in any time interval. Ideal proportional 
fairness is impractical since it requires that all runnable threads be 
running simultaneously and scheduled with infinitesimally small quanta. 
In practice, every proportional-share scheduling algorithm approximates 
the ideal algorithm with the goal of achieving a constant error bound. 
For more theoretical background, please refer to the following papers:


I don't think that achieving a constant error bound is always a good thing.  We 
all know that fairness has overhead.  If I have 3 threads and 2 processors, and 
I have a choice between fairly giving each thread 1.0 billion cycles during the 
next second, or unfairly giving two of them 1.1 billion cycles and giving the 
other 0.9 billion cycles, then we can have a useful discussion about where we 
want to draw the line on the fairness/performance tradeoff.  On the other hand, 
if we can give two of them 1.1 billion cycles and still give the other one 1.0 
billion cycles, it's madness to waste those 0.2 billion cycles just to avoid 
user jealousy.  The more complex the memory topology of a system, the more 
"free" cycles you'll get by tolerating short-term unfairness.  As a crude 
heuristic, scaling some fairly low tolerance by log2(NCPUS) seems appropriate, 
but eventually we should take the boot-time computed migration costs into 
consideration.



[1] A. K. Parekh and R. G. Gallager. A generalized processor sharing
approach to flow control in integrated services networks: The single-node
case. IEEE/ACM Transactions on Networking, 1(3):344-357, June 1993.

Re: [RFC] scheduler: improve SMP fairness in CFS

2007-07-27 Thread hui
On Fri, Jul 27, 2007 at 12:03:28PM -0700, Tong Li wrote:
> Thanks for the interest. Attached is a design doc I wrote several months 
> ago (with small modifications). It talks about the two pieces of my design: 
> group scheduling and dwrr. The description was based on the original O(1) 
> scheduler, but as my CFS patch showed, the algorithm is applicable to other 
> underlying schedulers as well. It's interesting that I started working on 
> this in January for the purpose of eventually writing a paper about it. So 
> I knew reasonably well the related research work but was totally unaware 
> that people in the Linux community were also working on similar things. 
> This is good. If you are interested, I'd like to help with the algorithms 
> and theory side of the things.

Tong,

This is sufficient as an overview of the algorithm, but I don't believe it's
detailed enough to be a discussable design doc. You should ask Chris what he
means by this.

Some examples of your rebalancing scheme and how your invariant applies
across processor rounds would be helpful for me and possibly others as well.

bill



Re: [RFC] scheduler: improve SMP fairness in CFS

2007-07-27 Thread Tong Li

On Fri, 27 Jul 2007, Chris Snook wrote:


Tong Li wrote:
I'd like to clarify that I'm not trying to push this particular code to the 
kernel. I'm a researcher. My intent was to point out that we have a problem 
in the scheduler and my dwrr algorithm can potentially help fix it. The 
patch itself was merely a proof-of-concept. I'd be thrilled if the 
algorithm can be proven useful in the real world. I appreciate the people 
who have given me comments. Since then, I've revised my algorithm/code. Now 
it doesn't require global locking but retains strong fairness properties 
(which I was able to prove mathematically).


Thanks for doing this work.  Please don't take the implementation criticism 
as a lack of appreciation for the work.  I'd like to see dwrr in the 
scheduler, but I'm skeptical that re-introducing expired runqueues is the 
most efficient way to do it.


Given the inherently controversial nature of scheduler code, particularly 
that which attempts to enforce fairness, perhaps a concise design document 
would help us come to an agreement about what we think the scheduler should 
do and what tradeoffs we're willing to make to do those things.  Do you have 
a design document we could discuss?


-- Chris



Thanks for the interest. Attached is a design doc I wrote several months 
ago (with small modifications). It talks about the two pieces of my 
design: group scheduling and dwrr. The description was based on the 
original O(1) scheduler, but as my CFS patch showed, the algorithm is 
applicable to other underlying schedulers as well. It's interesting that I 
started working on this in January for the purpose of eventually writing a 
paper about it. So I knew reasonably well the related research work but 
was totally unaware that people in the Linux community were also working 
on similar things. This is good. If you are interested, I'd like to help 
with the algorithms and theory side of things.


  tong

---
Overview:

Trio extends the existing Linux scheduler with support for 
proportional-share scheduling. It uses a scheduling algorithm, called 
Distributed Weighted Round-Robin (DWRR), which retains the existing 
scheduler design as much as possible, and extends it to achieve 
proportional fairness with O(1) time complexity and a constant error 
bound, compared to the ideal fair scheduling algorithm. The goal of Trio 
is not to improve interactive performance; rather, it relies on the 
existing scheduler for interactivity and extends it to support MP 
proportional fairness.


Trio has two unique features: (1) it enables users to control shares of 
CPU time for any thread or group of threads (e.g., a process, an 
application, etc.), and (2) it enables fair sharing of CPU time across 
multiple CPUs. For example, with ten tasks running on eight CPUs, Trio 
allows each task to take an equal fraction of the total CPU time. These 
features enable Trio to complement the existing Linux scheduler to enable 
greater user flexibility and stronger fairness.


Background:

Over the years, there has been a lot of criticism that conventional Unix 
priorities and the nice interface provide insufficient support for users 
to accurately control CPU shares of different threads or applications. 
Many have studied scheduling algorithms that achieve proportional 
fairness. Assuming that each thread has a weight that expresses its 
desired CPU share, informally, a scheduler is proportionally fair if (1) 
it is work-conserving, and (2) it allocates CPU time to threads in exact 
proportion to their weights in any time interval. Ideal proportional 
fairness is impractical since it requires that all runnable threads be 
running simultaneously and scheduled with infinitesimally small quanta. In 
practice, every proportional-share scheduling algorithm approximates the 
ideal algorithm with the goal of achieving a constant error bound. For 
more theoretical background, please refer to the following papers:


[1] A. K. Parekh and R. G. Gallager. A generalized processor sharing
approach to flow control in integrated services networks: The single-node
case. IEEE/ACM Transactions on Networking, 1(3):344-357, June 1993.

[2] J. C. R. Bennett and H. Zhang. WF2Q: Worst-case fair weighted fair 
queueing. In Proceedings of IEEE INFOCOM '96, pages 120-128, Mar. 1996.


Previous proportional-share scheduling algorithms, however, suffer one or 
more of the following problems:


(1) Inaccurate fairness with non-constant error bounds;
(2) High run-time overhead (e.g., logarithmic);
(3) Poor scalability due to the use of a global thread queue;
(4) Inefficient support for latency-sensitive applications.

Since the Linux scheduler has been successful at avoiding problems 2 to 4, 
this design attempts to extend it with support for accurate proportional 
fairness while retaining all of its existing benefits.


User Interface:

By default, each thread is assigned a weight proportional to its static priority.

Re: [RFC] scheduler: improve SMP fairness in CFS

2007-07-27 Thread Chris Snook

Tong Li wrote:
I'd like to clarify that I'm not trying to push this particular code to 
the kernel. I'm a researcher. My intent was to point out that we have a 
problem in the scheduler and my dwrr algorithm can potentially help fix 
it. The patch itself was merely a proof-of-concept. I'd be thrilled if 
the algorithm can be proven useful in the real world. I appreciate the 
people who have given me comments. Since then, I've revised my 
algorithm/code. Now it doesn't require global locking but retains strong 
fairness properties (which I was able to prove mathematically).


Thanks for doing this work.  Please don't take the implementation criticism as a 
lack of appreciation for the work.  I'd like to see dwrr in the scheduler, but 
I'm skeptical that re-introducing expired runqueues is the most efficient way to 
do it.


Given the inherently controversial nature of scheduler code, particularly that 
which attempts to enforce fairness, perhaps a concise design document would help 
us come to an agreement about what we think the scheduler should do and what 
tradeoffs we're willing to make to do those things.  Do you have a design 
document we could discuss?


-- Chris


Re: [RFC] scheduler: improve SMP fairness in CFS

2007-07-26 Thread Tong Li
I'd like to clarify that I'm not trying to push this particular code to 
the kernel. I'm a researcher. My intent was to point out that we have a 
problem in the scheduler and my dwrr algorithm can potentially help fix 
it. The patch itself was merely a proof-of-concept. I'd be thrilled if the 
algorithm can be proven useful in the real world. I appreciate the people 
who have given me comments. Since then, I've revised my algorithm/code. 
Now it doesn't require global locking but retains strong fairness 
properties (which I was able to prove mathematically).


Thank you,

  tong

On Thu, 26 Jul 2007, Li, Tong N wrote:


On Thu, 2007-07-26 at 23:31 +0200, Ingo Molnar wrote:

* Tong Li <[EMAIL PROTECTED]> wrote:


you need to measure it over longer periods of time. It's not worth
balancing for such a thing in any high-frequency manner. (we'd trash
the cache constantly migrating tasks back and forth.)


I have some data below, but before that, I'd like to say, at the same
load balancing rate, my proposed approach would allow us to have
fairness on the order of seconds. I'm less concerned about trashing
the cache. The important thing is to have a knob that allows users to
trade off fairness and performance based on their needs. [...]


such a knob already exists to a certain degree, but i haven't tested its
full effects on SMP fairness yet. If you pull my scheduler tree:

  git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched.git

and if you enable CONFIG_SCHED_DEBUG, then all the sched-domain
parameters become runtime tunable under /proc/sys/cpu*.

Could you try to increase the cross-CPU rebalancing frequency and see
how it impacts the precision of your measurement? Tune 'min_interval'
and 'max_interval' down to increase the frequency of rebalancing.

Ingo


Yes, I'll do it when I find time. If anyone is willing to do the
testing, please let me know and I can post my benchmark. On the other
hand, I don't think tuning the existing knobs will help solve the
problem. The root of the problem is that the current load balancing
doesn't take into account how much time each task has been running and
how much it's entitled to. This is why I'm proposing a new approach for
solving it. The new approach, as I said, will be much more fair/accurate
than the current one even without tuning those balancing intervals
(i.e., it's able to provide good fairness even if balancing is less
frequent).

 tong


Re: [RFC] scheduler: improve SMP fairness in CFS

2007-07-26 Thread Li, Tong N
On Thu, 2007-07-26 at 23:31 +0200, Ingo Molnar wrote:
> * Tong Li <[EMAIL PROTECTED]> wrote:
> 
> > > you need to measure it over longer periods of time. It's not worth 
> > > balancing for such a thing in any high-frequency manner. (we'd trash 
> > > the cache constantly migrating tasks back and forth.)
> > 
> > I have some data below, but before that, I'd like to say, at the same 
> > load balancing rate, my proposed approach would allow us to have 
> > fairness on the order of seconds. I'm less concerned about trashing 
> > the cache. The important thing is to have a knob that allows users to 
> > trade off fairness and performance based on their needs. [...]
> 
> such a knob already exists to a certain degree, but i haven't tested its 
> full effects on SMP fairness yet. If you pull my scheduler tree:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched.git
> 
> and if you enable CONFIG_SCHED_DEBUG, then all the sched-domain 
> parameters become runtime tunable under /proc/sys/cpu*.
> 
> Could you try to increase the cross-CPU rebalancing frequency and see 
> how it impacts the precision of your measurement? Tune 'min_interval' 
> and 'max_interval' down to increase the frequency of rebalancing.
> 
>   Ingo

Yes, I'll do it when I find time. If anyone is willing to do the
testing, please let me know and I can post my benchmark. On the other
hand, I don't think tuning the existing knobs will help solve the
problem. The root of the problem is that the current load balancing
doesn't take into account how much time each task has been running and
how much it's entitled to. This is why I'm proposing a new approach for
solving it. The new approach, as I said, will be much more fair/accurate
than the current one even without tuning those balancing intervals
(i.e., it's able to provide good fairness even if balancing is less
frequent).

  tong


Re: [RFC] scheduler: improve SMP fairness in CFS

2007-07-26 Thread Ingo Molnar

* Tong Li <[EMAIL PROTECTED]> wrote:

> > you need to measure it over longer periods of time. It's not worth 
> > balancing for such a thing in any high-frequency manner. (we'd trash 
> > the cache constantly migrating tasks back and forth.)
> 
> I have some data below, but before that, I'd like to say, at the same 
> load balancing rate, my proposed approach would allow us to have 
> fairness on the order of seconds. I'm less concerned about trashing 
> the cache. The important thing is to have a knob that allows users to 
> trade off fairness and performance based on their needs. [...]

such a knob already exists to a certain degree, but i haven't tested its 
full effects on SMP fairness yet. If you pull my scheduler tree:

  git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched.git

and if you enable CONFIG_SCHED_DEBUG, then all the sched-domain 
parameters become runtime tunable under /proc/sys/cpu*.

Could you try to increase the cross-CPU rebalancing frequency and see 
how it impacts the precision of your measurement? Tune 'min_interval' 
and 'max_interval' down to increase the frequency of rebalancing.

Ingo


Re: [RFC] scheduler: improve SMP fairness in CFS

2007-07-26 Thread Tong Li

On Wed, 25 Jul 2007, Ingo Molnar wrote:



* Tong Li <[EMAIL PROTECTED]> wrote:

> Thanks for the patch. It doesn't work well on my 8-way box. Here's the
> output at two different times. It's also changing all the time.

you need to measure it over longer periods of time. It's not worth
balancing for such a thing in any high-frequency manner. (we'd trash the
cache constantly migrating tasks back and forth.)


I have some data below, but before that, I'd like to say, at the same load 
balancing rate, my proposed approach would allow us to have fairness on 
the order of seconds. I'm less concerned about trashing the cache. The 
important thing is to have a knob that allows users to trade off fairness 
and performance based on their needs. This is in the same spirit as CFS, 
which by default can have much more frequent context switches than the old 
O(1) scheduler. A valid concern is that this is going to hurt cache and 
this is something that has not yet been benchmarked thoroughly. On the 
other hand, I fully support CFS on this because I like the idea of having 
a powerful infrastructure in place and exporting a set of knobs that give 
users full flexibility. This is also what I'm trying to achieve with my 
patch for SMP fairness.




the runtime ranges from 1:38.66 to 1:46.38 - that's a +-3% spread which
is pretty acceptable for ~90 seconds of runtime. Note that the
percentages you see above were likely done over a shorter period that's
why you see them range from 61% to 91%.

> I'm running 2.6.23-rc1 with your patch on a two-socket quad-core
> system. Top refresh rate is the default 3s.

the 3s is the problem: change that to 60s! We in no way want to
over-migrate for SMP fairness, the change i did gives us reasonable
long-term SMP fairness without the need for high-rate rebalancing.



I ran 10 nice 0 tasks on 8 CPUs. Each task does a trivial while (1) loop. 
I ran the whole thing for 300 seconds and, at every 60 seconds, I measured for 
each task the following:


1. Actual CPU time the task received during the past 60 seconds.
2. Ideal CPU time it would receive under a perfect fair scheduler.
3. Lag = ideal time - actual time
Positive lag means the task received less CPU time than its fair share and 
negative means it received more.

4. Error = lag / ideal time
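For reference, a small sketch of how these per-interval numbers can be computed from raw CPU-time samples (the helper is illustrative, not the benchmark code itself):

/*
 * Sketch: lag and error for one task over one sampling interval.
 * ideal = interval * nr_cpus / nr_tasks, e.g. 60 s * 8 / 10 = 48 s for
 * the equal-weight run below; actual is the CPU time the task received.
 */
struct fairness_sample {
	double actual;	/* seconds of CPU time received           */
	double ideal;	/* seconds under a perfect fair scheduler */
	double lag;	/* ideal - actual                         */
	double error;	/* lag / ideal                            */
};

static struct fairness_sample make_sample(double actual, double interval,
					  int nr_cpus, int nr_tasks)
{
	struct fairness_sample s;

	s.actual = actual;
	s.ideal  = interval * nr_cpus / nr_tasks;
	s.lag    = s.ideal - s.actual;
	s.error  = s.lag / s.ideal;
	return s;
}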

Results of CFS (2.6.23-rc1 with Ingo's SCHED_LOAD_SCALE_FUZZ change):

sampling interval 60 sampling length 300 num tasks 10
------------------------------------------------------
Task 0 (unit: second) nice 0 weight 1024

Wall Clock      CPU Time   Ideal Time   Lag     Error
04:17:09.314    51.14      48.00        -3.14   -6.54%
04:18:09.316    46.05      48.00         1.95    4.06%
04:19:09.317    44.13      48.00         3.87    8.06%
04:20:09.318    44.78      48.00         3.22    6.71%
04:21:09.320    45.35      48.00         2.65    5.52%
------------------------------------------------------
Task 1 (unit: second) nice 0 weight 1024

Wall Clock      CPU Time   Ideal Time   Lag     Error
04:17:09.314    47.06      48.00         0.94    1.96%
04:18:09.316    45.80      48.00         2.20    4.58%
04:19:09.317    49.95      48.00        -1.95   -4.06%
04:20:09.318    47.59      48.00         0.41    0.85%
04:21:09.320    46.50      48.00         1.50    3.12%
------------------------------------------------------
Task 2 (unit: second) nice 0 weight 1024

Wall Clock      CPU Time   Ideal Time   Lag     Error
04:17:09.314    48.14      48.00        -0.14   -0.29%
04:18:09.316    48.66      48.00        -0.66   -1.37%
04:19:09.317    47.52      48.00         0.48    1.00%
04:20:09.318    45.90      48.00         2.10    4.37%
04:21:09.320    45.89      48.00         2.11    4.40%
------------------------------------------------------
Task 3 (unit: second) nice 0 weight 1024

Wall Clock      CPU Time   Ideal Time   Lag     Error
04:17:09.314    48.71      48.00        -0.71   -1.48%
04:18:09.316    44.70      48.00         3.30    6.87%
04:19:09.317    45.86      48.00         2.14    4.46%
04:20:09.318    45.85      48.00         2.15    4.48%
04:21:09.320    44.91      48.00         3.09    6.44%
------------------------------------------------------
Task 4 (unit: second) nice 0 weight 1024

Wall Clock      CPU Time   Ideal Time   Lag     Error
04:17:09.314    49.24      48.00        -1.24   -2.58%
04:18:09.316    45.73      48.00         2.27    4.73%
04:19:09.317    49.13      48.00        -1.13   -2.35%
04:20:09.318    45.37      48.00         2.63    5.48%
04:21:09.320    45.75      48.00         2.25    4.69%
------------------------------------------------------

Re: [RFC] scheduler: improve SMP fairness in CFS

2007-07-25 Thread Chris Snook

Li, Tong N wrote:

On Wed, 2007-07-25 at 16:55 -0400, Chris Snook wrote:

Chris Friesen wrote:

Ingo Molnar wrote:

the 3s is the problem: change that to 60s! We in no way want to 
over-migrate for SMP fairness, the change i did gives us reasonable 
long-term SMP fairness without the need for high-rate rebalancing.
Actually, I do have requirements from our engineering guys for 
short-term fairness.  They'd actually like decent fairness over even 
shorter intervals...1 second would be nice, 2 is acceptable.


They are willing to trade off random peak performance for predictability.

Chris

The sysctls for CFS have nanosecond resolution.  They default to 
millisecond-order values, but you can set them much lower.  See sched_fair.c for 
the knobs and their explanations.


-- Chris


This is incorrect. Those knobs control local-CPU fairness granularity
but have no control over fairness across CPUs.

I'll do some benchmarking as Ingo suggested.

  tong


CFS naturally enforces cross-CPU fairness anyway, as Ingo demonstrated. 
Lowering the local CPU parameters should cause system-wide fairness to converge 
faster.  It might be worthwhile to create a more explicit knob for this, but I'm 
inclined to believe we could do it in much less than 700 lines.  Ingo's 
one-liner to improve the 10/8 balancing case, and the resulting improvement, 
were exactly what I was saying should be possible and desirable.  TCP Nagle 
aside, it generally shouldn't take 700 lines of code to speed up the rate of 
convergence of something that already converges.


Until now I've been watching the scheduler rewrite from the sidelines, but I'm 
digging into it now.  I'll try to give some more constructive criticism soon.


-- Chris


Re: [RFC] scheduler: improve SMP fairness in CFS

2007-07-25 Thread Li, Tong N
On Wed, 2007-07-25 at 16:55 -0400, Chris Snook wrote:
> Chris Friesen wrote:
> > Ingo Molnar wrote:
> > 
> >> the 3s is the problem: change that to 60s! We in no way want to 
> >> over-migrate for SMP fairness, the change i did gives us reasonable 
> >> long-term SMP fairness without the need for high-rate rebalancing.
> > 
> > Actually, I do have requirements from our engineering guys for 
> > short-term fairness.  They'd actually like decent fairness over even 
> > shorter intervals...1 second would be nice, 2 is acceptable.
> > 
> > They are willing to trade off random peak performance for predictability.
> > 
> > Chris
> > 
> 
> The sysctls for CFS have nanosecond resolution.  They default to 
> millisecond-order values, but you can set them much lower.  See sched_fair.c 
> for 
> the knobs and their explanations.
> 
>   -- Chris

This is incorrect. Those knobs control local-CPU fairness granularity
but have no control over fairness across CPUs.

I'll do some benchmarking as Ingo suggested.

  tong


Re: [RFC] scheduler: improve SMP fairness in CFS

2007-07-25 Thread Chris Snook

Chris Friesen wrote:

Ingo Molnar wrote:

the 3s is the problem: change that to 60s! We in no way want to 
over-migrate for SMP fairness, the change i did gives us reasonable 
long-term SMP fairness without the need for high-rate rebalancing.


Actually, I do have requirements from our engineering guys for 
short-term fairness.  They'd actually like decent fairness over even 
shorter intervals...1 second would be nice, 2 is acceptable.


They are willing to trade off random peak performance for predictability.

Chris



The sysctls for CFS have nanosecond resolution.  They default to 
millisecond-order values, but you can set them much lower.  See sched_fair.c for 
the knobs and their explanations.


-- Chris


Re: [RFC] scheduler: improve SMP fairness in CFS

2007-07-25 Thread Chris Friesen

Ingo Molnar wrote:

the 3s is the problem: change that to 60s! We in no way want to 
over-migrate for SMP fairness, the change i did gives us reasonable 
long-term SMP fairness without the need for high-rate rebalancing.


Actually, I do have requirements from our engineering guys for 
short-term fairness.  They'd actually like decent fairness over even 
shorter intervals...1 second would be nice, 2 is acceptable.


They are willing to trade off random peak performance for predictability.

Chris



Re: [RFC] scheduler: improve SMP fairness in CFS

2007-07-25 Thread Ingo Molnar

* Tong Li <[EMAIL PROTECTED]> wrote:

> Thanks for the patch. It doesn't work well on my 8-way box. Here's the 
> output at two different times. It's also changing all the time.

you need to measure it over longer periods of time. It's not worth 
balancing for such a thing in any high-frequency manner. (we'd trash the 
cache constantly migrating tasks back and forth.)

> [time 2]
>  3721 tongli    20   0  7864  108   16 R   91  0.0   1:44.67 loop
>  3720 tongli    20   0  7864  108   16 R   89  0.0   1:40.55 loop
>  3722 tongli    20   0  7864  108   16 R   86  0.0   1:41.40 loop
>  3723 tongli    20   0  7864  108   16 R   82  0.0   1:38.94 loop
>  3715 tongli    20   0  7864  108   16 R   82  0.0   1:43.09 loop
>  3717 tongli    20   0  7864  108   16 R   79  0.0   1:46.38 loop
>  3718 tongli    20   0  7864  108   16 R   79  0.0   1:40.72 loop
>  3714 tongli    20   0  7864  108   16 R   77  0.0   1:40.17 loop
>  3719 tongli    20   0  7864  108   16 R   73  0.0   1:35.41 loop
>  3716 tongli    20   0  7864  108   16 R   61  0.0   1:38.66 loop

the runtime ranges from 1:38.66 to 1:46.38 - that's a +-3% spread which 
is pretty acceptable for ~90 seconds of runtime. Note that the 
percentages you see above were likely done over a shorter period - that's 
why you see them range from 61% to 91%.

> I'm running 2.6.23-rc1 with your patch on a two-socket quad-core 
> system. Top refresh rate is the default 3s.

the 3s is the problem: change that to 60s! We in no way want to 
over-migrate for SMP fairness, the change i did gives us reasonable 
long-term SMP fairness without the need for high-rate rebalancing.

Ingo


Re: [RFC] scheduler: improve SMP fairness in CFS

2007-07-25 Thread Ingo Molnar

* Li, Tong N <[EMAIL PROTECTED]> wrote:

> The problem I see is that the current load balancing code is quite 
> heuristic and can be quite inaccurate sometimes. Its goal is to 
> maintain roughly equal load on each core, where load is defined to be 
> the sum of the weights of all tasks on a core. If all tasks have the 
> same weight, this is a simple problem. If different weights exist, 
> this is an NP-hard problem and our current heuristic can perform badly 
> under various workloads. A simple example: suppose we have four tasks on 
> two cores with weights 1, 5, 7, 7. The balanced partition 
> would have 1 and 7 on core 1 and 5 and 7 on core 2. But you can see 
> the load isn't evenly distributed; in fact, it's not possible to 
> evenly distribute the load in this case. Thus, my opinion is that only 
> balancing the load as we do now is not enough to achieve SMP fairness. 
> My patch was intended to address this problem.

could you please try my patch nevertheless and see how much it differs 
in practice from your solution, and cite specific examples and 
measurements? (please measure over longer periods of time such as 120 
seconds or 300 seconds - so that slow-rate balancing has its chance to 
work)

Ingo


Re: [RFC] scheduler: improve SMP fairness in CFS

2007-07-25 Thread Li, Tong N
On Wed, 2007-07-25 at 14:03 +0200, Ingo Molnar wrote:

> Signed-off-by: Ingo Molnar <[EMAIL PROTECTED]>
> ---
>  include/linux/sched.h |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> Index: linux/include/linux/sched.h
> ===================================================================
> --- linux.orig/include/linux/sched.h
> +++ linux/include/linux/sched.h
> @@ -681,7 +681,7 @@ enum cpu_idle_type {
>  #define SCHED_LOAD_SHIFT	10
>  #define SCHED_LOAD_SCALE	(1L << SCHED_LOAD_SHIFT)
>  
> -#define SCHED_LOAD_SCALE_FUZZ	(SCHED_LOAD_SCALE >> 5)
> +#define SCHED_LOAD_SCALE_FUZZ	(SCHED_LOAD_SCALE >> 1)
>  
>  #ifdef CONFIG_SMP
>  #define SD_LOAD_BALANCE		1	/* Do load balancing on this domain. */

Hi Ingo,

The problem I see is that the current load balancing code is quite
hueristic and can be quite inaccurate sometimes. Its goal is to maintain
roughly equal load on each core, where load is defined to be the sum of
the weights of all tasks on a core. If all tasks have the same weight,
this is a simple problem. If different weights exist, this is an NP-hard
problem and our current heuristic can perform badly under various
workloads. A simple example: if we have four tasks on two cores and they
have weights 1, 5, 7, 7. The balanced partition would have 1 and 7 on
core 1 and 5 and 7 on core 2. But you can see the load isn't evenly
distributed; in fact, it's not possible to evenly distribute the load in
this case. Thus, my opinion is that only balancing the load as we do now
is not enough to achieve SMP fairness. My patch was intended to address
this problem.
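
As a quick illustration of the partitioning problem, here is a minimal
user-space sketch (not kernel code; the weights are hard-coded to the
1, 5, 7, 7 example above). It brute-forces every 2-way split and confirms
the best achievable split is 8 vs. 12, so an exactly even load is
impossible here:

/* Illustration only: brute-force the best 2-way split of the task
 * weights from the example above.  Not kernel code. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	int w[] = { 1, 5, 7, 7 };
	int n = 4, total = 20, best = total;
	unsigned mask;

	for (mask = 0; mask < (1u << n); mask++) {
		int load0 = 0, i;

		for (i = 0; i < n; i++)
			if (mask & (1u << i))
				load0 += w[i];
		if (abs(2 * load0 - total) < best)
			best = abs(2 * load0 - total);
	}
	/* prints 4: the best split is 8 vs 12, never perfectly even */
	printf("minimum load difference: %d\n", best);
	return 0;
}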

  tong

PS. I now have a lockless implementation of my patch. Will post later
today.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] scheduler: improve SMP fairness in CFS

2007-07-25 Thread Tong Li

On Wed, 25 Jul 2007, Ingo Molnar wrote:


they now nicely converge to the expected 80% long-term CPU usage.

so, could you please try the patch below, does it work for you too?



Thanks for the patch. It doesn't work well on my 8-way box. Here's the 
output at two different times. It's also changing all the time.


[time 1]
 3717 tongli    20   0  7864  108   16 R   85  0.0   1:38.00 loop
 3721 tongli    20   0  7864  108   16 R   85  0.0   1:37.51 loop
 3723 tongli    20   0  7864  108   16 R   84  0.0   1:30.48 loop
 3714 tongli    20   0  7864  108   16 R   83  0.0   1:33.25 loop
 3720 tongli    20   0  7864  108   16 R   81  0.0   1:31.89 loop
 3715 tongli    20   0  7864  108   16 R   78  0.0   1:36.38 loop
 3719 tongli    20   0  7864  108   16 R   78  0.0   1:28.80 loop
 3718 tongli    20   0  7864  108   16 R   78  0.0   1:33.84 loop
 3722 tongli    20   0  7864  108   16 R   76  0.0   1:34.69 loop
 3716 tongli    20   0  7864  108   16 R   70  0.0   1:33.15 loop

[time 2]
 3721 tongli    20   0  7864  108   16 R   91  0.0   1:44.67 loop
 3720 tongli    20   0  7864  108   16 R   89  0.0   1:40.55 loop
 3722 tongli    20   0  7864  108   16 R   86  0.0   1:41.40 loop
 3723 tongli    20   0  7864  108   16 R   82  0.0   1:38.94 loop
 3715 tongli    20   0  7864  108   16 R   82  0.0   1:43.09 loop
 3717 tongli    20   0  7864  108   16 R   79  0.0   1:46.38 loop
 3718 tongli    20   0  7864  108   16 R   79  0.0   1:40.72 loop
 3714 tongli    20   0  7864  108   16 R   77  0.0   1:40.17 loop
 3719 tongli    20   0  7864  108   16 R   73  0.0   1:35.41 loop
 3716 tongli    20   0  7864  108   16 R   61  0.0   1:38.66 loop

I'm running 2.6.23-rc1 with your patch on a two-socket quad-core system. 
Top refresh rate is the default 3s.


  tong
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] scheduler: improve SMP fairness in CFS

2007-07-25 Thread Ingo Molnar

* Ingo Molnar <[EMAIL PROTECTED]> wrote:

> > This patch extends CFS to achieve better fairness for SMPs. For 
> > example, with 10 tasks (same priority) on 8 CPUs, it enables each task 
> > to receive equal CPU time (80%). [...]
> 
> hm, CFS should already offer reasonable long-term SMP fairness. It 
> certainly works on a dual-core box, i just started 3 tasks of the same 
> priority on 2 CPUs, and on vanilla 2.6.23-rc1 the distribution is 
> this:
> 
>   PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>  7084 mingo 20   0  1576  248  196 R   67  0.0   0:50.13 loop
>  7083 mingo 20   0  1576  244  196 R   66  0.0   0:48.86 loop
>  7085 mingo 20   0  1576  244  196 R   66  0.0   0:49.45 loop
> 
> so each task gets a perfect 66% of CPU time.
> 
> prior to CFS, we indeed did a 50%/50%/100% split - so for example on 
> v2.6.22:
> 
>   PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>  2256 mingo 25   0  1580  248  196 R  100  0.0   1:03.19 loop
>  2255 mingo 25   0  1580  248  196 R   50  0.0   0:31.79 loop
>  2257 mingo 25   0  1580  248  196 R   50  0.0   0:31.69 loop
> 
> but CFS has changed that behavior.
> 
> I'll check your 10-tasks-on-8-cpus example on an 8-way box too, maybe 
> we regressed somewhere ...

ok, i just tried it on an 8-cpu box and indeed, unlike the dual-core 
case, the scheduler does not distribute tasks well enough:

  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 2572 mingo 20   0  1576  244  196 R  100  0.0   1:03.61 loop
 2578 mingo 20   0  1576  248  196 R  100  0.0   1:03.59 loop
 2576 mingo 20   0  1576  248  196 R  100  0.0   1:03.52 loop
 2571 mingo 20   0  1576  244  196 R  100  0.0   1:03.46 loop
 2569 mingo 20   0  1576  244  196 R   99  0.0   1:03.36 loop
 2570 mingo 20   0  1576  244  196 R   95  0.0   1:00.55 loop
 2577 mingo 20   0  1576  248  196 R   50  0.0   0:31.88 loop
 2574 mingo 20   0  1576  248  196 R   50  0.0   0:31.87 loop
 2573 mingo 20   0  1576  248  196 R   50  0.0   0:31.86 loop
 2575 mingo 20   0  1576  248  196 R   50  0.0   0:31.86 loop

but this is relatively easy to fix - with the patch below applied, it 
looks a lot better:

  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 2681 mingo 20   0  1576  244  196 R   85  0.0   3:51.68 loop
 2688 mingo 20   0  1576  244  196 R   81  0.0   3:46.35 loop
 2682 mingo 20   0  1576  244  196 R   80  0.0   3:43.68 loop
 2685 mingo 20   0  1576  248  196 R   80  0.0   3:45.97 loop
 2683 mingo 20   0  1576  248  196 R   80  0.0   3:40.25 loop
 2679 mingo 20   0  1576  244  196 R   80  0.0   3:33.53 loop
 2680 mingo 20   0  1576  244  196 R   79  0.0   3:43.53 loop
 2686 mingo 20   0  1576  244  196 R   79  0.0   3:39.31 loop
 2687 mingo 20   0  1576  244  196 R   78  0.0   3:33.31 loop
 2684 mingo 20   0  1576  244  196 R   77  0.0   3:27.52 loop

they now nicely converge to the expected 80% long-term CPU usage.

so, could you please try the patch below, does it work for you too?

Ingo

--->
Subject: sched: increase SCHED_LOAD_SCALE_FUZZ
From: Ingo Molnar <[EMAIL PROTECTED]>

increase SCHED_LOAD_SCALE_FUZZ, which adds a small amount of
over-balancing, to help distribute CPU-bound tasks more fairly on SMP
systems.

the problem of unfair balancing was noticed and reported by Tong N Li.
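
For reference, the arithmetic behind the change, using the constants from
the hunk quoted elsewhere in this thread (SCHED_LOAD_SHIFT = 10):

/* Arithmetic only, from the constants in that hunk. */
#define SCHED_LOAD_SHIFT	10
#define SCHED_LOAD_SCALE	(1L << SCHED_LOAD_SHIFT)	/* 1024 */
/* old fuzz: SCHED_LOAD_SCALE >> 5 = 1024 / 32 =  32 */
/* new fuzz: SCHED_LOAD_SCALE >> 1 = 1024 /  2 = 512 */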

10 CPU-bound tasks running on 8 CPUs, v2.6.23-rc1:

  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 2572 mingo 20   0  1576  244  196 R  100  0.0   1:03.61 loop
 2578 mingo 20   0  1576  248  196 R  100  0.0   1:03.59 loop
 2576 mingo 20   0  1576  248  196 R  100  0.0   1:03.52 loop
 2571 mingo 20   0  1576  244  196 R  100  0.0   1:03.46 loop
 2569 mingo 20   0  1576  244  196 R   99  0.0   1:03.36 loop
 2570 mingo 20   0  1576  244  196 R   95  0.0   1:00.55 loop
 2577 mingo 20   0  1576  248  196 R   50  0.0   0:31.88 loop
 2574 mingo 20   0  1576  248  196 R   50  0.0   0:31.87 loop
 2573 mingo 20   0  1576  248  196 R   50  0.0   0:31.86 loop
 2575 mingo 20   0  1576  248  196 R   50  0.0   0:31.86 loop

v2.6.23-rc1 + patch:

  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 2681 mingo 20   0  1576  244  196 R   85  0.0   3:51.68 loop
 2688 mingo 20   0  1576  244  196 R   81  0.0   3:46.35 loop
 2682 mingo 20   0  1576  244  196 R   80  0.0   3:43.68 loop
 2685 mingo 20   0  1576  248  196 R   80  0.0   3:45.97 loop
 2683 mingo 20   0  1576  248  196 R   80  0.0   3:40.25 loop
 2679 mingo 20   0  1576  244  196 R   80  0.0   3:33.53 loop
 2680 mingo 20   0  1576  244  196 R   79  0.0   3:43.53 loop
 2686 mingo 20   0  1576  244  196 R   79  0.0   3:39.31 loop
 2687 mingo 20   0  1576  244  196 R   78  0.0   3:33.31 loop
 2684 mingo 20   0  1576  244  196 R   77  0.0   3:27.52 loop

so they now nicely converge to the expected 80% long-term CPU usage.

Re: [RFC] scheduler: improve SMP fairness in CFS

2007-07-25 Thread Ingo Molnar

* Tong Li <[EMAIL PROTECTED]> wrote:

> This patch extends CFS to achieve better fairness for SMPs. For 
> example, with 10 tasks (same priority) on 8 CPUs, it enables each task 
> to receive equal CPU time (80%). [...]

hm, CFS should already offer reasonable long-term SMP fairness. It 
certainly works on a dual-core box, i just started 3 tasks of the same 
priority on 2 CPUs, and on vanilla 2.6.23-rc1 the distribution is this:

  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 7084 mingo 20   0  1576  248  196 R   67  0.0   0:50.13 loop
 7083 mingo 20   0  1576  244  196 R   66  0.0   0:48.86 loop
 7085 mingo 20   0  1576  244  196 R   66  0.0   0:49.45 loop

so each task gets a perfect 66% of CPU time.

prior to CFS, we indeed did a 50%/50%/100% split - so for example on 
v2.6.22:

  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 2256 mingo 25   0  1580  248  196 R  100  0.0   1:03.19 loop
 2255 mingo 25   0  1580  248  196 R   50  0.0   0:31.79 loop
 2257 mingo 25   0  1580  248  196 R   50  0.0   0:31.69 loop

but CFS has changed that behavior.

I'll check your 10-tasks-on-8-cpus example on an 8-way box too, maybe we 
regressed somewhere ...

Ingo
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] scheduler: improve SMP fairness in CFS

2007-07-24 Thread Chris Friesen

Chris Snook wrote:

A fraction of *each* CPU, or a fraction of *total* CPU?  Per-cpu 
granularity doesn't make anything more fair.


Well, our current solution uses per-cpu weights, because our vendor 
couldn't get the load balancer working accurately enough.  Having 
per-cpu weights and cpu affinity gives acceptable results for the case 
where we're currently using it.


If the load balancer is good enough, per-system weights would be fine. 
It would have to play nicely with affinity though, in the case where it 
makes sense to lock tasks to particular cpus.


If I have two threads with the same priority, and two CPUs, the 
scheduler will put one on each CPU, and they'll run happily without any 
migration or balancing.


Sure.  Now add a third thread.  How often do you migrate?  Put another 
way, over what time quantum do we ensure fairness?


Chris
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] scheduler: improve SMP fairness in CFS

2007-07-24 Thread hui
On Tue, Jul 24, 2007 at 05:22:47PM -0400, Chris Snook wrote:
> Bill Huey (hui) wrote:
> Well, you need enough CPU time to meet your deadlines.  You need 
> pre-allocated memory, or to be able to guarantee that you can allocate 
> memory fast enough to meet your deadlines.  This principle extends to any 
> other shared resource, such as disk or network.  I'm being vague because 
> it's open-ended.  If a medical device fails to meet realtime guarantees 
> because the battery fails, the patient's family isn't going to care how 
> correct the software is.  Realtime engineering is hard.
...
> Actually, it's worse than merely an open problem.  A clairvoyant fair 
> scheduler with perfect future knowledge can underperform a heuristic fair 
> scheduler, because the heuristic scheduler can guess the future incorrectly 
> resulting in unfair but higher-throughput behavior.  This is a perfect 
> example of why we only try to be as fair as is beneficial.

I'm glad we agree on the above points. :)

It might be that there needs to be another, more strict policy than what goes
into SCHED_OTHER, in that we also need a SCHED_ISO or something that has more
strict rebalancing semantics for -rt applications, sort of a super SCHED_RR.
That's definitely needed, and I don't see how the current CFS implementation
can deal with this properly at this time, even with numerical running
averages, etc...

SCHED_FIFO is another issue, but this is actually more complicated than just
per-cpu run queues in that it requires a global priority analysis. I don't see
how CFS can deal with SCHED_FIFO efficiently without moving to a single run
queue. This is kind of a complicated problem with a significant set of
trade-offs to take into account (cpu binding, etc..)

>> Tong's previous trio patch is an attempt at resolving this using a generic
>> grouping mechanism and some constructive discussion should come of it.
>
> Sure, but it seems to me to be largely orthogonal to this patch.

It's based on the same kinds of ideas that he's been experimenting with in
Trio. I can't name a single other engineer who has posted to lkml recently
with quite the depth of experience in this area that he has. It would be
nice to facilitate/incorporate some of his ideas, or get him to work on
something to this end that's suitable for inclusion in some tree somewhere.

bill

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] scheduler: improve SMP fairness in CFS

2007-07-24 Thread Chris Snook

Chris Friesen wrote:

Chris Snook wrote:

I don't think Chris's scenario has much bearing on your patch.  What 
he wants is to have a task that will always be running, but can't 
monopolize either CPU. This is useful for certain realtime workloads, 
but as I've said before, realtime requires explicit resource 
allocation.  I don't think this is very relevant to SCHED_FAIR balancing.


I'm not actually using the scenario I described, it's just sort of a 
worst-case load-balancing thought experiment.


What we want to be able to do is to specify a fraction of each cpu for 
each task group.  We don't want to have to affine tasks to particular cpus.


A fraction of *each* CPU, or a fraction of *total* CPU?  Per-cpu granularity 
doesn't make anything more fair.  You've got a big bucket of MIPS you want to 
divide between certain groups, but it shouldn't make a difference which CPUs 
those MIPS come from, other than the fact that we try to minimize overhead 
induced by migration.


This means that the load balancer must be group-aware, and must trigger 
a re-balance (possibly just for a particular group) as soon as the cpu 
allocation for that group is used up on a particular cpu.


If I have two threads with the same priority, and two CPUs, the scheduler will 
put one on each CPU, and they'll run happily without any migration or balancing. 
 It sounds like you're saying that every X milliseconds, you want both to 
expire, be forbidden from running on the current CPU for the next X 
milliseconds, and then migrated to the other CPU.  There's no gain in fairness 
here, and there's a big drop in performance.


I suggested local fairness as a means to achieve global fairness because it 
could reduce overhead, and by adding the margin of error at each level in the 
locality hierarchy, you can get an algorithm which naturally tolerates the level 
of unfairness beyond which it is impossible to optimize.  Strict local fairness 
for its own sake doesn't accomplish anything that's better than global fairness.


-- Chris
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] scheduler: improve SMP fairness in CFS

2007-07-24 Thread hui
On Tue, Jul 24, 2007 at 04:39:47PM -0400, Chris Snook wrote:
> Chris Friesen wrote:
>> We currently use CKRM on an SMP machine, but the only way we can get away 
>> with it is because our main app is affined to one cpu and just about 
>> everything else is affined to the other.
>
> If you're not explicitly allocating resources, you're just low-latency, not 
> truly realtime.  Realtime requires guaranteed resources, so messing with 
> affinities is a necessary evil.

You've mentioned this twice in this thread. If you're going to talk about this
you should characterize it more specifically, because resource allocation is
a rather incomplete area in Linux. Rebalancing was still an open research
problem the last time I looked.

Tong's previous trio patch is an attempt at resolving this using a generic
grouping mechanism and some constructive discussion should come of it.

bill

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] scheduler: improve SMP fairness in CFS

2007-07-24 Thread Chris Friesen

Chris Snook wrote:

I don't think Chris's scenario has much bearing on your patch.  What he 
wants is to have a task that will always be running, but can't 
monopolize either CPU. This is useful for certain realtime workloads, 
but as I've said before, realtime requires explicit resource 
allocation.  I don't think this is very relevant to SCHED_FAIR balancing.


I'm not actually using the scenario I described, it's just sort of a 
worst-case load-balancing thought experiment.


What we want to be able to do is to specify a fraction of each cpu for 
each task group.  We don't want to have to affine tasks to particular cpus.


This means that the load balancer must be group-aware, and must trigger 
a re-balance (possibly just for a particular group) as soon as the cpu 
allocation for that group is used up on a particular cpu.


I haven't tried the latest CFS group scheduler patches, so they may 
provide the sort of accuracy we're looking for.  I'll have to try and 
find some time to test them out.


Chris
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] scheduler: improve SMP fairness in CFS

2007-07-24 Thread Chris Snook

Bill Huey (hui) wrote:

On Tue, Jul 24, 2007 at 04:39:47PM -0400, Chris Snook wrote:

Chris Friesen wrote:
We currently use CKRM on an SMP machine, but the only way we can get away 
with it is because our main app is affined to one cpu and just about 
everything else is affined to the other.
If you're not explicitly allocating resources, you're just low-latency, not 
truly realtime.  Realtime requires guaranteed resources, so messing with 
affinities is a necessary evil.


You've mentioned this twice in this thread. If you're going to talk about this
you should characterize it more specifically, because resource allocation is
a rather incomplete area in Linux.


Well, you need enough CPU time to meet your deadlines.  You need pre-allocated 
memory, or to be able to guarantee that you can allocate memory fast enough to 
meet your deadlines.  This principle extends to any other shared resource, such 
as disk or network.  I'm being vague because it's open-ended.  If a medical 
device fails to meet realtime guarantees because the battery fails, the 
patient's family isn't going to care how correct the software is.  Realtime 
engineering is hard.



Rebalancing was still an open research
problem the last time I looked.


Actually, it's worse than merely an open problem.  A clairvoyant fair scheduler 
with perfect future knowledge can underperform a heuristic fair scheduler, 
because the heuristic scheduler can guess the future incorrectly resulting in 
unfair but higher-throughput behavior.  This is a perfect example of why we only 
try to be as fair as is beneficial.



Tong's previous trio patch is an attempt at resolving this using a generic
grouping mechanism and some constructive discussion should come of it.


Sure, but it seems to me to be largely orthogonal to this patch.

-- Chris
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] scheduler: improve SMP fairness in CFS

2007-07-24 Thread Chris Friesen

Chris Snook wrote:

We have another SMP box that would benefit from group scheduling, but 
we can't use it because the load balancer is not nearly good enough.



Which scheduler?  Have you tried the CFS group scheduler patches?


CKRM as well.

Haven't tried the CFS group scheduler, as we're stuck on a way old 
kernel and I've been too busy to play with the new stuff.


Chris
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] scheduler: improve SMP fairness in CFS

2007-07-24 Thread Chris Snook

Li, Tong N wrote:

On Tue, 2007-07-24 at 16:39 -0400, Chris Snook wrote:

Divining the intentions of the administrator is an AI-complete problem and we're 
not going to try to solve that in the kernel.  An intelligent administrator 
could also allocate 50% of each CPU to a resource group containing all the 
*other* processes.  Then, when the other processes are scheduled out, your 
single task will run on whichever CPU is idle.  This will very quickly 
equilibrate to the scheduling ping-pong you seem to want.  The scheduler 
deliberately avoids this kind of migration by default because it hurts cache and 
TLB performance, so if you want to override this very sane default behavior, 
you're going to have to explicitly configure it yourself.




Well, the admin wouldn't specifically ask for 50% of each CPU. He would
just allocate 50% of total CPU time---it's up to the scheduler to
fulfill that. If a task is entitled to one CPU, then it'll stay there
and have no migration. Migration occurs only if there's overload, in
which case I think you agree in your last email that the cache and TLB
impact is not an issue (at least in SMP).


I don't think Chris's scenario has much bearing on your patch.  What he wants is 
to have a task that will always be running, but can't monopolize either CPU. 
This is useful for certain realtime workloads, but as I've said before, realtime 
requires explicit resource allocation.  I don't think this is very relevant to 
SCHED_FAIR balancing.


-- Chris
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] scheduler: improve SMP fairness in CFS

2007-07-24 Thread Li, Tong N
On Tue, 2007-07-24 at 16:39 -0400, Chris Snook wrote:

> Divining the intentions of the administrator is an AI-complete problem and 
> we're 
> not going to try to solve that in the kernel.  An intelligent administrator 
> could also allocate 50% of each CPU to a resource group containing all the 
> *other* processes.  Then, when the other processes are scheduled out, your 
> single task will run on whichever CPU is idle.  This will very quickly 
> equilibrate to the scheduling ping-pong you seem to want.  The scheduler 
> deliberately avoids this kind of migration by default because it hurts cache 
> and 
> TLB performance, so if you want to override this very sane default behavior, 
> you're going to have to explicitly configure it yourself.
> 

Well, the admin wouldn't specifically ask for 50% of each CPU. He would
just allocate 50% of total CPU time---it's up to the scheduler to
fulfill that. If a task is entitled to one CPU, then it'll stay there
and have no migration. Migration occurs only if there's overload, in
which case I think you agree in your last email that the cache and TLB
impact is not an issue (at least in SMP).

  tong
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] scheduler: improve SMP fairness in CFS

2007-07-24 Thread Chris Snook

Chris Friesen wrote:

Chris Snook wrote:

Concerns aside, I agree that fairness is important, and I'd really 
like to see a test case that demonstrates the problem.


One place that might be useful is the case of fairness between resource 
groups, where the load balancer needs to consider each group separately.


You mean like the CFS group scheduler patches?  I don't see how this patch is 
related to that, besides working on top of it.


Now it may be the case that trying to keep the load of each class within 
X% of the other cpus is sufficient, but it's not trivial.


I agree.  My suggestion is that we try being fair from the bottom-up, rather 
than top-down.  If most of the rebalancing is local, we can minimize expensive 
locking and cross-node migrations, and scale very nicely on large NUMA boxes.


Consider the case where you have a resource group that is allocated 50% 
of each cpu in a dual cpu system, and only have a single task in that 
group.  This means that in order to make use of the full group 
allocation, that task needs to be load-balanced to the other cpu as soon 
as it gets scheduled out.  Most load-balancers can't handle that kind of 
granularity, but I have guys in our engineering team that would really 
like this level of performance.


Divining the intentions of the administrator is an AI-complete problem and we're 
not going to try to solve that in the kernel.  An intelligent administrator 
could also allocate 50% of each CPU to a resource group containing all the 
*other* processes.  Then, when the other processes are scheduled out, your 
single task will run on whichever CPU is idle.  This will very quickly 
equilibrate to the scheduling ping-pong you seem to want.  The scheduler 
deliberately avoids this kind of migration by default because it hurts cache and 
TLB performance, so if you want to override this very sane default behavior, 
you're going to have to explicitly configure it yourself.


We currently use CKRM on an SMP machine, but the only way we can get 
away with it is because our main app is affined to one cpu and just 
about everything else is affined to the other.


If you're not explicitly allocating resources, you're just low-latency, not 
truly realtime.  Realtime requires guaranteed resources, so messing with 
affinities is a necessary evil.


We have another SMP box that would benefit from group scheduling, but we 
can't use it because the load balancer is not nearly good enough.


Which scheduler?  Have you tried the CFS group scheduler patches?

-- Chris
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] scheduler: improve SMP fairness in CFS

2007-07-24 Thread Chris Friesen

Chris Snook wrote:

Concerns aside, I agree that fairness is important, and I'd really like 
to see a test case that demonstrates the problem.


One place that might be useful is the case of fairness between resource 
groups, where the load balancer needs to consider each group separately.


Now it may be the case that trying to keep the load of each class within 
X% of the other cpus is sufficient, but it's not trivial.


Consider the case where you have a resource group that is allocated 50% 
of each cpu in a dual cpu system, and only have a single task in that 
group.  This means that in order to make use of the full group 
allocation, that task needs to be load-balanced to the other cpu as soon 
as it gets scheduled out.  Most load-balancers can't handle that kind of 
granularity, but I have guys in our engineering team that would really 
like this level of performance.


We currently use CKRM on an SMP machine, but the only way we can get 
away with it is because our main app is affined to one cpu and just 
about everything else is affined to the other.


We have another SMP box that would benefit from group scheduling, but we 
can't use it because the load balancer is not nearly good enough.


Chris
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] scheduler: improve SMP fairness in CFS

2007-07-24 Thread Chris Snook

Tong Li wrote:

On Mon, 23 Jul 2007, Chris Snook wrote:

This patch is massive overkill.  Maybe you're not seeing the overhead 
on your 8-way box, but I bet we'd see it on a 4096-way NUMA box with a 
partially-RT workload.  Do you have any data justifying the need for 
this patch?


Doing anything globally is expensive, and should be avoided at all 
costs. The scheduler already rebalances when a CPU is idle, so you're 
really just rebalancing the overload here.  On a server workload, we 
don't necessarily want to do that, since the overload may be multiple 
threads spawned to service a single request, and could be sharing a 
lot of data.


Instead of an explicit system-wide fairness invariant (which will get 
very hard to enforce when you throw SCHED_FIFO processes into the mix 
and the scheduler isn't running on some CPUs), try a simpler 
invariant.  If we guarantee that the load on CPU X does not differ 
from the load on CPU (X+1)%N by more than some small constant, then we 
know that the system is fairly balanced.  We can achieve global 
fairness with local balancing, and avoid all this overhead.  This has 
the added advantage of keeping most of the migrations 
core/socket/node-local on SMT/multicore/NUMA systems.




Chris,

These are all good comments. Thanks. I see three concerns and I'll try 
to address each.


1. Unjustified effort/cost

My view is that fairness (or proportional fairness) is a first-order 
metric and necessary in many cases even at the cost of performance.


In the cases where it's critical, we have realtime.  In the cases where it's 
important, this implementation won't keep latency low enough to make people 
happier.  If you've got a test case to prove me wrong, I'd like to see it.


A 
server running multiple client apps certainly doesn't want the clients 
to see that they are getting different amounts of service, assuming the 
clients are of equal importance (priority).


A conventional server receives client requests, does a brief amount of work, and 
then gives a response.  This patch doesn't help that workload.  This patch helps 
the case where you've got batch jobs running on a slightly overloaded compute 
server, and unfairness means you end up waiting for a couple threads to finish 
at the end while CPUs sit idle.  I don't think it's that big of a problem, and 
if it is, I think we can solve it in a more elegant way than reintroducing 
expired queues.


When the clients have 
different priorities, the server also wants to give them service time 
proportional to their priority/weight. The same is true for desktops, 
where users want to nice tasks and see an effect that's consistent with 
what they expect, i.e., task CPU time should be proportional to their 
nice values. The point is that it's important to enforce fairness 
because it enables users to control the system in a deterministic way 
and it helps each task get good response time. CFS achieves this on 
local CPUs and this patch makes the support stronger for SMPs. It's 
overkill to enforce an unnecessary degree of fairness, but it is necessary 
to enforce an error bound, even if large, such that the user can 
reliably know what kind of CPU time (even performance) he'd get after 
making a nice value change.


Doesn't CFS already do this?

This patch ensures an error bound of (max 
task weight currently in system) * sysctl_base_round_slice compared to 
an idealized fair system.


The thing that bugs me about this is the diminishing returns.  It looks like it 
will only give a substantial benefit when system load is somewhere between 1.0 
and 2.0.  On a heavily-loaded system, CFS will do the right thing within a good 
margin of error, and on an underloaded system, even a naive scheduler will do 
the right thing.  If you want to optimize smp fairness in this range, that's 
great, but there's probably a lighter-weight way to do it.



2. High performance overhead

Two sources of overhead: (1) the global rw_lock, and (2) task 
migrations. I agree they can be problems on NUMA, but I'd argue they are 
not on SMPs. Any global lock can cause two performance problems: (1) 
serialization, and (2) excessive remote cache accesses and traffic. IMO 
(1) is not a problem since this is a rw_lock and a write_lock occurs 
infrequently only when all tasks in the system finish the current round. 
(2) could be a problem as every read/write lock causes an invalidation. 
It could be improved by using Nick's ticket lock. On the other hand, 
this is a single cache line and it's invalidated only when a CPU 
finishes all tasks in its local active RB tree, where each nice 0 task 
takes sysctl_base_round_slice (e.g., 30ms) to finish, so it looks to me 
the invalidations would be infrequent enough and could be noise in the 
whole system.


Task migrations don't bother me all that much.  Since we're migrating the 
*overload*, I expect those processes to be fairly cache-cold whenever we get 
around to them anyway.  It'd be nice to be SMT/multicore/NU

Re: [RFC] scheduler: improve SMP fairness in CFS

2007-07-24 Thread Li, Tong N
On Tue, 2007-07-24 at 04:07 -0400, Chris Snook wrote:

> To clarify, I'm not suggesting that the "balance with cpu (x+1)%n only" 
> algorithm is the only way to do this.  Rather, I'm pointing out that 
> even an extremely simple algorithm can give you fair loading when you 
> already have CFS managing the runqueues.  There are countless more 
> sophisticated ways we could do this without using global locking, or 
> possibly without any locking at all, other than the locking we already 
> use during migration.
> 
>   -- Chris

Yes, as Andi and Chris also pointed out, I'll think about whether global
synchronization can be removed or relaxed.

Thanks,

  tong
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] scheduler: improve SMP fairness in CFS

2007-07-24 Thread Tong Li

On Mon, 23 Jul 2007, Chris Snook wrote:

This patch is massive overkill.  Maybe you're not seeing the overhead on your 
8-way box, but I bet we'd see it on a 4096-way NUMA box with a partially-RT 
workload.  Do you have any data justifying the need for this patch?


Doing anything globally is expensive, and should be avoided at all costs. 
The scheduler already rebalances when a CPU is idle, so you're really just 
rebalancing the overload here.  On a server workload, we don't necessarily 
want to do that, since the overload may be multiple threads spawned to 
service a single request, and could be sharing a lot of data.


Instead of an explicit system-wide fairness invariant (which will get very 
hard to enforce when you throw SCHED_FIFO processes into the mix and the 
scheduler isn't running on some CPUs), try a simpler invariant.  If we 
guarantee that the load on CPU X does not differ from the load on CPU (X+1)%N 
by more than some small constant, then we know that the system is fairly 
balanced.  We can achieve global fairness with local balancing, and avoid all 
this overhead.  This has the added advantage of keeping most of the 
migrations core/socket/node-local on SMT/multicore/NUMA systems.




Chris,

These are all good comments. Thanks. I see three concerns and I'll try to 
address each.


1. Unjustified effort/cost

My view is that fairness (or proportional fairness) is a first-order 
metric and necessary in many cases even at the cost of performance. A 
server running multiple client apps certainly doesn't want the clients to 
see that they are getting different amounts of service, assuming the 
clients are of equal importance (priority). When the clients have 
different priorities, the server also wants to give them service time 
proportional to their priority/weight. The same is true for desktops, 
where users want to nice tasks and see an effect that's consistent with 
what they expect, i.e., task CPU time should be proportional to their nice 
values. The point is that it's important to enforce fairness because it 
enables users to control the system in a deterministic way and it helps 
each task get good response time. CFS achieves this on local CPUs and this 
patch makes the support stronger for SMPs. It's overkill to enforce an 
unnecessary degree of fairness, but it is necessary to enforce an error 
bound, even if large, such that the user can reliably know what kind of 
CPU time (even performance) he'd get after making a nice value change. 
This patch ensures an error bound of (max task weight currently in system) 
* sysctl_base_round_slice compared to an idealized fair system.
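
(As a concrete example of this bound: it equals the round slice of the 
heaviest task, so if nothing in the system is heavier than nice 0 and 
sysctl_base_round_slice is the 30ms example value used below, no task 
should lag the idealized fair system by more than roughly 30ms of CPU 
time.)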


2. High performance overhead

Two sources of overhead: (1) the global rw_lock, and (2) task migrations. 
I agree they can be problems on NUMA, but I'd argue they are not on SMPs. 
Any global lock can cause two performance problems: (1) serialization, and 
(2) excessive remote cache accesses and traffic. IMO (1) is not a problem 
since this is a rw_lock and a write_lock occurs infrequently only when all 
tasks in the system finish the current round. (2) could be a problem as 
every read/write lock causes an invalidation. It could be improved by 
using Nick's ticket lock. On the other hand, this is a single cache line 
and it's invalidated only when a CPU finishes all tasks in its local 
active RB tree, where each nice 0 task takes sysctl_base_round_slice 
(e.g., 30ms) to finish, so it looks to me the invalidations would be 
infrequent enough and could be noise in the whole system.
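
To make the access pattern concrete, here is a minimal sketch of the
read-mostly locking described above. Illustration only, not the patch
itself: dwrr_highest_lock and the cfs_rq round field follow names visible
in the posted patch, while dwrr_highest_round and both helpers are assumed
for the example.

/*
 * Minimal sketch of the read-mostly locking pattern described above.
 * Illustration only -- not the patch itself.
 */
static DEFINE_RWLOCK(dwrr_highest_lock);
static u64 dwrr_highest_round;	/* highest round number in the system */

static int dwrr_in_highest_round(struct cfs_rq *cfs_rq)
{
	unsigned long flags;
	int is_current;

	/* common case: shared readers; the cache line only bounces
	 * when a writer updates it */
	read_lock_irqsave(&dwrr_highest_lock, flags);
	is_current = (cfs_rq->round >= dwrr_highest_round);
	read_unlock_irqrestore(&dwrr_highest_lock, flags);
	return is_current;
}

static void dwrr_advance_round(struct cfs_rq *cfs_rq)
{
	unsigned long flags;

	/* rare case: a CPU finished every task in its active tree
	 * and starts the next round */
	write_lock_irqsave(&dwrr_highest_lock, flags);
	cfs_rq->round++;
	if (cfs_rq->round > dwrr_highest_round)
		dwrr_highest_round = cfs_rq->round;
	write_unlock_irqrestore(&dwrr_highest_lock, flags);
}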


The patch can introduce more task migrations. I don't think it's a problem 
in SMPs. For one, CFS already is doing context switches more frequently 
than before, and thus even if a task doesn't migrate, it may still miss 
in the local cache because the previous task kicked out its data. In this 
sense, migration doesn't add more cache misses. Now, in NUMA, the penalty 
of a cache miss can be much higher if we migrate off-node. Here, I agree 
that this can be a problem. For NUMA, I'd expect the user to tune 
sysctl_base_round_slice to a large enough value to avoid frequent 
migrations (or just affinitize tasks). Benchmarking could also help us 
determine a reasonable default sysctl_base_round_slice for NUMA.


3. Hard to enforce fairness when there are non-SCHED_FAIR tasks

All we want is to enforce fairness among the SCHED_FAIR tasks. Leveraging 
CFS, this patch makes sure the relative CPU time of two SCHED_FAIR tasks in 
an SMP equals their weight ratio. In other words, if the SCHED_FAIR tasks as 
a whole are given X amount of CPU time, then a task with weight w is 
guaranteed a share of that X proportional to w, in any interval of time.


  tong
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] scheduler: improve SMP fairness in CFS

2007-07-24 Thread Andi Kleen
On Mon, Jul 23, 2007 at 03:25:12PM -0600, Chris Friesen wrote:
> Li, Tong N wrote:
> >On the other hand, if locking does
> >become a problem for certain systems/workloads, increasing
> >sysctl_base_round_slice can reduce the locking frequency and alleviate
> >the problem, at the cost of being relatively less fair across the CPUs. 
> 
> If locking does become a problem, it may make sense to switch instead to 
> a lockless algorithm of some sort...

That usually has the same issue with cache lines. It would be better
to analyze first if such global synchronization is really needed.
I hope not.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] scheduler: improve SMP fairness in CFS

2007-07-24 Thread Chris Snook

Chris Snook wrote:

Tong Li wrote:
This patch extends CFS to achieve better fairness for SMPs. For 
example, with 10 tasks (same priority) on 8 CPUs, it enables each task 
to receive equal CPU time (80%). The code works on top of CFS and 
provides SMP fairness at a coarser time grainularity; local on each 
CPU, it relies on CFS to provide fine-grained fairness and good 
interactivity.


The code is based on the distributed weighted round-robin (DWRR) 
algorithm. It keeps two RB trees on each CPU: one is the original 
cfs_rq, referred to as active, and one is a new cfs_rq, called 
round-expired. Each CPU keeps a round number, initially zero. The 
scheduler works exactly the same way as in CFS, but only runs tasks 
from the active tree. Each task is assigned a round slice, equal to 
its weight times a system constant (e.g., 100ms), controlled by 
sysctl_base_round_slice. When a task uses up its round slice, it moves 
to the round-expired tree on the same CPU and stops running. Thus, at 
any time on each CPU, the active tree contains all tasks that are 
running in the current round, while tasks in round-expired have all 
finished the current round and await the start of the next round. When an 
active tree becomes empty, it calls idle_balance() to grab tasks of 
the same round from other CPUs. If none can be moved over, it switches 
its active and round-expired trees, thus unleashing round-expired 
tasks and advancing the local round number by one. An invariant it 
maintains is that the round numbers of any two CPUs in the system 
differ by at most one. This property ensures fairness across CPUs. The 
variable sysctl_base_round_slice controls fairness-performance 
tradeoffs: a smaller value leads to better cross-CPU fairness at the 
potential cost of performance; on the other hand, the larger the value 
is, the closer the system behavior is to the default CFS without the 
patch.


Any comments and suggestions would be highly appreciated.


This patch is massive overkill.  Maybe you're not seeing the overhead on 
your 8-way box, but I bet we'd see it on a 4096-way NUMA box with a 
partially-RT workload.  Do you have any data justifying the need for 
this patch?


Doing anything globally is expensive, and should be avoided at all 
costs.  The scheduler already rebalances when a CPU is idle, so you're 
really just rebalancing the overload here.  On a server workload, we 
don't necessarily want to do that, since the overload may be multiple 
threads spawned to service a single request, and could be sharing a lot 
of data.


Instead of an explicit system-wide fairness invariant (which will get 
very hard to enforce when you throw SCHED_FIFO processes into the mix 
and the scheduler isn't running on some CPUs), try a simpler invariant.  
If we guarantee that the load on CPU X does not differ from the load on 
CPU (X+1)%N by more than some small constant, then we know that the 
system is fairly balanced.  We can achieve global fairness with local 
balancing, and avoid all this overhead.  This has the added advantage of 
keeping most of the migrations core/socket/node-local on 
SMT/multicore/NUMA systems.


-- Chris


To clarify, I'm not suggesting that the "balance with cpu (x+1)%n only" 
algorithm is the only way to do this.  Rather, I'm pointing out that 
even an extremely simple algorithm can give you fair loading when you 
already have CFS managing the runqueues.  There are countless more 
sophisticated ways we could do this without using global locking, or 
possibly without any locking at all, other than the locking we already 
use during migration.


-- Chris
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] scheduler: improve SMP fairness in CFS

2007-07-23 Thread Chris Snook

Tong Li wrote:
This patch extends CFS to achieve better fairness for SMPs. For example, 
with 10 tasks (same priority) on 8 CPUs, it enables each task to receive 
equal CPU time (80%). The code works on top of CFS and provides SMP 
fairness at a coarser time grainularity; local on each CPU, it relies on 
CFS to provide fine-grained fairness and good interactivity.


The code is based on the distributed weighted round-robin (DWRR) 
algorithm. It keeps two RB trees on each CPU: one is the original 
cfs_rq, referred to as active, and one is a new cfs_rq, called 
round-expired. Each CPU keeps a round number, initially zero. The 
scheduler works exactly the same way as in CFS, but only runs tasks from 
the active tree. Each task is assigned a round slice, equal to its 
weight times a system constant (e.g., 100ms), controlled by 
sysctl_base_round_slice. When a task uses up its round slice, it moves 
to the round-expired tree on the same CPU and stops running. Thus, at 
any time on each CPU, the active tree contains all tasks that are 
running in the current round, while tasks in round-expired have all 
finished the current round and await the start of the next round. When an 
active tree becomes empty, it calls idle_balance() to grab tasks of the 
same round from other CPUs. If none can be moved over, it switches its 
active and round-expired trees, thus unleashing round-expired tasks and 
advancing the local round number by one. An invariant it maintains is 
that the round numbers of any two CPUs in the system differ by at most 
one. This property ensures fairness across CPUs. The variable 
sysctl_base_round_slice controls fairness-performance tradeoffs: a 
smaller value leads to better cross-CPU fairness at the potential cost 
of performance; on the other hand, the larger the value is, the closer 
the system behavior is to the default CFS without the patch.


Any comments and suggestions would be highly appreciated.


This patch is massive overkill.  Maybe you're not seeing the overhead on your 
8-way box, but I bet we'd see it on a 4096-way NUMA box with a partially-RT 
workload.  Do you have any data justifying the need for this patch?


Doing anything globally is expensive, and should be avoided at all costs.  The 
scheduler already rebalances when a CPU is idle, so you're really just 
rebalancing the overload here.  On a server workload, we don't necessarily want 
to do that, since the overload may be multiple threads spawned to service a 
single request, and could be sharing a lot of data.


Instead of an explicit system-wide fairness invariant (which will get very hard 
to enforce when you throw SCHED_FIFO processes into the mix and the scheduler 
isn't running on some CPUs), try a simpler invariant.  If we guarantee that the 
load on CPU X does not differ from the load on CPU (X+1)%N by more than some 
small constant, then we know that the system is fairly balanced.  We can achieve 
global fairness with local balancing, and avoid all this overhead.  This has the 
added advantage of keeping most of the migrations core/socket/node-local on 
SMT/multicore/NUMA systems.
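
To make that concrete, a minimal sketch of the "load on CPU X vs. CPU
(X+1)%N" invariant described above. Illustration only, not a proposed
patch: cpu_load(), push_one_task(), pull_one_task() and FAIRNESS_SLACK are
placeholders, not real kernel interfaces.

/*
 * Sketch of ring-neighbour balancing: each CPU compares its load only
 * with CPU (x+1)%n and moves a task when they differ by more than a
 * small constant.  If every CPU stays within the slack of its
 * neighbour, the whole ring -- and hence the system -- is roughly
 * balanced.
 */
#define FAIRNESS_SLACK	2			/* "some small constant" */

extern long cpu_load(int cpu);			/* placeholder */
extern void push_one_task(int from, int to);	/* placeholder */
extern void pull_one_task(int to, int from);	/* placeholder */

static void ring_balance_tick(int this_cpu, int nr_cpus)
{
	int neighbour = (this_cpu + 1) % nr_cpus;
	long diff = cpu_load(this_cpu) - cpu_load(neighbour);

	if (diff > FAIRNESS_SLACK)
		push_one_task(this_cpu, neighbour);
	else if (diff < -FAIRNESS_SLACK)
		pull_one_task(this_cpu, neighbour);
}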


-- Chris
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] scheduler: improve SMP fairness in CFS

2007-07-23 Thread Chris Friesen

Li, Tong N wrote:

On the other hand, if locking does
become a problem for certain systems/workloads, increasing
sysctl_base_round_slice can reduce the locking frequency and alleviate
the problem, at the cost of being relatively less fair across the CPUs. 


If locking does become a problem, it may make sense to switch instead to 
a lockless algorithm of some sort...


Chris
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] scheduler: improve SMP fairness in CFS

2007-07-23 Thread Li, Tong N
I benchmarked an early version of this code (against 2.6.21) with
SPECjbb, SPEComp, kernbench, etc. on an 8-processor system, and didn't
see any slowdown compared to the stock scheduler. I'll generate the data
again with this version of the code. On the other hand, if locking does
become a problem for certain systems/workloads, increasing
sysctl_base_round_slice can reduce the locking frequency and alleviate
the problem, at the cost of being relatively less fair across the CPUs. 

  tong

On Mon, 2007-07-23 at 22:00 +0200, Andi Kleen wrote:
> Tong Li <[EMAIL PROTECTED]> writes:
> > +
> > +read_lock_irqsave(&dwrr_highest_lock, flags);
> 
> That lock looks like it would bounce between CPUs like crazy.
> Did you try any benchmarks on a larger system? 
> 
> -Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] scheduler: improve SMP fairness in CFS

2007-07-23 Thread Andi Kleen
Tong Li <[EMAIL PROTECTED]> writes:
> +
> +read_lock_irqsave(&dwrr_highest_lock, flags);

That lock looks like it would bounce between CPUs like crazy.
Did you try any benchmarks on a larger system? 

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC] scheduler: improve SMP fairness in CFS

2007-07-23 Thread Tong Li
This patch extends CFS to achieve better fairness for SMPs. For example, 
with 10 tasks (same priority) on 8 CPUs, it enables each task to receive 
equal CPU time (80%). The code works on top of CFS and provides SMP 
fairness at a coarser time granularity; locally on each CPU, it relies on 
CFS to provide fine-grained fairness and good interactivity.


The code is based on the distributed weighted round-robin (DWRR) 
algorithm. It keeps two RB trees on each CPU: one is the original cfs_rq, 
referred to as active, and one is a new cfs_rq, called round-expired. Each 
CPU keeps a round number, initially zero. The scheduler works exactly the 
same way as in CFS, but only runs tasks from the active tree. Each task is 
assigned a round slice, equal to its weight times a system constant (e.g., 
100ms), controlled by sysctl_base_round_slice. When a task uses up its 
round slice, it moves to the round-expired tree on the same CPU and stops 
running. Thus, at any time on each CPU, the active tree contains all tasks 
that are running in the current round, while tasks in round-expired have 
all finished the current round and await to start the next round. When an 
active tree becomes empty, it calls idle_balance() to grab tasks of the 
same round from other CPUs. If none can be moved over, it switches its 
active and round-expired trees, thus unleashing round-expired tasks and 
advancing the local round number by one. An invariant it maintains is that 
the round numbers of any two CPUs in the system differ by at most one. 
This property ensures fairness across CPUs. The variable 
sysctl_base_round_slice controls fairness-performance tradeoffs: a smaller 
value leads to better cross-CPU fairness at the potential cost of 
performance; on the other hand, the larger the value is, the closer the 
system behavior is to the default CFS without the patch.
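
For readers who prefer code to prose, here is a compressed sketch of the
per-CPU logic just described. Illustration only, not the patch itself:
struct dwrr_cpu and the three helper prototypes are placeholder names;
round_slice_used is the field the patch adds to sched_entity. (The 80%
figure above is simply 10 equal-weight tasks sharing 8 CPUs fairly, i.e.
8/10 of a CPU each.)

/*
 * Compressed sketch of the per-CPU DWRR logic described above.
 * Illustration only -- simplified from the actual patch.
 */
struct dwrr_cpu {
	struct cfs_rq	*active;	/* tasks still running the current round */
	struct cfs_rq	*round_expired;	/* tasks that finished the current round */
	u64		round;		/* differs across CPUs by at most one    */
};

extern u64 dwrr_round_slice(struct sched_entity *se);	/* weight * sysctl_base_round_slice */
extern void move_to_tree(struct sched_entity *se, struct cfs_rq *tree);
extern int idle_balance_same_round(struct dwrr_cpu *dc);

static void dwrr_task_tick(struct dwrr_cpu *dc, struct sched_entity *se)
{
	/* a task that used up its round slice stops running for this round */
	if (se->round_slice_used >= dwrr_round_slice(se))
		move_to_tree(se, dc->round_expired);
}

static void dwrr_active_tree_empty(struct dwrr_cpu *dc)
{
	struct cfs_rq *tmp;

	/* first try to grab tasks of the same round from other CPUs */
	if (idle_balance_same_round(dc))
		return;

	/* nothing to pull: switch the trees, unleashing the round-expired
	 * tasks, and advance the local round number by one */
	tmp = dc->active;
	dc->active = dc->round_expired;
	dc->round_expired = tmp;
	dc->round++;
}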


Any comments and suggestions would be highly appreciated.

Thanks,

  tong

Signed-off-by: Tong Li <[EMAIL PROTECTED]>
---
 include/linux/sched.h |5
 kernel/sched.c | 577 ++
 kernel/sched_debug.c  |8
 kernel/sched_fair.c   |  124 --
 kernel/sysctl.c   |   13 +
 5 files changed, 604 insertions(+), 123 deletions(-)

diff -uprN linux-2.6.23-rc1/include/linux/sched.h 
linux-2.6.23-rc1-dwrr/include/linux/sched.h
--- linux-2.6.23-rc1/include/linux/sched.h  2007-07-22 18:42:28.0 
-0700
+++ linux-2.6.23-rc1-dwrr/include/linux/sched.h 2007-07-22 18:52:34.0 
-0700
@@ -887,7 +887,7 @@ struct sched_entity {
s64 fair_key;
struct load_weight  load;   /* for load-balancing */
struct rb_node  run_node;
-   unsigned int    on_rq;
+   struct cfs_rq   *on_rq; /* NULL, active, or round-expired */

u64 wait_start_fair;
u64 wait_start;
@@ -906,6 +906,8 @@ struct sched_entity {
s64 sum_sleep_runtime;
unsigned long   wait_runtime_overruns;
unsigned long   wait_runtime_underruns;
+/* How long it has run in the current round. */
+u64 round_slice_used;
 #ifdef CONFIG_FAIR_GROUP_SCHED
struct sched_entity *parent;
/* rq on which this entity is (to be) queued: */
@@ -1382,6 +1384,7 @@ extern unsigned int sysctl_sched_stat_gr
 extern unsigned int sysctl_sched_runtime_limit;
 extern unsigned int sysctl_sched_child_runs_first;
 extern unsigned int sysctl_sched_features;
+extern u64 sysctl_sched_base_round_slice;

 #ifdef CONFIG_RT_MUTEXES
 extern int rt_mutex_getprio(struct task_struct *p);
diff -uprN linux-2.6.23-rc1/kernel/sched.c linux-2.6.23-rc1-dwrr/kernel/sched.c
--- linux-2.6.23-rc1/kernel/sched.c 2007-07-22 18:42:28.0 -0700
+++ linux-2.6.23-rc1-dwrr/kernel/sched.c2007-07-22 23:05:54.0 
-0700
@@ -184,17 +184,20 @@ struct cfs_rq {
u64 exec_clock;
s64 wait_runtime;
u64 sleeper_bonus;
+u64 round;  /* round number of this cfs_rq */
unsigned long wait_runtime_overruns, wait_runtime_underruns;
+u64 wait_start_fair, wait_start; /* Used to track wait times of
+round-expired tasks. */

struct rb_root tasks_timeline;
struct rb_node *rb_leftmost;
struct rb_node *rb_load_balance_curr;
+   struct rq *rq;  /* cpu runqueue to which this cfs_rq is attached */
 #ifdef CONFIG_FAIR_GROUP_SCHED
/* 'curr' points to currently running entity on this cfs_rq.
 * It is set to NULL otherwise (i.e when none are currently running).
 */
struct sched_entity *curr;
-   struct rq *rq;  /* cpu runqueue to which this cfs_rq is attached */

/* leaf cfs_rqs are those that hold tasks (lowest schedulable entity in
 * a hierarchy). Non-leaf lrqs hold other higher schedulable entities
@@ -239,7 +24