Re: [RFC] Extend Linux to support proportional-share scheduling

2007-06-06 Thread Bill Davidsen

Willy Tarreau wrote:

On Tue, Jun 05, 2007 at 09:31:33PM -0700, Li, Tong N wrote:
  

Willy,

These are all good comments. Regarding the cache penalty, I've done some
measurements using benchmarks like SPEC OMP on an 8-processor SMP, and
the performance with this patch was nearly identical to that of the
mainline. I'm sure some apps may suffer from the potentially larger
number of migrations under this design. In the end, I think what we want
is to balance fairness and performance. This design currently emphasizes
fairness, but it could be changed to relax fairness when performance
does become an issue (this could even be a user-tunable knob, depending
on which aspect the user cares about more).



Maybe storing in each task a small list of the last 2 or 4 CPUs it used
would help the scheduler place it. I mean, let's say you have 10 tasks
and 8 CPUs. You first assign tasks 1..8 to CPUs 1..8 for one timeslice.
Then you give 9..10 a run on CPUs 1..2, and CPUs 3..8 become usable for
other tasks. It will be optimal to run tasks 3..8 on them. Then you will
stop some of those because they are "in advance", and run 9..10 and 1..2
again. You'll have to switch 1..2 to another group of CPUs to keep the
caches of CPUs 1..2 hot for tasks 9..10. But another possibility would
be to consider that 9..10 and 1..2 have performed the same amount of
work, so let 9..10 take some advance and benefit from the hot cache,
then try to place 1..2 there again. But that would mean 3..8 have now
run 2 timeslices more than the others; at that point, it would be wise
to make them sleep and keep their CPU history for future use.

Maybe on end-user systems the CPU history is not that important because
the caches are often small, but on high-end systems with large L2/L3
caches, I think we can often keep several tasks in the cache, justifying
the ability to select one of the last CPUs used.

  
CPU affinity to preserve cache is a very delicate balance. It makes 
sense to try to run a process on the same CPU, but even a few ms of 
running some other process can be enough to refill the cache with new 
contents (depending on what it does, obviously), so a long delay in 
running a process just to get it on the "right" CPU is not always a 
saving; the benefit of using the previous CPU fades rapidly.


An omniscient scheduler would keep a count of the pages evicted from 
cache as process A runs and deduct that from the affinity of process B, 
which previously ran on the same CPU, then make a perfect decision about 
when it's better to migrate the task and how far. Since the schedulers 
now being advanced are "fair" rather than "perfect," everyone is making 
educated guesses about the optimal process migration policy: migrate all 
threads to improve cache hits vs. spread them out to better run threads 
in parallel, etc.
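A back-of-the-envelope sketch of that bookkeeping (purely illustrative;
nothing like this exists in the patch or in mainline, and the eviction
counts would need hardware event counters to obtain at all):

/* Illustrative only: estimate how much of task B's working set is
 * still cached on a CPU by charging it for evictions that happened
 * while other tasks ran there. */
struct cache_affinity {
	unsigned long warm_pages;	/* estimated pages of B still cached */
};

/* Called as other tasks run on B's previous CPU. */
static void charge_evictions(struct cache_affinity *a, unsigned long evicted)
{
	if (evicted >= a->warm_pages)
		a->warm_pages = 0;	/* cache effectively cold */
	else
		a->warm_pages -= evicted;
}

/* Migrating "wins" when refilling what is still warm would cost
 * less than waiting for the previous CPU to become free. */
static int migration_wins(const struct cache_affinity *a,
			  unsigned long refill_cost_per_page,
			  unsigned long expected_wait_cost)
{
	return a->warm_pages * refill_cost_per_page < expected_wait_cost;
}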


For a desktop I want a scheduler which "doesn't suck" at the things I do 
regularly. For a server I'm more concerned with overall tps than the 
latency of one transaction. Most users would trade a slowdown in kernel 
compiles for being able to watch YouTube while the compile runs, and 
conversely people with heavily loaded servers would usually trade a 
slower transaction for more of them per second. Obviously within 
reason... what people will tolerate is a bounded value.

Not an easy thing to do, but probably very complementary to your work IMHO.
  

Agree, not easy at all.

--
bill davidsen <[EMAIL PROTECTED]>
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979



Re: [RFC] Extend Linux to support proportional-share scheduling

2007-06-05 Thread Willy Tarreau
On Tue, Jun 05, 2007 at 09:31:33PM -0700, Li, Tong N wrote:
> Willy,
> 
> These are all good comments. Regarding the cache penalty, I've done some
> measurements using benchmarks like SPEC OMP on an 8-processor SMP, and
> the performance with this patch was nearly identical to that of the
> mainline. I'm sure some apps may suffer from the potentially larger
> number of migrations under this design. In the end, I think what we want
> is to balance fairness and performance. This design currently emphasizes
> fairness, but it could be changed to relax fairness when performance
> does become an issue (this could even be a user-tunable knob, depending
> on which aspect the user cares about more).

Maybe storing in each task a small list of the last 2 or 4 CPUs it used
would help the scheduler place it. I mean, let's say you have 10 tasks
and 8 CPUs. You first assign tasks 1..8 to CPUs 1..8 for one timeslice.
Then you give 9..10 a run on CPUs 1..2, and CPUs 3..8 become usable for
other tasks. It will be optimal to run tasks 3..8 on them. Then you will
stop some of those because they are "in advance", and run 9..10 and 1..2
again. You'll have to switch 1..2 to another group of CPUs to keep the
caches of CPUs 1..2 hot for tasks 9..10. But another possibility would
be to consider that 9..10 and 1..2 have performed the same amount of
work, so let 9..10 take some advance and benefit from the hot cache,
then try to place 1..2 there again. But that would mean 3..8 have now
run 2 timeslices more than the others; at that point, it would be wise
to make them sleep and keep their CPU history for future use.

Maybe on end-user systems the CPU history is not that important because
the caches are often small, but on high-end systems with large L2/L3
caches, I think we can often keep several tasks in the cache, justifying
the ability to select one of the last CPUs used.
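The per-task history could be as small as this (a sketch with invented
names, just to make the idea concrete; it is not from any posted patch):

/* Remember the last few CPUs a task ran on, most recent first.
 * Entries are initialized to -1. */
#define CPU_HISTORY_LEN 4

struct cpu_history {
	int cpus[CPU_HISTORY_LEN];
};

static void record_cpu(struct cpu_history *h, int cpu)
{
	int i;

	if (h->cpus[0] == cpu)		/* already the most recent */
		return;
	for (i = CPU_HISTORY_LEN - 1; i > 0; i--)
		h->cpus[i] = h->cpus[i - 1];
	h->cpus[0] = cpu;
}

/* Prefer the most recently used CPU that is idle now; otherwise
 * fall back to whatever the load balancer picked. */
static int place_task(const struct cpu_history *h,
		      int (*cpu_is_idle)(int cpu), int fallback)
{
	int i;

	for (i = 0; i < CPU_HISTORY_LEN; i++)
		if (h->cpus[i] >= 0 && cpu_is_idle(h->cpus[i]))
			return h->cpus[i];
	return fallback;
}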

Not an easy thing to do, but probably very complementary to your work IMHO.

Regards,
Willy



RE: [RFC] Extend Linux to support proportional-share scheduling

2007-06-05 Thread Li, Tong N
Willy,

These are all good comments. Regarding the cache penalty, I've done some
measurements using benchmarks like SPEC OMP on an 8-processor SMP, and
the performance with this patch was nearly identical to that of the
mainline. I'm sure some apps may suffer from the potentially larger
number of migrations under this design. In the end, I think what we want
is to balance fairness and performance. This design currently emphasizes
fairness, but it could be changed to relax fairness when performance
does become an issue (this could even be a user-tunable knob, depending
on which aspect the user cares about more).

Thanks,

  tong

> -----Original Message-----
> From: Willy Tarreau [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, June 05, 2007 8:33 PM
> To: Li, Tong N
> Cc: linux-kernel@vger.kernel.org; Ingo Molnar; Con Kolivas; Linus Torvalds;
> Arjan van de Ven; Siddha, Suresh B; Barnes, Jesse; William Lee Irwin III;
> Bill Huey (hui); [EMAIL PROTECTED]; [EMAIL PROTECTED]; Nick Piggin;
> Bill Davidsen; John Kingman; Peter Williams; [EMAIL PROTECTED]
> Subject: Re: [RFC] Extend Linux to support proportional-share scheduling
> 
> Hi Tong,
> 
> On Tue, Jun 05, 2007 at 06:56:17PM -0700, Li, Tong N wrote:
> > Hi all,
> >
> > I've ported my code to mainline 2.6.21.3. You can get it at
> > http://www.cs.duke.edu/~tongli/linux/.
> 
> As much as possible, you should post your patch for others to comment
> on it. Posting just a URL is often fine to inform people that there's
> an update to *try*, but at this stage, it may be more important to
> comment on your design and code than trying it.
> 
> [...]
> 
> > Trio has two unique features: (1) it enables users to control shares of
> > CPU time for any thread or group of threads (e.g., a process, an
> > application, etc.), and (2) it enables fair sharing of CPU time across
> > multiple CPUs. For example, with ten tasks running on eight CPUs, Trio
> > allows each task to take an equal fraction of the total CPU time,
> 
> While this looks interesting, doesn't it make threads jump to random
> CPUs all the time, thus reducing cache efficiency? Or maybe it would
> be good to consider two or three criteria to group CPUs:
>   - those which share the same caches (multi-core)
>   - those which share the same local memory on the same mainboard
>     (multi-socket)
>   - those which are so far away from each other that it's really
>     not worth migrating a task
> 
> > whereas no existing scheduler achieves such fairness. These features
> > enable Trio to complement the mainline scheduler and other proposals
> > such as CFS and SD to enable greater user flexibility and stronger
> > fairness.
> 
> Right now, I think that only benchmarks could tell which design is
> better. I understand that running 10 tasks on 8 CPUs may result in
> the last batch involving only 2 CPUs with 1 task each, thus increasing
> the overall wall time. But maybe cache thrashing between CPUs will
> also increase the wall time.
> 
> Regards,
> Willy


Re: [RFC] Extend Linux to support proportional-share scheduling

2007-06-05 Thread Willy Tarreau
Hi Tong,

On Tue, Jun 05, 2007 at 06:56:17PM -0700, Li, Tong N wrote:
> Hi all,
> 
> I've ported my code to mainline 2.6.21.3. You can get it at
> http://www.cs.duke.edu/~tongli/linux/.

As much as possible, you should post your patch for others to comment
on it. Posting just a URL is often fine to inform people that there's
an update to *try*, but at this stage, it may be more important to
comment on your design and code than trying it.

[...]

> Trio has two unique features: (1) it enables users to control shares of
> CPU time for any thread or group of threads (e.g., a process, an
> application, etc.), and (2) it enables fair sharing of CPU time across
> multiple CPUs. For example, with ten tasks running on eight CPUs, Trio
> allows each task to take an equal fraction of the total CPU time,

While this looks interesting, doesn't it make threads jump to random
CPUs all the time, thus reducing cache efficiency? Or maybe it would
be good to consider two or three criteria to group CPUs:
  - those which share the same caches (multi-core)
  - those which share the same local memory on the same mainboard
    (multi-socket)
  - those which are so far away from each other that it's really
    not worth migrating a task
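A minimal sketch of those three tiers (illustrative only; mainline
already encodes this kind of topology information in its scheduling
domains, and the names here are invented):

/* Classify how costly migrating a task between two CPUs would be,
 * from cheapest (shared cache) to dearest (remote node). */
enum migration_cost {
	MIG_SHARED_CACHE,	/* same package, shared L2/L3 */
	MIG_SAME_NODE,		/* same socket / local memory */
	MIG_REMOTE_NODE,	/* far enough that migration rarely pays */
};

static enum migration_cost classify(int cpu_a, int cpu_b,
				    int (*same_cache)(int, int),
				    int (*same_node)(int, int))
{
	if (same_cache(cpu_a, cpu_b))
		return MIG_SHARED_CACHE;
	if (same_node(cpu_a, cpu_b))
		return MIG_SAME_NODE;
	return MIG_REMOTE_NODE;
}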

> whereas no existing scheduler achieves such fairness. These features
> enable Trio to complement the mainline scheduler and other proposals
> such as CFS and SD to enable greater user flexibility and stronger
> fairness.

Right now, I think that only benchmarks could tell which design is
better. I understand that running 10 tasks on 8 CPUs may result in
the last batch involving only 2 CPUs with 1 task each, thus increasing
the overall wall time. But maybe cache thrashing between CPUs will
also increase the wall time.

Regards,
Willy



[RFC] Extend Linux to support proportional-share scheduling

2007-06-05 Thread Li, Tong N
Hi all,

I've ported my code to mainline 2.6.21.3. You can get it at
http://www.cs.duke.edu/~tongli/linux/. As I said before, the intent of
the patch is not to compete with CFS and SD because the design relies on
the underlying scheduler for interactive performance. The goal here is
to present a complementary design that can ensure stronger MP fairness,
which I think is lacking in the existing proposals. Here's a brief
overview of the design (I call it Trio, for lack of a better name).
Any comments or suggestions will be highly appreciated.

Trio extends the existing Linux scheduler with support for
proportional-share scheduling. It uses a scheduling algorithm, called
Distributed Weighted Round-Robin (DWRR), which retains the existing
scheduler design as much as possible, and extends it to achieve
proportional fairness with O(1) time complexity and a constant error
bound, compared to the ideal fair scheduling algorithm. The goal of Trio
is not to improve interactive performance; rather, it relies on the
existing scheduler for interactivity and extends it to support MP
proportional fairness.

Trio has two unique features: (1) it enables users to control shares of
CPU time for any thread or group of threads (e.g., a process, an
application, etc.), and (2) it enables fair sharing of CPU time across
multiple CPUs. For example, with ten tasks running on eight CPUs, Trio
allows each task to take an equal fraction of the total CPU time,
whereas no existing scheduler achieves such fairness. These features
enable Trio to complement the mainline scheduler and other proposals
such as CFS and SD to enable greater user flexibility and stronger
fairness.
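
(Concretely: ten equal-weight tasks on eight CPUs should each receive
8/10 = 80% of one CPU over any sufficiently long interval. A purely
per-CPU scheduler that statically placed two tasks on each of two CPUs
would instead give four tasks 50% of a CPU and six tasks 100%.)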

  tong


Re: [RFC] Extend Linux to support proportional-share scheduling

2007-04-20 Thread William Lee Irwin III
On Fri, Apr 20, 2007 at 11:30:04AM -0700, Tong Li wrote:
> This patch extends the existing Linux scheduler with support for
> proportional-share scheduling (as a new KConfig option).
> http://www.cs.duke.edu/~tongli/linux/linux-2.6.19.2-trio.patch
> It uses a scheduling algorithm, called Distributed Weighted Round-Robin 
> (DWRR), which retains the existing scheduler design as much as possible, 
> and extends it to achieve proportional fairness with O(1) time complexity 
> and a constant error bound, compared to the ideal fair scheduling 
> algorithm. The code is by no means final and has only been tested on a 
> four-processor dual-core x86-64 system. Rather than focusing on coding 
> issues, the intent of this RFC is to invite discussions on the proposed 
> DWRR algorithm and proportional-share scheduling in general.

Very nice. I think we need this kind of functionality in mainline.


-- wli


[RFC] Extend Linux to support proportional-share scheduling

2007-04-20 Thread Tong Li

This patch extends the existing Linux scheduler with support for
proportional-share scheduling (as a new KConfig option).

http://www.cs.duke.edu/~tongli/linux/linux-2.6.19.2-trio.patch

It uses a scheduling algorithm, called Distributed Weighted Round-Robin 
(DWRR), which retains the existing scheduler design as much as possible, 
and extends it to achieve proportional fairness with O(1) time complexity 
and a constant error bound, compared to the ideal fair scheduling 
algorithm. The code is by no means final and has only been tested on a 
four-processor dual-core x86-64 system. Rather than focusing on coding 
issues, the intent of this RFC is to invite discussions on the proposed 
DWRR algorithm and proportional-share scheduling in general.


Background:

Over the years, there has been a lot of criticism that conventional Unix 
priorities and the nice interface provide insufficient support for users 
to accurately control CPU shares of different threads or applications. 
Many have studied scheduling algorithms that achieve proportional 
fairness. Assuming that each thread has a weight that expresses its 
desired CPU share, informally, a scheduler achieves ideal proportional 
fairness if (1) it is work-conserving, and (2) it allocates CPU time to 
threads in exact proportion to their weights in any time interval. Ideal 
proportional fairness is impossible since it requires that all runnable 
threads be running simultaneously and scheduled with infinitesimally small 
quanta. In practice, every proportional-share scheduling algorithm 
approximates the ideal algorithm with the goal of achieving a constant 
error bound. For more theoretical background, please refer to the 
following papers:


[1] A. K. Parekh and R. G. Gallager. A generalized processor sharing 
approach to flow control in integrated services networks: The single-node 
case. IEEE/ACM Transactions on Networking, 1(3):344-357, June 1993.


[2] J. C. R. Bennett and H. Zhang. WF2Q: Worst-case fair weighted fair 
queueing. In Proceedings of IEEE INFOCOM '96, pages 120-128, Mar. 1996.
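
For concreteness, the bound those papers formalize can be restated as
follows (the restatement is mine, not part of the original message): let
w_i be the weight of thread i and S_i(t1, t2) the CPU time it receives
while continuously runnable in [t1, t2]. On P CPUs, assuming no thread's
ideal share exceeds one full CPU, a scheduler achieves a constant error
bound C if, for every such thread and every interval,

    | S_i(t1, t2) - (w_i / sum_j w_j) * P * (t2 - t1) | <= C.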


Previous proportional-share scheduling algorithms, however, suffer one or 
more of the following problems:


(1) Inaccurate fairness with non-constant error bounds;
(2) High run-time overhead (e.g., logarithmic);
(3) Poor scalability due to the use of global run queues;
(4) Inefficient support for latency-sensitive applications.

Since the Linux scheduler has been successful at avoiding problems (2) to (4), 
this design attempts to extend it with support for accurate proportional 
fairness while retaining all of its existing benefits.


User Interface:

By default, each thread is assigned a weight proportional to its static 
priority. A set of system calls also allows users to specify a weight or 
reservation for any thread. Weights are relative. For example, for two 
threads with weights 3 and 1, the scheduler ensures that the ratio of 
their CPU time is 3:1. Reservations are absolute and in the form of X% of 
the total CPU time. For example, a reservation of 80% for a thread means 
that the thread always receives at least 80% of the total CPU time 
regardless of other threads.


The system calls also support specifying weights or reservations for groups of
threads. For example, one can specify an 80% reservation for a group of
threads (e.g., a process) to control the total CPU share to which the member
threads are collectively entitled.  Within the group, the user can further
specify local weights to different threads to control their relative shares.
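
As a sketch, user code might drive such an interface like this (the
wrapper names and signatures below are invented for illustration; the
real interface is defined only in the posted patch):

#include <sys/types.h>
#include <unistd.h>

/* Hypothetical wrappers around the patch's system calls. */
extern int trio_set_weight(pid_t tid, int weight);        /* relative */
extern int trio_set_reservation(pid_t pid, int percent);  /* absolute */
extern int trio_set_local_weight(pid_t tid, int weight);  /* in group */

static void example(pid_t t1, pid_t t2)
{
	/* Relative weights: t1 gets 3x the CPU time of t2 over time. */
	trio_set_weight(t1, 3);
	trio_set_weight(t2, 1);

	/* Absolute reservation: this process always receives at
	 * least 80% of the total CPU time, regardless of other tasks. */
	trio_set_reservation(getpid(), 80);

	/* Local weights: split the group's share 2:1 between members. */
	trio_set_local_weight(t1, 2);
	trio_set_local_weight(t2, 1);
}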

Scheduling Algorithm:

The scheduler keeps a set of data structures, called Trio groups, to maintain 
the weight or reservation of each thread group (including one or more 
threads) and the local weight of each member thread. When scheduling a 
thread, it consults these data structures and computes (in constant time) 
a system-wide weight for the thread that represents an equivalent CPU 
share. Consequently, the scheduling algorithm, DWRR, operates solely based 
on the system-wide weight (or weight for short, hereafter) of each thread.
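
The message does not give the formula, but one plausible constant-time
mapping (a guess of mine, not the patch's) is the group's entitlement
scaled by the thread's local share within the group:

/* Guessed illustration: a thread's system-wide weight is its
 * group's weight scaled by the thread's fraction of the group's
 * total local weight. */
static int system_weight(int group_weight, int local_weight,
			 int group_local_weight_sum)
{
	return group_weight * local_weight / group_local_weight_sum;
}

With a group weight of 80 and local weights 2 and 1, the two member
threads would get system-wide weights 53 and 26 (integer division).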


For each processor, besides the existing active and expired arrays, DWRR 
keeps one more array, called round-expired. It also keeps a round number 
for each processor, initially all zero. A thread is said to be in round R 
if it is in the active or expired array of a round-R processor. For each 
thread, DWRR associates it with a round slice, equal to its weight 
multiplied by a scaling factor, which controls the total time that the 
thread can run in any round. When a thread exhausts its time slice, as in 
the existing scheduler, DWRR moves it to the expired array. However, when 
it exhausts its round slice, DWRR moves it to the round-expired array, 
indicating that the thread has finished round R. In this way, all threads 
in the active and expired arrays on a round-R processor are running in 
round R, while the threads in the round-expired array have finished round 
R and are waiting to start round R+1.
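
To make the description above concrete, here is a condensed sketch of
the per-CPU bookkeeping (all names are illustrative, not the patch's;
the round-advance rule is inferred, since the archived message is cut
off before describing it):

struct prio_array;	/* stand-in for mainline's O(1) arrays */
struct task;		/* stand-in for task_struct */

extern void enqueue(struct prio_array *q, struct task *p);
extern int queue_empty(const struct prio_array *q);

struct dwrr_rq {
	struct prio_array *active;
	struct prio_array *expired;
	struct prio_array *round_expired;	/* done with round R */
	unsigned long round;			/* R: round this CPU runs */
};

/* Round slice = weight * scaling factor: the total time a thread
 * may run within one round. */
static unsigned long round_slice(int weight, unsigned long scale)
{
	return (unsigned long)weight * scale;
}

/* On slice expiry: an exhausted time slice goes to expired, as in
 * mainline; an exhausted round slice goes to round-expired. */
static void dwrr_slice_expired(struct dwrr_rq *rq, struct task *p,
			       int round_slice_exhausted)
{
	if (round_slice_exhausted)
		enqueue(rq->round_expired, p);
	else
		enqueue(rq->expired, p);
}

/* Inferred: once active and expired are both empty, every local
 * thread has finished round R, so the CPU advances to round R+1
 * and the round-expired threads become the new active set. */
static void dwrr_advance_round(struct dwrr_rq *rq)
{
	struct prio_array *tmp;

	if (!queue_empty(rq->active) || !queue_empty(rq->expired))
		return;
	tmp = rq->active;
	rq->active = rq->round_expired;
	rq->round_expired = tmp;
	rq->round++;
}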
