Re: wake_up vs. wake_up_sync

2001-06-27 Thread Hubertus Franke



Manfred,

Calling this a BUG is misleading. It is OK to be occasionally wrong about
the preemption priority as long as RT tasks are not involved, because
PROC_CHANGE_PENALTY is already applied, which in itself allows for some
priority inversion.

Hubertus Franke
email: [EMAIL PROTECTED]
(w) 914-945-2003(fax) 914-945-4425   TL: 862-2003



Manfred Spraul <[EMAIL PROTECTED]>@vger.kernel.org on 06/27/2001
06:41:29 PM

Sent by:  [EMAIL PROTECTED]


To:   Mike Kravetz <[EMAIL PROTECTED]>
cc:   Scott Long <[EMAIL PROTECTED]>, [EMAIL PROTECTED]
Subject:  Re: wake_up vs. wake_up_sync



Mike Kravetz wrote:
>
> On Wed, Jun 27, 2001 at 11:22:19PM +0200, Manfred Spraul wrote:
> > > Why would you want to prevent
> > > reschedule_idle()?
> > >
> > If one process runs, wakes up another process and _knows_ that it's
> > going to sleep immediately after the wake_up it doesn't need the
> > reschedule_idle: the current cpu will be idle soon, the scheduler
> > doesn't need to find another cpu for the woken up thread.
>
> I'm curious.  How does the caller of wake_up_sync know that the
> current cpu will soon be idle.  Does it assume that there are no
> other tasks on the runqueue waiting for a CPU?  If there are other
> tasks on the runqueue, isn't it possible that another task has a
> higher goodness value than the task being awakened.  In such a case,
> isn't is possible that the awakened task could sit on the runqueue
> (waiting for a CPU) while tasks with a lower goodness value are
> allowed to run?
>

I found one combination where that could happen:

process.thread
A.1: highest priority, runs on cpu0
B.1: lowest priority, runs on cpu1
A.2: another thread of process A, priority
PROC_CHANGE_PENALTY+PRIORITY(B.1)+1, sleeping.
B.2: same priority as A.2, sleeping, same process as B.1

A.1:
{
 wake_up("A.2");
 /* nothing happens: preemption_goodness is 0 since B.1 has both
  * PROC_CHANGE_PENALTY and the += 1 from 'same mm'
  */
 wake_up_sync("B.2");
 schedule();
 /* schedule selects A.2 instead of B.2 due to the += 1 from 'same mm'.
  * BUG: B.2 should replace B.1 on cpu1. The preemption_goodness is 1.
  */
}

IMHO obscure and very rare.

But I just found a bigger problem:
If wake_up_sync wakes up more than one process, then CPUs could remain in
cpu_idle() even though runnable processes are sitting on the runqueue
without a CPU.
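
For reference, here is a minimal sketch (mine, not from the original mail)
of the wake-then-sleep hand-off pattern that wake_up_sync is intended for,
assuming the 2.4 wait-queue API; produce_one_item() is a placeholder:

 static DECLARE_WAIT_QUEUE_HEAD(consumer_wq);

 void producer_hand_off(void)
 {
  produce_one_item();   /* placeholder for the actual work */

  /* _sync: skip reschedule_idle(), we are about to give up this CPU */
  wake_up_sync(&consumer_wq);

  set_current_state(TASK_INTERRUPTIBLE);
  schedule();   /* this CPU blocks, so the woken task can run here */
 }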

--
 Manfred



Re: threading question

2001-06-13 Thread Hubertus Franke



>I got that response too. When I pressed kernel people for details it turns
>out that they think having hundreds of runnable threads/processes (mostly
>the same thing under Linux) is wasteful. The scheduler is just not
>optimised for that.

Try out the http://lse.sourceforge.net/scheduling patches. The MQ kernel
scheduler sure can handle this kind of load :-)

Hubertus Franke
Enterprise Linux Group (Mgr),  Linux Technology Center (Member Scalability)
, OS-PIC (Chair)
email: [EMAIL PROTECTED]
(w) 914-945-2003(fax) 914-945-4425   TL: 862-2003



bert hubert <[EMAIL PROTECTED]>@vger.kernel.org on 06/13/2001 01:31:39 PM

Sent by:  [EMAIL PROTECTED]


To:   [EMAIL PROTECTED]
cc:
Subject:  Re: threading question



On Tue, Jun 12, 2001 at 12:06:40PM -0700, Kip Macy wrote:
> This may sound like flamebait, but its not. Linux threads are basically
> just processes that share the same address space. Their performance is
> measurably worse than it is on most commercial Unixes and FreeBSD.

Thread creation may be a bit slow. But the kludges to provide posix threads
completely from userspace also hurt. Notably, they do not scale over
multiple CPUs.

> They are not, or at least two years ago, were not POSIX compliant
> (they behaved badly with respect to signals). The impoverished

POSIX threads are silly with respect to signals. I do almost all my
programming these days with pthreads and I find that I really do not miss
signals at all.

> from Larry McVoy's home page attributed to Alan Cox illustrates this
> reasonably well: "A computer is a state machine. Threads are for people
> who can't program state machines." Sorry for not being more helpful.

I got that response too. When I pressed kernel people for details it turns
out that they think having hundreds of runnable threads/processes (mostly
the same thing under Linux) is wasteful. The scheduler is just not
optimised for that.

Regards,

bert

--
http://www.PowerDNS.com  Versatile DNS Services
Trilab   The Technology People
'SYN! .. SYN|ACK! .. ACK!' - the mating call of the internet



Re: 2.4.4 sluggish under fork load

2001-05-03 Thread Hubertus Franke


I pointed that out to the person who proposed this and gave him a fix that
ensures that the child's counter is at least 2 higher than the parent's.

Giving the child everything and the parent nothing is totally bogus: the
parent essentially has to wait for a recalculate.
This so-called fix has to go in the next release.


Hubertus Franke
Enterprise Linux Group (Mgr),  Linux Technology Center (Member Scalability)
, OS-PIC (Chair)
email: [EMAIL PROTECTED]
(w) 914-945-2003(fax) 914-945-4425   TL: 862-2003



"Adam J. Richter" <[EMAIL PROTECTED]>@vger.kernel.org on 05/01/2001
12:18:10 AM

Sent by:  [EMAIL PROTECTED]


To:   [EMAIL PROTECTED]
cc:   [EMAIL PROTECTED]
Subject:  Re: 2.4.4 sluggish under fork load



>The fact that 2.4.4 gives the whole timeslice to the child
>is just bogus to begin with.

I only did that because I could not find another way
to make the child run first that worked in practice.  I tried
other things before that.  Since Peter Osterlund's SCHED_YIELD
thing works, we no longer have to give all of the CPU to the
child.  The scheduler time slices are currently enormous, so as
long as the child gets even one clock tick before the parent runs,
it should reach the exec() if that is its plan.  1 tick = 10ms = 10
million cycles on a 1GHz CPU, which should be enough time to encrypt
my /boot/vmlinux in twofish if it's in RAM.

Adam J. Richter __ __   4880 Stevens Creek Blvd, Suite 104
[EMAIL PROTECTED] \ /  San Jose, California 95129-1034
+1 408 261-6630 | g g d r a s i l   United States of America
fax +1 408 261-6631  "Free Software For The Rest Of Us."



Re: PATCH(?): linux-2.4.4-pre2: fork should run child first

2001-04-13 Thread Hubertus Franke



First, you are wrong to assume that setting current->counter=0
will guarantee that the child runs first. On SMP it only means
that this might initiate another recalculate, after which the parent
could continue to run in parallel with the child.

Actually, quickly looking at the source code here:
you don't have to call schedule() at all.
A bit further down wake_up_process(p) is called, which in turn
calls reschedule_idle(p), so there is no need to call schedule().

If the condition

 (preemption_goodness(current, p, p->processor) > 1)

is satisfied, then the child should run
[ with (child == p) and (parent == current) ].

This is for the uniprocessor case; on SMP both could continue
to run. Looking at the goodness computation: since p->mm == current->mm,
p->nice == current->nice and p->processor == current->processor,
all that matters is the difference in the counter values.
My proposed patch always yields a difference of 1, which of course doesn't
have the desired effect.


Here is a patch that always yields a difference of 2. However, for odd
values of current->counter it loses a token between the two.

 {
  long parcnt = current->counter;

  /* give the child roughly half the tokens, rounded so it ends up 2 ahead */
  p->counter = (parcnt + ((parcnt & 1) ? 1 : 2)) >> 1;
  parcnt >>= 1;
  if (parcnt > 0) {
   current->counter = parcnt - 1;
  } else {
   current->counter = 0;
   current->need_resched = 1;
  }
 }


There is the other view that I should not lose a token.
In that case the following code adds the token back in the odd counter case.
I think that this is preferable to the first solution.


 p->counter = (current->counter + 3) >> 1;
 current->counter = (current->counter >> 1) - 1;
 if (current->counter <= 0) {
  current->counter = 0;
  current->need_resched = 1;
 }
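
As a quick sanity check of the arithmetic above, a small user-space program
(mine, not part of the proposed patch) can print the resulting split for a
range of parent counter values; the child always ends up at least 2 ahead
and the parent is clamped at zero (where need_resched would be set):

 #include <stdio.h>

 int main(void)
 {
  int cur;

  for (cur = 1; cur <= 10; cur++) {
   int child  = (cur + 3) >> 1;
   int parent = (cur >> 1) - 1;

   if (parent <= 0)
    parent = 0; /* need_resched would be set here */

   printf("counter=%2d -> child=%2d parent=%2d\n",
          cur, child, parent);
  }
  return 0;
 }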



Hubertus Franke
Enterprise Linux Group (Mgr),  Linux Technology Center (Member Scalability)
, OS-PIC (Chair)
email: [EMAIL PROTECTED]
(w) 914-945-2003(fax) 914-945-4425   TL: 862-2003



"Adam J. Richter" <[EMAIL PROTECTED]> on 04/12/2001 08:42:12 PM

To:   Hubertus Franke/Watson/IBM@IBMUS
cc:
Subject:  Re: PATCH(?): linux-2.4.4-pre2: fork should run child first



>p->counter = (current->counter + 1) >> 1;
>current->counter = (current->counter - 1) >> 1;
>schedule();

 I don't have time to try this right now and I'm not sure
what locks are held at that point in the code or whether schedule()
will actually schedule a different process if the current one
has current->counter > 0 (even if current->need_resched is set and
even if another process has a higher proc->counter value).

 Even if it did work, your code is more complex and makes it
less likely that the child will reach exec() before the parent
runs again.  So, I am not sure I would see the advantage if it did work.

Adam J. Richter __ __   4880 Stevens Creek Blvd, Suite 104
[EMAIL PROTECTED] \ /  San Jose, California 95129-1034
+1 408 261-6630 | g g d r a s i l   United States of America
fax +1 408 261-6631  "Free Software For The Rest Of Us."






Re: PATCH(?): linux-2.4.4-pre2: fork should run child first

2001-04-12 Thread Hubertus Franke



First, 2.4.3 already tries to prefer the child.
It only does so in half of the cases, though (odd counter values).

Your patch seems a bit radical for what you want to do.
Taking away all tokens from the parent means it has to wait
until the next recalculate loop.
Since (p) and (current) share the same mm and the same nice value,
why not simply make sure that (p->counter) > (current->counter)?
If you are concerned that a tick is not enough to fire off the
exec, then make the margin a predefined constant.

Try this ... it guarantees that (p->counter) > (current->counter)
and it is not as radical:

 p->counter = (current->counter + 1) >> 1;
 current->counter = (current->counter - 1) >> 1;
 if (!current->counter)
  current->need_resched = 1;


instead of this


-   p->counter = (current->counter + 1) >> 1;
-   current->counter >>= 1;
-   if (!current->counter)
-   current->need_resched = 1;
+   p->counter = current->counter;
+   current->counter = 0;
+   current->need_resched = 1;

Hubertus Franke
Enterprise Linux Group (Mgr),  Linux Technology Center (Member Scalability)
, OS-PIC (Chair)
email: [EMAIL PROTECTED]
(w) 914-945-2003(fax) 914-945-4425   TL: 862-2003



"Adam J. Richter" <[EMAIL PROTECTED]>@vger.kernel.org on 04/12/2001
04:55:16 AM

Sent by:  [EMAIL PROTECTED]


To:   [EMAIL PROTECTED], [EMAIL PROTECTED]
cc:
Subject:  PATCH(?): linux-2.4.4-pre2: fork should run child first



 I remember sometime in the late 80's a fellow at UniSoft
named Don whose last name escapes me just now told me about a
paper presented at a Usenix symposium that had some measurements
that purported that copy-on-write was a performance loss and that
better performance would be achieved by having fork() just copy
all of the writable pages of the parent process.

 It turned out that the particular unix-like system on which
these benchmarks were taken had a version of fork that did not run
the child first.  As it was explained to me then, most of the time,
the child process from a fork will do just a few things and then do
an exec(), releasing its copy-on-write references to the parent's
pages, and that is the big win of copy-on-write for fork() in practice.
This oversight was considered a big embarrassment for the operating
system in question, so I won't name it here.

 Guess why you're seeing this email.  That's right.  Linux-2.4.3's
fork() does not run the child first.  Consequently, the parent will
probably generate unnecessary copy-on-write page copies until it burns
through its remaining clock ticks (any COW's that the child causes will
basically happen no matter what the order of execution is) or calls
wait() (and while the wait is blocking, the parent's CPU priority will
decay as the scheduler periodically recalculates process priorities, so
that bit of dynamic priority has probably not been allocated where the
user will be able to use it, if we want to look at "fairness" in such
detail).

 I suppose that running the child first also has a minor
advantage for clone() in that it should make programs that spawn lots
of threads to do little bits of work behave better on machines with a
small number of processors, since threads that do so little work that
they finish within their time slice will not pile up
before they have a chance to run.  So, rather than give the parent's CPU
priority to the child only if CLONE_VFORK is not set, I have decided to
do a bit of machete surgery and have the child always inherit all of the
parent's CPU priority all of the time.  It simplifies the code and
probably saves a few clock cycles (and before you say that this will
cost a context switch, consider that the child will almost always run
at least one time slice anyhow).

 I have attached the patch below.  I have also adjusted the
comment describing the code.  Please let me know if this hand-waving
explanation is sufficient.  I'm trying to be lazy and not do a measurement
project to justify this relatively simple change.  However, I do know, from
a simple test program ("printf ("%d", fork());"), that this patch has
the intended effect of running the child first.

--
Adam J. Richter __ __   4880 Stevens Creek Blvd, Suite 104
[EMAIL PROTECTED] \ /  San Jose, California 95129-1034
+1 408 261-6630 | g g d r a s i l   United States of America
fax +1 408 261-6631  "Free Software For The Rest Of Us."









Bug in sys_sched_yield

2001-04-11 Thread Hubertus Franke



In the recent optimizations to sys_sched_yield a bug was introduced.
In the current implementation of sys_sched_yield()
the aligned_data and idle_tasks are indexed by logical cpu-#.

They should, however, be indexed by physical cpu-#.
Since logical == physical on the x86 platform, it doesn't matter there;
on other platforms where this is not true, it will matter.
Below is the fix.


diff -uwrbBN linux-2.4.3/kernel/sched.c linux-2.4.3-fix/kernel/sched.c
--- linux-2.4.3/kernel/sched.c Thu Mar 22 12:20:45 2001
+++ linux-2.4.3-fix/kernel/sched.c Wed Apr 11 11:27:16 2001
@@ -1024,9 +1024,11 @@
 int i;

 // Substract non-idle processes running on other CPUs.
-for (i = 0; i < smp_num_cpus; i++)
- if (aligned_data[i].schedule_data.curr != idle_task(i))
+for (i = 0; i < smp_num_cpus; i++) {
+ int cpu = cpu_logical_map(i);
+ if (aligned_data[cpu].schedule_data.curr != idle_task(cpu))
   nr_pending--;
+}
 #else
 // on UP this process is on the runqueue as well
 nr_pending--;

Hubertus Franke
Enterprise Linux Group (Mgr),  Linux Technology Center (Member Scalability)

email: [EMAIL PROTECTED]
(w) 914-945-2003(fax) 914-945-4425   TL: 862-2003







RE: a quest for a better scheduler

2001-04-06 Thread Hubertus Franke




Torrey, nothing to worry about (or at least not yet).

The new scheduler only replaces the current SMP scheduler, not the single
cpu scheduler. So the devices that you refer to are not affected at
all by these changes.

The desire for interactivity etc. led us to stick with the currently
proposed MQ scheduler semantics, namely not to completely isolate the
runqueues from each other (as, e.g., the HP-MQ does), but to do some cross
checking to ensure that high-priority tasks are run when they need to be.

First, you get the same (similarly good or bad) scheduler semantics as the
current scheduler, but at a significantly lower cost for medium to high-end
loads (defined as #runnable tasks > #cpus).

Here is some simple approximate arithmetic. Assume the following:
- Let X be the fixed overhead to get through the scheduler (cur or MQ) once.
- Let Y be the overhead of touching a task to inspect its goodness.
- Let N be the number of tasks.
- Let C be the number of cpus.

The cost for the current scheduler is: X(cur) + N*Y
The cost for the MQ scheduler is:      X(MQ)  + (N/C)*Y + C*Y
(assuming equal runqueue lengths).

This ballpark math shows that, if X(cur) ~ X(MQ), MQ is expected to be
better when (N/C) + C < N, i.e. when N > C*C/(C-1):

C=2  N>4
C=3  N>4
C=4  N>5
C=5  N>6
C=6  N>7
C=7  N>8
C=8  N>9

It turns out that for C>2 this amounts to N > C+1, so MQ will do better.
Note that this assumes no runqueue lock contention; once lock contention
surfaces, things shift further in favor of MQ.
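
A quick user-space check of that break-even point (my own sketch, not part
of the original mail) just evaluates the inequality numerically:

 #include <stdio.h>

 int main(void)
 {
  int c;

  for (c = 2; c <= 8; c++) {
   int n = 1;

   /* smallest N with N/C + C < N, i.e. N > C*C/(C-1) */
   while (!((double) n / c + c < n))
    n++;
   printf("C=%d: MQ cheaper for N >= %d\n", c, n);
  }
  return 0;
 }

The printed values match the table above (e.g. C=2 gives N >= 5, i.e. N > 4).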




Hubertus Franke
Enterprise Linux Group (Mgr),  Linux Technology Center (Member Scalability)
, OS-PIC (Chair)
email: [EMAIL PROTECTED]
(w) 914-945-2003(fax) 914-945-4425   TL: 862-2003



Torrey Hoffman <[EMAIL PROTECTED]>@vger.kernel.org on 04/05/2001
07:53:27 PM

Sent by:  [EMAIL PROTECTED]


To:   "'Timothy D. Witham'" <[EMAIL PROTECTED]>, Linux Kernel List
  <[EMAIL PROTECTED]>
cc:
Subject:  RE: a quest for a better scheduler



Timothy D. Witham wrote :
[...]
> I propose that we work on setting up a straight forward test harness
> that allows developers to quickly test a kernel patch against
> various performance yardsticks.

[...
(proposed large server testbeds)
...]

I like this idea, but could the testbeds also include some
resource-constrained "embedded" or appliance-style systems, and
include performance yardsticks for latency and responsiveness?

It would be unfortunate if work on a revised scheduler resulted
in Linux being a monster web server on 4-way systems, but having
lousy interactive performance on web pads, hand helds, and set top
boxes.

How about a 120 MHz Pentium with 32 MB of RAM and a flash disk, or a
200 MHz PowerPC with 64 MB?  Maybe a Transmeta web pad?

For the process load, something that tests responsiveness and
latency - how about a set of processes doing multicast network
transfers, CPU-intensive MPEG video and audio encode / decode,
and a test app that measures "user response", i.e. if X Windows
was running, would the mouse pointer move smoothly or jerkily?

The "better" scheduler for these applications would make sure that
multicast packets were never dropped, the MPEG decode never dropped
frames, and the "user interface" stayed responsive, etc.

Also, I would not mind a bit if the kernel had tuning options, either
in configuration or through /proc to adjust for these very different
situations.

Torrey Hoffman
[EMAIL PROTECTED]




Re: a quest for a better scheduler

2001-04-05 Thread Hubertus Franke


Excellent idea.

Assuming that you have set up these environments already,
what would be a real treat is to get the average runqueue
length at a given time, for instance every second or so, while
running some of these more sophisticated server-oriented applications
that you mention.

From that we can see which of the various assumptions regarding
runqueue length holds up when the runqueue is not empty.
This would help the current discussion tremendously, as we don't
seem to be able to even agree on this.

A simple bin count would do. If you want a kernel patch that counts this on
every scheduling cycle, I'll write it.
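
Something along these lines is what I have in mind; a rough sketch (mine,
not an actual posted patch), where the nr_running symbol and the hook point
in schedule() are assumptions:

 #define RQLEN_BUCKETS 64

 static unsigned long rqlen_hist[RQLEN_BUCKETS];

 /* called once per pass through schedule() */
 static inline void account_runqueue_length(void)
 {
  unsigned long n = nr_running;  /* global count of runnable tasks */

  if (n >= RQLEN_BUCKETS)
   n = RQLEN_BUCKETS - 1;
  rqlen_hist[n]++;   /* dump the histogram via /proc later */
 }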


Hubertus Franke
Enterprise Linux Group (Mgr),  Linux Technology Center (Member Scalability)
, OS-PIC (Chair)
email: [EMAIL PROTECTED]
(w) 914-945-2003(fax) 914-945-4425   TL: 862-2003



"Timothy D. Witham" <[EMAIL PROTECTED]>@vger.kernel.org on 04/05/2001
06:38:41 PM

Sent by:  [EMAIL PROTECTED]


To:   Linux Kernel List <[EMAIL PROTECTED]>
cc:   [EMAIL PROTECTED]
Subject:  Re: a quest for a better scheduler




  I have been following this thread and thinking that everybody has some
truth in what they are saying, but in the absence of a repeatable test
environment there really isn't a way of arriving at a data-driven decision.

Given the following conditions:

1) The diversity of the problem sets that developers are working on results
   in a real concern that another developer's solution to their performance
   issue might result in a worsening of my performance situation.

2) Most of the developers have a given set of tests that they use when they
   make changes to determine if their change did what they want.

3) The Open Source Development Lab has the facilities for providing a
   large-scale testing environment on several different configurations that
   remain the same over time.

  I propose that we work on setting up a straightforward test harness that
allows developers to quickly test a kernel patch against various performance
yardsticks.  If we use the same set of hardware for the generation of the
performance data from patch to patch, there will be a correlation between
the runs, allowing for a real comparison of the different changes.

The following should be taken only as a starting point.

  As for the base hardware configurations I propose:
 2 way with 1 GB main memory and 2 IDE drives
 4 way with 4 GB main memory and 16 disk drives
 8 way with 8 GB main memory and 24 disk drives

  The types of applications that I have heard people express concern are:

 Web infrastructure:
  Apache
  TUX
  Jabber

 Corporate infrastructure:
  NFS
  raw TCP/IP performance
  Samba

 Database performance:
  Raw storage I/O performance
   OLTP workload

 General usage:
  compile speed (usually measured by kernel compile)

   The OSDL also has the ability to make some of the "for fee" benchmarks
available for use for testing that is done onsite (network access to OSDL
machines counts as onsite), so that people don't have to purchase individual
copies.  Things like SPECMAIL2001, SPECSFS2.0 and SPECWEB99 come to mind.

Comments, suggestions, volunteers?

--
Timothy D. Witham - Lab Director - [EMAIL PROTECTED]
Open Source Development Lab Inc - A non-profit corporation
15275 SW Koll Parkway - Suite H - Beaverton OR, 97006
(503)-626-2455 x11 (office)(503)-702-2871 (cell)
(503)-626-2455 (fax)

On Wed, Apr 04, 2001 at 03:54:54PM -0700, Christopher Smith wrote:
> --On Wednesday, April 04, 2001 15:16:32 -0700 Tim Wright <[EMAIL PROTECTED]>
> wrote:
> > On Wed, Apr 04, 2001 at 03:23:34PM +0200, Ingo Molnar wrote:
> >> nope. The goal is to satisfy runnable processes in the range of NR_CPUS.
> >> You are playing word games by suggesting that the current behavior
> >> prefers 'low end'. 'thousands of runnable processes' is not 'high end'
> >> at all, it's 'broken end'. Thousands of runnable processes are the sign
> >> of a broken application design, and 'fixing' the scheduler to perform
> >> better in that case is just fixing the symptom. [changing the scheduler
> >> to perform better in such situations is possible too, but all solutions
> >> proposed so far had strings attached.]
> >
> > Ingo, you continue to assert this without giving much evidence to back it
> > up. All the world is not a web server. If I'm running a large OLTP
> > database with thousands of clients, it's not at all unreasonable to
> > expect periods where several hundred (forget the thousands) want to be
> > serviced by the database engine. That sounds like hundreds of schedulable
> > entities be they processes or threads or whatever. This sort of load is
> > regularly run on machi

Re: a quest for a better scheduler

2001-04-04 Thread Hubertus Franke


I give you a concrete example:

Running DB2 on an SMP system.

In DB2 there is a process/thread pool that is sized based
on memory and numcpus. People tell me that the size of this pool
is on the order of 100s for an 8-way system with a reasonably
sized database. These determine the number of agents
that can simultaneously execute an SQL statement.

Requests are flying in for transactions (e.g. driven by TPC-W like
applications). The agents are grabbed from the pool and concurrently
fire the SQL transactions.
Assuming that there is enough concurrency in the database, there is
no reason to believe that the majority of those active agents is
not effectively running. TPC-W loads have been observed with 100s of
active transactions at a time.

Of course, limiting the number of agents would reduce the number of
concurrently running tasks, but it would also limit the responsiveness
of the system. Implementing a database in the kernel a la TUX doesn't
seem to be the right approach either (complexity, fault containment, ...)

Hope that is one example people accept.

I can dig up some information on WebSphere Applications.

I'd love to hear from some other applications that fall into
a similar category as the above and substantiate a bit the need
for 100s of running processes, without claiming that the
application is broken.

Hubertus Franke
Enterprise Linux Group (Mgr),  Linux Technology Center (Member Scalability)
, OS-PIC (Chair)
email: [EMAIL PROTECTED]
(w) 914-945-2003(fax) 914-945-4425   TL: 862-2003



Mark Hahn <[EMAIL PROTECTED]> on 04/04/2001 02:28:42 PM

To:   Hubertus Franke/Watson/IBM@IBMUS
cc:
Subject:  Re: a quest for a better scheduler



> ok if the runqueue length is limited to a very small multiple of the #cpus.
> But that is not what high end server systems encounter.

do you have some example of this in mind?  so far, no one has
actually produced an example of a "high end" server that has
long runqueues.







Re: [Lse-tech] Re: a quest for a better scheduler

2001-04-04 Thread Hubertus Franke


Correct, that's true.

Our patch does various things:
(a) limits the search for a task to an admin-specified set of CPUs
    during schedule()
(b) limits the search for a preemptable task to another set of CPUs
    during reschedule_idle()
(c) load balancing, i.e. moving tasks from queue to queue;
    currently we balance within a set and across sets.

Obviously, in NUMA one could specify
 (a) such that multiple sets fall into the same node, i.e. no node crossings,
 (b) this set to at least span a node,
 (c) some intelligent moving based on memory maps etc.

I guess (c) would be the first place to plug in architecture-dependent
information, e.g. how much memory footprint a task has on a particular
node and how much the move would cost.
The load balancing we provide is a simple skeleton to tickle your mind,
not a solution. Nevertheless, one can see it can have some impact.

See the following for results for various combinations of pool sizes and
balancing:
http://lse.sourceforge.net/scheduling/results012501/status.html#Load%20Balancing


Hubertus Franke
Enterprise Linux Group (Mgr),  Linux Technology Center (Member Scalability)

email: [EMAIL PROTECTED]
(w) 914-945-2003(fax) 914-945-4425   TL: 862-2003



Kanoj Sarcar <[EMAIL PROTECTED]> on 04/04/2001 01:14:28 PM

To:   Hubertus Franke/Watson/IBM@IBMUS
cc:   [EMAIL PROTECTED] (Linux Kernel List),
  [EMAIL PROTECTED]
Subject:  Re: [Lse-tech] Re: a quest for a better scheduler



>
>
>
> Kanoj, our CPU pooling + load balancing allows you to do that.
> The system administrator can specify at runtime, through a
> /proc filesystem interface, the cpu-pool size and whether load balancing
> should take place.

Yes, I think this approach can support the various requirements
put on the scheduler.

I think there are two degrees of freedom that are needed in the
scheduler. One, as you say, for the sysadmin to be able to specify
what overall scheduler behavior he wants.

Secondly, from the kernel standpoint, there need to be per-arch
hooks, to be able to utilize node-level/multilevel caches, NUMA
aspects etc.

Kanoj






Re: a quest for a better scheduler

2001-04-04 Thread Hubertus Franke


Well put. This is how we can eliminate searching all bins or lists, and
that's how we do it under
http://lse.sourceforge.net/scheduling/2.4.1-pre8-prioSched.

If you have a list per priority level, then you can even pick the first one
you find if it is on the same level. That's what I tried in a more recent
implementation, combined with using a bitmask to represent non-empty lists.

Hubertus Franke
Enterprise Linux Group (Mgr),  Linux Technology Center (Member Scalability)
, OS-PIC (Chair)
email: [EMAIL PROTECTED]
(w) 914-945-2003(fax) 914-945-4425   TL: 862-2003



Davide Libenzi <[EMAIL PROTECTED]>@ewok.dev.mycio.com on 04/04/2001
12:12:54 PM

Sent by:  [EMAIL PROTECTED]


To:   Ingo Molnar <[EMAIL PROTECTED]>
cc:   Linus Torvalds <[EMAIL PROTECTED]>, Alan Cox
  <[EMAIL PROTECTED]>, Linux Kernel List
  <[EMAIL PROTECTED]>, Hubertus Franke/Watson/IBM@IBMUS,
  Mike Kravetz <[EMAIL PROTECTED]>, Fabio Riccardi
  <[EMAIL PROTECTED]>
Subject:  Re: a quest for a better scheduler




On 04-Apr-2001 Ingo Molnar wrote:
>
> On Tue, 3 Apr 2001, Fabio Riccardi wrote:
>
>> I've spent my afternoon running some benchmarks to see if MQ patches
>> would degrade performance in the "normal case".
>
> no doubt priority-queue can run almost as fast as the current scheduler.
> What i'm worried about is the restriction of the 'priority' of processes,
> it cannot depend on previous processes (and other 'current state')
> anymore.
>
> to so we have two separate issues:
>
>#1: priority-queue: has the fundamental goodness() design limitation.
>
>#2: per-CPU-runqueues: changes semantics, makes scheduler less
> effective due to nonglobal decisions.
>
> about #1: while right now the prev->mm rule appears to be a tiny issue (it
> might not affect performance significantly), but forbidding it by
> hardcoding the assumption into data structures is a limitation of *future*
> goodness() functions. Eg. with the possible emergence of CPU-level
> threading and other, new multiprocessing technologies, this could be a
> *big* mistake.

This is not correct Ingo. I haven't seen the HP code, but if You store
processes in slots S :

S = FS( goodness(p, p->processor, p->mm) )

and You start scanning from the higher slots, then as soon as You find a
task with a goodness G' that is equal to the max goodness in the slot, You
can choose that process to run.
Again, if You haven't found such a goodness during the slot scan but You've
found a task with a goodness G' :

G' >= SG - DD

where :

SG = max slot goodness
DD = SG(i) - SG(i - 1)

You can select that task as the next to spin.
This was the behaviour that was implemented in my scheduler patch about
2 years ago.
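
As a purely hypothetical illustration of that slot scan (not Davide's
actual patch; the slot layout and helper names are assumptions), the
selection loop would look roughly like this:

 struct task_struct *slot_pick(int this_cpu, struct mm_struct *this_mm)
 {
  struct task_struct *fallback = NULL;
  int fallback_g = -1000;
  int s;

  for (s = highest_nonempty_slot(); s >= 0; s--) {
   int slot_max = slot_max_goodness(s); /* SG */
   struct task_struct *p;

   for (p = slot_first(s); p; p = slot_next(p)) {
    int g = goodness(p, this_cpu, this_mm);

    if (g >= slot_max)
     return p; /* as good as this slot gets */
    if (g > fallback_g) {
     fallback_g = g;
     fallback = p;
    }
   }
   /* G' >= SG - DD: close enough, stop scanning lower slots */
   if (fallback && fallback_g >= slot_max - slot_spacing(s))
    return fallback;
  }
  return fallback;
 }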
Besides this, I think that with such loads we have more serious problems to
face inside the kernel.



- Davide







Re: [Lse-tech] Re: a quest for a better scheduler

2001-04-04 Thread Hubertus Franke



Kanoj, our CPU pooling + load balancing allows you to do that.
The system administrator can specify at runtime, through a
/proc filesystem interface, the cpu-pool size and whether load balancing
should take place.
We can put the limiting to the local cpu-set during reschedule_idle()
back into the code, to make it complete and compatible with
the approach that Andrea has taken.
This way, one can fully isolate or combine cpu-sets.

Here is the code for the pooling:
http://lse.sourceforge.net/scheduling/LB/2.4.1-MQpool

Load balancing and the /proc interface are combined in this module:
http://lse.sourceforge.net/scheduling/LB/loadbalance.c

A writeup explaining this concept is available under
http://lse.sourceforge.net/scheduling/LB/poolMQ.html

Prerequisite is the MQ scheduler:
http://lse.sourceforge.net/scheduling/2.4.1.mq1-sched

We need to update these for 2.4.3 (coming).

Hubertus Franke
Enterprise Linux Group (Mgr),  Linux Technology Center (Member Scalability)
, OS-PIC (Chair)
email: [EMAIL PROTECTED]
(w) 914-945-2003(fax) 914-945-4425   TL: 862-2003



Kanoj Sarcar <[EMAIL PROTECTED]>@lists.sourceforge.net on
04/04/2001 12:50:58 PM

Sent by:  [EMAIL PROTECTED]


To:   [EMAIL PROTECTED] (Andrea Arcangeli)
cc:   [EMAIL PROTECTED] (Ingo Molnar), Hubertus Franke/Watson/IBM@IBMUS,
  [EMAIL PROTECTED] (Mike Kravetz), [EMAIL PROTECTED] (Fabio
  Riccardi), [EMAIL PROTECTED] (Linux Kernel List),
  [EMAIL PROTECTED]
Subject:  Re: [Lse-tech] Re: a quest for a better scheduler



>
> I didn't seen anything from Kanoj but I did something myself for the
> wildfire:
>
> ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.3aa1/10_numa-sched-1
>
> this is mostly an userspace issue, not really intended as a kernel
> optimization (however it's also partly a kernel optimization). Basically it
> splits the load of the numa machine into per-node load, there can be
> unbalanced load across the nodes but fairness is guaranteed inside each
> node. It's not extremely well tested but benchmarks were ok and it is at
> least certainly stable.
>

Just a quick comment. Andrea, unless your machine has some hardware
that implies per-node runqueues will help (node-level caches etc.), I fail
to understand how this is helping you ... here's a simple theory though.
If your system is lightly loaded, your per-node queues are actually
implementing some sort of affinity, making sure processes stick to
cpus on nodes where they have allocated most of their memory. I am
not sure what the situation will be under huge loads though.

As I have mentioned to some people before, percpu/pernode/percpuset/global
runqueues probably all have their advantages and disadvantages, and their
own sweet spots. Wouldn't it be really neat if a system administrator
or performance expert could pick and choose what scheduler behavior he
wants, based on how the system is going to be used?

Kanoj




Re: a quest for a better scheduler

2001-04-04 Thread Hubertus Franke



You imply that high end means thousands of processes,
simply because we have shown that in our graphs as an
asymptotic end.
No, it could mean 5*#cpus, and that is not all that absurd.
This could happen with a spike in demand.

TUX is not the greatest example to use, because it does
static webpage serving and is hence tied into the
file cache. If you move up the food chain, where middleware
is active and things are a bit more complicated than
sending stuff out of the filecache, having a bunch of threads
hanging around to deal with a spike in demand is common
practice, although you think it's bad.

Now coming back to MQ (forget about the priority list from
now on).

When we scan either the local or the realtime queues we do
use the goodness value. So we have the same flexibility as the
current scheduler when it comes to goodness() flexibility and
future improvements.

For remote stealing we use na_goodness to compare
for a better process to steal. Hence we would lose the "+1"
information here; nevertheless, once a decision
has been made, we still use preemption verification with goodness.
Either way, being off by "+1" for regular tasks once in a while
is no big deal, because this problem already exists today.
While walking the runqueue, another processor can either update
the counter value of a task (ok, that's covered by can_schedule)
or can run recalculate, which makes the decision that one is
about to make wrong from the point of view of always running the best.
But that's ok, because counter, nice etc. are approximations
anyway. Throw in PROC_CHANGE_PENALTY and you have a few knobs
that are used to control interactivity and throughput.

If you look at some of the results with our reflex benchmark:
for low thread counts we basically show improved performance
on the 2-, 4-, 5-, 6-, 7- and 8-way systems; if #threads < #cpus, they all
show improvements. Look at the following numbers from the
reflex benchmark; the leftmost column is the number of threads, with
typically 1/2 of them runnable.
You can clearly see that the priority list suffers from overhead,
but MQ beats vanilla pretty much everywhere. Again, this
is because of rapid scheduler invocation and the resulting
lock contention.

          2-way
     2.4.1   2.4.1-mq1  2.4.1-prbit
 4   6.725   4.691       7.387
 8   6.326   4.766       6.421
12   6.838   5.233       6.431
16   7.13    5.415       7.29

          4-way
     2.4.1   2.4.1-mq1  2.4.1-prbit
 4   9.42    7.873      10.592
 8   8.143   3.799       8.691
12   7.877   3.537       8.101
16   7.688   2.953       7.144

          6-way
     2.4.1   2.4.1-mq1  2.4.1-prbit
 4   9.595   7.88       10.358
 8   9.703   7.278      10.523
12  10.016   4.652      10.985
16   9.882   3.629      10.525

          8-way
     2.4.1   2.4.1-mq1  2.4.1-prbit
 4   9.804   8.033      10.548
 8  10.436   5.783      11.475
12  10.925   6.787      11.646
16  11.426   5.048      11.877
20  11.438   3.895      11.633
24  11.457   3.304      11.347
28  11.495   3.073      11.09
32  11.53    2.944      10.898


Hubertus Franke
Enterprise Linux Group (Mgr),  Linux Technology Center (Member Scalability)

email: [EMAIL PROTECTED]
(w) 914-945-2003(fax) 914-945-4425   TL: 862-2003



Ingo Molnar <[EMAIL PROTECTED]>@elte.hu> on 04/04/2001 09:23:34 AM

Please respond to <[EMAIL PROTECTED]>

Sent by:  <[EMAIL PROTECTED]>


To:   Hubertus Franke/Watson/IBM@IBMUS
cc:   Mike Kravetz <[EMAIL PROTECTED]>, Fabio Riccardi
  <[EMAIL PROTECTED]>, Linux Kernel List
  <[EMAIL PROTECTED]>
Subject:  Re: a quest for a better scheduler




On Wed, 4 Apr 2001, Hubertus Franke wrote:

> I understand the dilemma that the Linux scheduler is in, namely
> satisfy the low end at all cost. [...]

nope. The goal is to satisfy runnable processes in the range of NR_CPUS.
You are playing word games by suggesting that the current behavior prefers
'low end'. 'thousands of runnable processes' is not 'high end' at all,
it's 'broken end'. Thousands of runnable processes are the sign of a
broken application design, and 'fixing' the scheduler to perform better in
that case is just fixing the symptom. [changing the scheduler to perform
better in such situations is possible too, but all solutions proposed so
far had strings attached.]

 Ingo







Re: a quest for a better scheduler

2001-04-04 Thread Hubertus Franke


Yes, Andrea.

We actually already went a step further. We treat the scheduler
as a single entity, rather than splitting it up.

Based on the MQ scheduler we do the balancing across all nodes
at reschedule_idle time. We experimented to see whether only looking
for idle tasks remotely is a good idea, but it bloats the code.
Local scheduling decisions are limited to a set of cpus, which
could coincide with the cpus of one node, or, if desirable, with
smaller granularity.

In addition we implemented a global load balancing scheme that
ensures that load is equally distributed across all runqueues.
This is made a loadable module, so you can plug in whatever
you want.

I grant that in NUMA it might actually be desirable to separate
schedulers completely (we can do that trivially in reschedule_idle),
but you need load balancing at some point.

Here is the list of patches:

MultiQueue Scheduler: http://lse.sourceforge.net/scheduling/2.4.1.mq1-sched
Pooling Extension:    http://lse.sourceforge.net/scheduling/LB/2.4.1-MQpool
Load Balancing:       http://lse.sourceforge.net/scheduling/LB/loadbalance.c


Hubertus Franke
Enterprise Linux Group (Mgr),  Linux Technology Center (Member Scalability)
, OS-PIC (Chair)
email: [EMAIL PROTECTED]
(w) 914-945-2003(fax) 914-945-4425   TL: 862-2003



Andrea Arcangeli <[EMAIL PROTECTED]> on 04/04/2001 11:08:47 AM

To:   Ingo Molnar <[EMAIL PROTECTED]>
cc:   Hubertus Franke/Watson/IBM@IBMUS, Mike Kravetz
  <[EMAIL PROTECTED]>, Fabio Riccardi <[EMAIL PROTECTED]>, Linux
  Kernel List <[EMAIL PROTECTED]>,
  [EMAIL PROTECTED]
Subject:  Re: a quest for a better scheduler



On Wed, Apr 04, 2001 at 03:34:22PM +0200, Ingo Molnar wrote:
>
> On Wed, 4 Apr 2001, Hubertus Franke wrote:
>
> > Another point to raise is that the current scheduler does a exhaustive
> > search for the "best" task to run. It touches every process in the
> > runqueue. this is ok if the runqueue length is limited to a very small
> > multiple of the #cpus. [...]
>
> indeed. The current scheduler handles UP and SMP systems, up to 32
> (perhaps 64) CPUs efficiently. Agressively NUMA systems need a different
> approach anyway in many other subsystems too, Kanoj is doing some
> scheduler work in that area.

I didn't seen anything from Kanoj but I did something myself for the
wildfire:


ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.3aa1/10_numa-sched-1


this is mostly an userspace issue, not really intended as a kernel
optimization (however it's also partly a kernel optimization). Basically it
splits the load of the numa machine into per-node load, there can be
unbalanced load across the nodes but fairness is guaranteed inside each
node. It's not extremely well tested but benchmarks were ok and it is at
least certainly stable.

However Ingo consider that in a 32-way if you don't have at least 32 tasks
running all the time _always_ you're really stupid paying such big money
for nothing ;). So the fact the scheduler is optimized for 1/2 tasks running
all the time is not nearly enough for those machines (and of course also the
scheduling rate automatically increases linearly with the increase of the
number of cpus). Now it's perfectly fine that we don't ask the embedded and
desktop guys to pay for that, but a kernel configuration option to select an
algorithm that scales would be a good idea IMHO. The above patch just adds a
CONFIG_NUMA_SCHED. The scalable algorithm can fit into it and nobody will be
hurted by that (CONFIG_NUMA_SCHED cannot even be selected by x86 compiles).

Andrea






Re: a quest for a better scheduler

2001-04-04 Thread Hubertus Franke


I grant you that the code is not as clean as the current scheduler,
so maybe you missed that part.

For the priority scheduler:
Yes, task_to_qid assumes NO affinity (no cpu, no mm) when determining
the list index to which the task has to be enqueued.
However, if you wander down to the search_range section of the scheduler,
you see that we do add the "+1" for equal mm. All I did here was take
the goodness() function apart for search_range in order to insert some
extra code that speeds it up.

Again, I don't want to lean too hard on the priority scheduler, because
we only did this to have a reference point regarding lock contention and
lock hold time. This is stuff that many operating systems did years ago,
most of which have moved on to add MQ to that. So that is what the LSE
scheduling team is pushing.

I understand the dilemma that the Linux scheduler is in, namely satisfy
the low end at all cost. But I believe that in order for Linux to push
into the high end we need to address the scalability of the current
scheduler.
Simply handwaving and declaring that lots of running tasks is a stupid
idea doesn't get us there.
If you indeed assume that #running-tasks ~ #cpus, then each task
gets a reasonable amount of work and any small overhead incurred
should be amortizable. However, if, as we contend, #running-tasks >> #cpus
is a situation we need to deal with, then living with the current lock
contention can really drag us down.


Hubertus Franke
Enterprise Linux Group (Mgr),  Linux Technology Center (Member Scalability)

email: [EMAIL PROTECTED]
(w) 914-945-2003(fax) 914-945-4425   TL: 862-2003



Ingo Molnar <[EMAIL PROTECTED]>@elte.hu> on 04/04/2001 02:28:31 AM

Please respond to <[EMAIL PROTECTED]>

Sent by:  <[EMAIL PROTECTED]>


To:   Mike Kravetz <[EMAIL PROTECTED]>
cc:   Hubertus Franke/Watson/IBM@IBMUS, Fabio Riccardi
  <[EMAIL PROTECTED]>, Linux Kernel List
  <[EMAIL PROTECTED]>
Subject:  Re: a quest for a better scheduler




On Tue, 3 Apr 2001, Mike Kravetz wrote:

> Our 'priority queue' implementation uses almost the same goodness
> function as the current scheduler.  The main difference between our
> 'priority queue' scheduler and the current scheduler is the structure
> of the runqueue.  We break up the single runqueue into a set of
> priority based runqueues. The 'goodness' of a task determines what
> sub-queue a task is placed in.  Tasks with higher goodness values are
> placed in higher priority queues than tasks with lower goodness
> values. [...]

we are talking about the same thing, re-read my mail. this design assumes
that 'goodness' is constant in the sense that it does not depend on the
previous process.

and no, your are not using the same goodness() function, you omitted the
prev->mm == next->mm component to goodness(), due to this design
limitation.

 Ingo








Re: a quest for a better scheduler

2001-04-04 Thread Hubertus Franke



This is an important point that Mike is raising, and it also addresses a
critique that Ingo issued yesterday, namely interactivity and fairness.
The HP scheduler completely separates the per-CPU runqueues and does
not take preemption goodness or the like into account. This can lead to
unfair apportionment of CPU cycles, strong priority inversion and a
potential lack of interactivity.

Our MQ scheduler does yield the same decision in most cases
(other than as determined by some race conditions on locks and counter
members).

It is not clear that yielding the same decision as the current scheduler
is the ultimate goal to shoot for, but it allows comparison.

Another point to raise is that the current scheduler does an exhaustive
search for the "best" task to run. It touches every process in the runqueue.
This is ok if the runqueue length is limited to a very small multiple of the
#cpus. But that is not what high end server systems encounter.

With the rising number of processors, lock contention can quickly become
a bottleneck. If we assume that load (#running-tasks) increases somewhat
linearly with #cpus, the problem gets even worse, because not only have I
increased the number of clients but also the lock hold time.

Clinging to the statement that #running-tasks ~ #cpus of course saves
you from that dilemma, but not everybody is signing on to this limitation.

MQ and priority-list help in 2 ways.

MQ reduces the average lock hold time because on average it only inspects
#running-tasks/#cpus tasks to make a local decision. It then goes on to
inspect (#cpus-1) data structures representing the next-best-to-run tasks
on those remote cpus, all without holding the lock, thus substantially
reducing lock contention. Note we still search the entire set of runnable
tasks, however we do it in a distributed, collaborative manner.
The only time we deviate from the current scheduler decision is in the
case when two cpus have identified the same remote task as a target for
remote stealing. In that case one will win and the other cpu will continue
looking somewhere else, although there might have been another task
on that cpu to steal.
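
In rough pseudo-C (a sketch of the scheme just described, not the actual MQ
patch; the per-CPU hint structure and the local_* helpers are assumptions
for illustration):

 struct cpu_rq_hint {
  struct task_struct *best; /* published next-best task on that cpu */
  int best_goodness;
 };

 extern struct cpu_rq_hint rq_hint[NR_CPUS];

 static struct task_struct *mq_pick_next(int this_cpu)
 {
  /* local decision: own runqueue, own lock (assumed held) */
  struct task_struct *p = local_best(this_cpu);
  int best = p ? local_best_goodness(this_cpu) : 0;
  int cpu;

  /* lock-free peek at the other cpus' published candidates */
  for (cpu = 0; cpu < smp_num_cpus; cpu++) {
   if (cpu == this_cpu)
    continue;
   if (rq_hint[cpu].best_goodness > best) {
    best = rq_hint[cpu].best_goodness;
    p = rq_hint[cpu].best; /* steal attempt; may lose the race */
   }
  }
  return p;
 }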

Priority-list schedulers (PRS) only help in reducing lock hold time,
which can result in some relief wrt lock contention, but not a whole lot.
PRS can limit the lists it has to search based on the PROC_CHANGE_PENALTY.
It also keeps 0-counter tasks in a list that is never inspected. One can
even go further and put YIELD tasks in a separate list, given that
sys_sched_yield already does some optimizations.
The older version (12/00) posted on LSE is functionally equivalent to the
current scheduler.
I will put up another version this week that is based on a bitmask and
which is a bit more aggressive in that it ignores the MM goodness boost of
1, which in my book is merely a tie breaker between two tasks of equal
goodness.

Beyond that we have done some work on cpu pooling, which is to identify
a set of cpus that form a scheduling set. We still consider in
reschedule_idle all cpus for preemption, but in schedule it is sufficient
to only schedule within the own set. That again can limit lock hold time
with MQ, and we have seen some improvements. We also deploy load balancing.

To summarize, we have taken great care to try to preserve the current
scheduler semantics and have laid out a path to relax some of the semantics
for further improvements.
I don't believe that the HP scheduler is sufficient, since it is lacking
load balancing, which naturally occurs in our MQ scheduler, and it lacks
the interactivity requirements that Ingo pointed out.

Most of these things are discussed in great detail in the writeups under
lse.sourceforge.net/scheduling. I also believe we show there that the
MQ performance for low thread counts matches the vanilla case.

I further don't understand the obsession with keeping the scheduler simple.
If there are improvements, and I don't believe the MQ is all that
complicated (and it is well documented), why not put it in?


Hubertus Franke
Enterprise Linux Group (Mgr),  Linux Technology Center (Member Scalability)

email: [EMAIL PROTECTED]
(w) 914-945-2003(fax) 914-945-4425   TL: 862-2003



Mike Kravetz <[EMAIL PROTECTED]> on 04/03/2001 10:47:00 PM

To:   Fabio Riccardi <[EMAIL PROTECTED]>
cc:   Mike Kravetz <[EMAIL PROTECTED]>, Ingo Molnar <[EMAIL PROTECTED]>,
  Hubertus Franke/Watson/IBM@IBMUS, Linux Kernel List
  <[EMAIL PROTECTED]>, Alan Cox <[EMAIL PROTECTED]>
Subject:  Re: a quest for a better scheduler



On Tue, Apr 03, 2001 at 05:18:03PM -0700, Fabio Riccardi wrote:
>
> I have measured the HP and not the "scalability" patch because the two do
> more or less the same thing and give me the same performance advantages,
> but the former is a lot simpler and I could port it with no effort on any
> recent kernel.

Actually, there is a significant difference between the HP patch a

Re: [Lse-tech] more on scheduler benchmarks

2001-01-22 Thread Hubertus Franke


Mike,

Deactivating that optimization is a good idea.
What we are interested in is the general latency of the scheduler code;
this should help to determine that.

The only problem I have with sched_yield-like benchmarks is that they create
artificial lock contention, as we basically spend most of the time, other
than context switching + syscall overhead, under the scheduler lock. We
won't see this in real apps; that's why I think the chatroom numbers are
probably better indicators.


Hubertus Franke
Enterprise Linux Group (Mgr),  Linux Technology Center (Member Scalability)
, OS-PIC (Chair)
email: [EMAIL PROTECTED]
(w) 914-945-2003(fax) 914-945-4425   TL: 862-2003



Mike Kravetz <[EMAIL PROTECTED]>@lists.sourceforge.net on 01/22/2001
01:17:38 PM

Sent by:  [EMAIL PROTECTED]


To:   [EMAIL PROTECTED]
cc:   [EMAIL PROTECTED], Ingo Molnar <[EMAIL PROTECTED]>
Subject:  [Lse-tech] more on scheduler benchmarks



Last week while discussing scheduler benchmarks, Bill Hartner
made a comment something like the following "the benchmark may
not even be invoking the scheduler as you expect".  This comment
did not fully sink in until this weekend when I started thinking
about changes made to sched_yield() in 2.4.0.  (I'm cc'ing Ingo
Molnar because I think he was involved in the changes).  If you
haven't taken a look at sys_sched_yield() in 2.4.0, I suggest
that you do that now.

A result of new optimizations made to sys_sched_yield() is that
calling sched_yield() does not result in a 'reschedule' if there
are no tasks waiting for CPU resources.  Therefore, I would claim
that running 'scheduler benchmarks' which loop doing sched_yield()
has little meaning/value for runs where the number of
looping tasks is less than the number of CPUs in the system.  Is
that an accurate statement?

If the above is accurate, then I am wondering what would be a
good scheduler benchmark for these low task count situations.
I could undo the optimizations in sys_sched_yield() (for testing
purposes only!), and run the existing benchmarks.  Can anyone
suggest a better solution?

Thanks,
--
Mike Kravetz [EMAIL PROTECTED]
IBM Linux Technology Center
15450 SW Koll Parkway
Beaverton, OR 97006-6063 (503)578-3494




Re: [Lse-tech] Re: multi-queue scheduler update

2001-01-22 Thread Hubertus Franke


By popular demand, here are a few numbers for small thread counts
running the sched_yield_test benchmark on a 2-way SMP with the following
characteristics:

model name  : Pentium III (Katmai)
stepping: 3
cpu MHz : 551.266
cache size  : 512 KB

I compare 2.4.1-pre8 kernels (vanilla, table/prio scheduler and
multiqueue).

#T : van   prio  MQ
--
 1 : 0.591 0.582 0.750
 2 : 0.295 0.293 0.377
 3 : 2.091 2.373 1.010
 4 : 1.894 1.783 1.558
 5 : 1.949 1.794 1.591
 6 : 2.003 1.803 1.605
 7 : 2.050 1.805 1.654
 8 : 2.118 1.816 1.676
 9 : 2.174 1.811 1.708
10 : 2.235 1.821 1.744
11 : 2.304 1.823 1.780
12 : 2.365 1.831 1.863
13 : 2.427 1.829 1.870
14 : 2.494 1.841 1.950
15 : 2.578 1.839 1.959
16 : 2.691 1.865 2.043
17 : 2.804 1.855 2.041
18 : 2.893 1.873 2.127
19 : 3.001 1.851 2.079
20 : 3.098 1.878 2.182
21 : 3.191 1.851 2.178
22 : 3.263 1.884 2.233
23 : 3.332 1.850 2.231
24 : 3.403 1.901 2.272
25 : 3.472 1.865 2.251
26 : 3.540 1.923 2.305
27 : 3.604 1.872 2.295
28 : 3.680 1.900 2.333
29 : 4.204 1.883 2.329
30 : 4.256 1.944 2.358
31 : 3.875 1.936 2.325
32 : 4.476 1.953 2.339


Hubertus Franke
Enterprise Linux Group (Mgr),  Linux Technology Center (Member Scalability)
, OS-PIC (Chair)
email: [EMAIL PROTECTED]
(w) 914-945-2003(fax) 914-945-4425   TL: 862-2003



Andrea Arcangeli <[EMAIL PROTECTED]>@lists.sourceforge.net on 01/18/2001
08:30:41 PM

Sent by:  [EMAIL PROTECTED]


To:   Mike Kravetz <[EMAIL PROTECTED]>
cc:   [EMAIL PROTECTED], [EMAIL PROTECTED]
Subject:  Re: [Lse-tech] Re: multi-queue scheduler update



On Thu, Jan 18, 2001 at 04:52:25PM -0800, Mike Kravetz wrote:
> was less than the number of processors.  I'll give the tests a try
> with a smaller number of threads.  I'm also open to suggestions for

OK!

> what benchmarks/test methods I could use for scheduler testing.  If
> you remember what people have used in the past, please let me know.

It was this one IIRC (it spawns threads calling sched_yield() in a loop).

/*
  Tester for the kernel's speed in scheduling.
  (C) 1999 / Willy Tarreau <[EMAIL PROTECTED]>

  Modified by Davide Libenzi <[EMAIL PROTECTED]>


  You can do whatever you want with this program, but I'm not
  responsible for any misuse. Be aware that it can heavily load
  a host. As it is multithreaded, it might take advantages of SMP.

  It basically creates a growing amount of threads and measures
  their cumulative work (i.e. loop iterations/second). The output
  is easily useable by gnuplot.

  To compile, you need libpthread :

 gcc -O2 -fomit-frame-pointer -o threads threads.c -lpthread

  Output on stdout is :


*/

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h>
#include <time.h>



#define MAXTHREADS  450
#define MEASURE_TIME 60



pthread_t   thr[MAXTHREADS];
int nbthreads = MAXTHREADS;
int measure_time = MEASURE_TIME;
volatile int    actthreads = 0;

long long int   totalwork[MAXTHREADS];
volatile int    stop = 0,
start = 0,
count = 0;

void    oneatwork(int thr)
{
int i;
while (!start)  /* don't disturb pthread_create() */
usleep(1);

actthreads++;
while (!stop)
{
if (count)
totalwork[thr]++;

syscall(158); /* sys_sched_yield() */
}
actthreads--;
pthread_exit(0);
}

int main(int argc, char **argv)
{

int i,
err,
avgwork,
thrzero;
long long int   value,
avgvalue;
double  sqrdev;
time_t  ts,
te;

if (argc < 3)
{
printf("usage: %s  threads  time\n", argv[0]);
exit(1);
}

nbthreads = atoi(argv[1]);
measure_time = atoi(argv[2]);


start = 0;
count = 0;
stop = 0;
actthreads = 0;
thrzero = 0;
value = 0;
sqrdev = 0.0;

fprintf(stderr, "\nCreating %d threads ...", nbthreads);
for (i = 0; i < nbthreads; i++)
{
if ((err = pthread_create(&thr[i], NULL, (void *) &oneatwork, (void *) i)) != 0)
{
fprintf(stderr, "thread %d pthread_create=%d -> ", i, err);
perror("");
exit(1);
}
pthread_detach(thr[i]);
}

for (i = 0; i < nbthreads; i++)
totalwork[i] = 0;

fprintf(stderr, " OK !\nWaiting for all threads to start ...");

start = 1;
while (actthreads != nbthreads)
usleep(1); /* waiting for a bit of stability */

fprintf(stderr, "Go !\n");

count = 1;
time(&ts);

sleep(measure_time);

count = 0;
stop = 1;
time(&te);


for (i = 0; i < nbthreads; i++)
{
value += totalwork[i];
if (totalwork[i] == 0)
++thrzero;
}
avgvalue = value / nbthreads

Re: [Lse-tech] Re: multi-queue scheduler update

2001-01-19 Thread Hubertus Franke


Mike, sounds good. From now on we will do all our measurements with thread
counts covering the entire range from 1 to 16, then in powers of two up to
2048, and for maxcpus=1,2,4,6,8. Do you think that 4096 is overkill? So far
the numbers you got and we got over here are the same. Andi suggested that
 has some problems with IO scheduling.

You are right, "hopefully + ballpark" ~= 10%.
As for intelligent decisions, the general load balancing that we already
started might help out a bit here.

Other stuff we could look into
Remember we talked about counting active idle threads at some point.

if (active_idle_threads < smp_num_cpus) {
        /*
         * Now we know that we can simply give the task to the first
         * idle_thread found, instead of collecting the max_na_goodness
         * value and sorting through it as the current vanilla
         * algorithm does.
         */
} else {
        /* current MQ algorithm */
}

Just shooting from the hip here, let's restart this discussion.
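
One way to read that in code, purely as a user-space sketch (the names
active_idle_threads, na_goodness and the per-cpu bookkeeping are stand-ins,
not what the patch actually uses):

#include <stdio.h>

#define NCPUS 8

struct cpu_state {
        int idle;               /* 1 if this cpu is running its idle thread */
        int na_goodness;        /* goodness of the task favored there
                                   (lower = easier to preempt) */
};

/* Pick a cpu to preempt for a newly woken task.  If some cpu is idle we
 * take the first one found and skip the full goodness scan; otherwise we
 * fall back to scanning all runqueues, as the current MQ code does. */
static int pick_cpu(const struct cpu_state cpu[], int ncpus)
{
        int i, best = 0;

        for (i = 0; i < ncpus; i++)
                if (cpu[i].idle)
                        return i;       /* fast path: first idle cpu wins */

        for (i = 1; i < ncpus; i++)     /* slow path: full scan */
                if (cpu[i].na_goodness < cpu[best].na_goodness)
                        best = i;
        return best;
}

int main(void)
{
        struct cpu_state cpu[NCPUS] = { { 0, 20 }, { 1, 0 }, { 0, 5 } };

        printf("preempt cpu %d\n", pick_cpu(cpu, NCPUS));
        return 0;
}

The point is only the fast path: as long as idle cpus exist, the first one
found is good enough and the max_na_goodness collection can be skipped.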


Hubertus Franke
Enterprise Linux Group (Mgr),  Linux Technology Center (Member Scalability)
, OS-PIC (Chair)
email: [EMAIL PROTECTED]
(w) 914-945-2003(fax) 914-945-4425   TL: 862-2003



Mike Kravetz <[EMAIL PROTECTED]> on 01/19/2001 12:11:04 PM

To:   Hubertus Franke/Watson/IBM@IBMUS
cc:   [EMAIL PROTECTED], [EMAIL PROTECTED]
Subject:  Re: [Lse-tech] Re: multi-queue scheduler update



On Fri, Jan 19, 2001 at 10:47:06AM -0500, Hubertus Franke wrote:

> What you can see from these numbers is that MQ does an awesome job up to
> 1024 threads. When measuring in the future, we will take from now on the
> general concern about low number of threads into account. Your points are
> well taken. I m pretty confident our MQ scheduler will be in reasonable
> ballpark of the current scheduler.


Hubertus,

'Hopefully' the multi-queue scheduler will be in the ballpark for
low number of threads.  However, remember the extra overhead being
incurred in the current implementation.  To maintain existing
scheduler behavior, we look at all CPU specific runqueues to find
the highest priority (goodness) task in the system.  Therefore,
when running with a single thread on an 8 processor system, we
examine 8 runqueues instead of the single global runqueue.  In
a test where tasks are simply spinning doing sched_yield()s, I
suspect this difference may be significant.

I'll run the IIRC benchmark with low thread counts, and post the
results.  In adition, I have some ideas on how to make intelligent
decisions to avoid examining all runqueueus when the number of
running tasks is less than the number of processors.

--
Mike Kravetz [EMAIL PROTECTED]
IBM Linux Technology Center



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Lse-tech] Re: multi-queue scheduler update

2001-01-19 Thread Hubertus Franke


In the sched_yield benchmark case this is not a problem, because the
threads don't have any memory footprint; they are all cloned.
For the chatroom, I agree with you. However, I assume that these big irons
(8-ways) will pretty much be loaded with at least 1MB of cache. Maybe at
this point another site with an 8-way system and a small cache could run
this. I don't know whether those actually exist.

Alternatively, we could set up a smaller 4-way system (we have a 4-way
300MHz P-II Xeon with 512KB cache) that would fit into your class, and we
could collect the numbers on that and post them as well.

We are automating the reboot process right now, modifying lilo.conf so we
can run many tests with different "maxcpus=.." settings unattended.

So little to do, so much time... ahhh make that so little time, so much to
do.

Hubertus Franke
Enterprise Linux Group (Mgr),  Linux Technology Center (Member Scalability)
, OS-PIC (Chair)
email: [EMAIL PROTECTED]
(w) 914-945-2003(fax) 914-945-4425   TL: 862-2003



[EMAIL PROTECTED] on 01/19/2001 11:33:34 AM

To:   Hubertus Franke/Watson/IBM@IBMUS
cc:   David Lang <[EMAIL PROTECTED]>, Mike Kravetz
  <[EMAIL PROTECTED]>, Andrea Arcangeli <[EMAIL PROTECTED]>,
  [EMAIL PROTECTED], [EMAIL PROTECTED]
Subject:  Re: [Lse-tech] Re: multi-queue scheduler update



You might want to rerun the tests with less cache heavy procs.  The 2meg
xeons you are using could distort things from what the average linux user
would see (running with 256-512k cache).
     Nick

On Fri, 19 Jan 2001, Hubertus Franke wrote:

>
> Sure, we are measuring that as well.
> We are running all these benchmarks and configurations that I mentioned
in
> my previous message on
> 1-2-4-6- and 8 way configurations.
> We have posted some preliminary results on older kernels on the website:
>
> http://lse.sourceforge.net/scheduling/prelim.html
>
> MQ scheduler is meaningless for a UP kernel that is only build under the
> SMP flag.
> The priority==tablebased scheduler does make sense to run on a UP (i.e.
not
> SMP compiled) kernel.
> Some more fine-tuning of the current code base might improve that case,
> because affinity is not a concern
> I can simply go to my top table hash, retrieve the first P entry with
> !P->has_cpu and I am ready to go.
>
> Hubertus Franke
> Enterprise Linux Group (Mgr),  Linux Technology Center (Member
Scalability)
> , OS-PIC (Chair)
> email: [EMAIL PROTECTED]
> (w) 914-945-2003(fax) 914-945-4425   TL: 862-2003
>
>
>
> David Lang <[EMAIL PROTECTED]>@lists.sourceforge.net on 01/19/2001
> 11:06:37 AM
>
> Sent by:  [EMAIL PROTECTED]
>
>
> To:   Mike Kravetz <[EMAIL PROTECTED]>
> cc:   Andrea Arcangeli <[EMAIL PROTECTED]>,
<[EMAIL PROTECTED]>,
>   <[EMAIL PROTECTED]>
> Subject:  Re: [Lse-tech] Re: multi-queue scheduler update
>
>
>
> another thing that would be interesting is what is the overhead on UP or
> small (2-4 way) SMP machines
>
> David Lang
>
> On Thu, 18 Jan 2001, Mike Kravetz wrote:
>
> > Date: Thu, 18 Jan 2001 16:52:25 -0800
> > From: Mike Kravetz <[EMAIL PROTECTED]>
> > To: Andrea Arcangeli <[EMAIL PROTECTED]>
> > Cc: [EMAIL PROTECTED], [EMAIL PROTECTED]
> > Subject: Re: [Lse-tech] Re: multi-queue scheduler update
> >
> > On Fri, Jan 19, 2001 at 01:26:16AM +0100, Andrea Arcangeli wrote:
> > > On Thu, Jan 18, 2001 at 03:53:11PM -0800, Mike Kravetz wrote:
> > > > Here are some very preliminary numbers from sched_test_yield
> > > > (which was previously posted to this (lse-tech) list by Bill
> > > > Hartner).  Tests were run on a system with 8 700 MHz Pentium
> > > > III processors.
> > > >
> > > >             microseconds/yield
> > > > # threads  2.2.16-22   2.4     2.4-multi-queue
> > > > ---------  ---------   -----   ---------------
> > > > 16         18.740      4.603   1.455
> > >
> > > I remeber the O(1) scheduler from Davide Libenzi was beating the
> mainline O(N)
> > > scheduler with over 7 tasks in the runqueue (actually I'm not sure if
> the
> > > number was 7 but certainly it was under 10). So if you also use a
O(1)
> > > scheduler too as I guess (since you have a chance to run fast on the
> lots of
> > > tasks running case) the most interesting thing is how you score with
> 2/4/8
> > > tasks in the runqueue (I think the tests on the O(1) scheduler patch
> was done
> > > at max on a 2-way SMP btw). (the argument for which Davide's patch
> wasn't
> > > included is that most machines have less than 4/5 tasks in t

Re: [Lse-tech] Re: multi-queue scheduler update

2001-01-19 Thread Hubertus Franke


Sure, we are measuring that as well.
We are running all the benchmarks and configurations that I mentioned in
my previous message on 1-, 2-, 4-, 6- and 8-way configurations.
We have posted some preliminary results on older kernels on the website:

http://lse.sourceforge.net/scheduling/prelim.html

The MQ scheduler is meaningless for a UP kernel; it is only built under the
SMP flag.
The priority==tablebased scheduler does make sense to run on a UP (i.e. not
SMP compiled) kernel.
Some more fine-tuning of the current code base might improve that case,
because affinity is not a concern:
I can simply go to my top table hash, retrieve the first P entry with
!P->has_cpu and I am ready to go.
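
For illustration only, a user-space sketch of that lookup (the types and
the table layout here are made up, not the patch's actual data structures):

#include <stdio.h>

#define MAX_GOODNESS 64

struct task {
        int has_cpu;            /* 1 if currently running on some cpu */
        struct task *next;      /* next task in the same goodness bucket */
        const char *name;
};

/* Table indexed by goodness; scan from the highest non-empty bucket and
 * take the first entry that is not already running on a cpu. */
static struct task *bucket[MAX_GOODNESS + 1];

static struct task *pick_next(void)
{
        int g;
        struct task *p;

        for (g = MAX_GOODNESS; g >= 0; g--)
                for (p = bucket[g]; p; p = p->next)
                        if (!p->has_cpu)
                                return p;
        return NULL;            /* nothing runnable: go idle */
}

int main(void)
{
        struct task a = { 1, NULL, "A" };       /* already on a cpu */
        struct task b = { 0, &a, "B" };         /* runnable, not running */
        struct task *next;

        bucket[40] = &b;
        next = pick_next();
        printf("next: %s\n", next ? next->name : "idle");
        return 0;
}

On UP there is no affinity to weigh, so stopping at the first !has_cpu
entry in the top bucket is all the work the scheduler has to do.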

Hubertus Franke
Enterprise Linux Group (Mgr),  Linux Technology Center (Member Scalability)
, OS-PIC (Chair)
email: [EMAIL PROTECTED]
(w) 914-945-2003(fax) 914-945-4425   TL: 862-2003



David Lang <[EMAIL PROTECTED]>@lists.sourceforge.net on 01/19/2001
11:06:37 AM

Sent by:  [EMAIL PROTECTED]


To:   Mike Kravetz <[EMAIL PROTECTED]>
cc:   Andrea Arcangeli <[EMAIL PROTECTED]>, <[EMAIL PROTECTED]>,
  <[EMAIL PROTECTED]>
Subject:  Re: [Lse-tech] Re: multi-queue scheduler update



another thing that would be interesting is what is the overhead on UP or
small (2-4 way) SMP machines

David Lang

On Thu, 18 Jan 2001, Mike Kravetz wrote:

> Date: Thu, 18 Jan 2001 16:52:25 -0800
> From: Mike Kravetz <[EMAIL PROTECTED]>
> To: Andrea Arcangeli <[EMAIL PROTECTED]>
> Cc: [EMAIL PROTECTED], [EMAIL PROTECTED]
> Subject: Re: [Lse-tech] Re: multi-queue scheduler update
>
> On Fri, Jan 19, 2001 at 01:26:16AM +0100, Andrea Arcangeli wrote:
> > On Thu, Jan 18, 2001 at 03:53:11PM -0800, Mike Kravetz wrote:
> > > Here are some very preliminary numbers from sched_test_yield
> > > (which was previously posted to this (lse-tech) list by Bill
> > > Hartner).  Tests were run on a system with 8 700 MHz Pentium
> > > III processors.
> > >
> > >             microseconds/yield
> > > # threads  2.2.16-22   2.4     2.4-multi-queue
> > > ---------  ---------   -----   ---------------
> > > 16         18.740      4.603   1.455
> >
> > I remeber the O(1) scheduler from Davide Libenzi was beating the
mainline O(N)
> > scheduler with over 7 tasks in the runqueue (actually I'm not sure if
the
> > number was 7 but certainly it was under 10). So if you also use a O(1)
> > scheduler too as I guess (since you have a chance to run fast on the
lots of
> > tasks running case) the most interesting thing is how you score with
2/4/8
> > tasks in the runqueue (I think the tests on the O(1) scheduler patch
was done
> > at max on a 2-way SMP btw). (the argument for which Davide's patch
wasn't
> > included is that most machines have less than 4/5 tasks in the runqueue
at the
> > same time)
> >
> > Andrea
>
> Thanks for the suggestion.  The only reason I hesitated to test with
> a small number of threads is because I was under the assumption that
> this particular benchmark may have problems if the number of threads
> was less than the number of processors.  I'll give the tests a try
> with a smaller number of threads.  I'm also open to suggestions for
> what benchmarks/test methods I could use for scheduler testing.  If
> you remember what people have used in the past, please let me know.
>
> --
> Mike Kravetz [EMAIL PROTECTED]
> IBM Linux Technology Center
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in
> the body of a message to [EMAIL PROTECTED]
> Please read the FAQ at http://www.tux.org/lkml/
>

___
Lse-tech mailing list
[EMAIL PROTECTED]
http://lists.sourceforge.net/lists/listinfo/lse-tech



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Lse-tech] Re: multi-queue scheduler update

2001-01-19 Thread Hubertus Franke



Indeed, Andi, we tried that priority==tablebased scheduler approach. If
you check the call for participation again, what we are trying to do is to
get to the bottom of what actually impacts scheduler performance and
subsequently come up with a combined best of breed (i.e. one that satisfies
both the high end and the low end). Since this is still work in progress,
here are a few numbers that I got from running the 2.4.0-test12 kernels for
vanilla and priority-based, complementing Mike's numbers.
I add these as extra columns to Mike's table. Our machine is an 8-way
700 MHz Pentium with 2MB caches, though I don't think it makes a difference
for the sched_yield test. I ran with 50 seconds of runtime per test to get
by the FRC problem.

                           microseconds/yield
#threads   2.2.16-22      2.4    2.4-MQ   2.4.0-test12   2.4.0-test12-Prio
--------   ---------   -------   ------   ------------   -----------------
      16      18.740     4.603    1.455          4.51          4.39
      32      17.702     5.134    1.456          5.01          4.06
      64      23.300     5.586    1.466          5.70          3.99
     128      47.273    18.812    1.480         12.06 %        3.99
     256     105.701    71.147    1.517         60.2           4.05
     512         FRC   143.500    1.661        132.5           4.19
    1024         FRC   196.425    6.166        295.4 #         4.57
    2048         FRC       FRC   23.291        460.4           5.34
    4096         FRC       FRC   47.117        631.3           5.91

FRC = failed to reach confidence level

Some comments on some of the numbers:
#) Mike measured 196, I measured 295?? Somebody has a typo here, I assume.
%) This actually varied between 8 and 14 on multiple runs, averaging 12.
Bill Hartner suggests that these might be cache issues (OT).

What you can see from these numbers is that MQ does an awesome job up to
1024 threads. When measuring in the future, we will take the general
concern about low thread counts into account. Your points are well taken.
I'm pretty confident our MQ scheduler will be in a reasonable ballpark of
the current scheduler. Moving on, the priority==tablebased scheduler does
better for very high numbers of processes. It actually beats the vanilla
version throughout (>= 16). It stays stable because we stop immediately
when we find a process that last ran on the invoking cpu. The only way we
could do better is to continue searching for a bigger affinity boost. Here
the discussions might start. The next version of the tablebased scheduler
will take into account whether a table index covers only one goodness range
or multiple (e.g. RT). This could give some better performance for the
general case.

The roadmap ahead for Mike, me, and the rest of the crew is to combine
these methods. In our first attempt we wanted to demonstrate that MQ does a
great job while emulating the current scheduler semantics. If we now relax
these semantics just a bit, e.g. tolerate a bit more priority inversion
(which any scheduler that deploys affinity boosts does), we can probably do
even better.

These are the things we are currently doing; we should have some results
soon:

(1) We are preparing for LWE with a full measurement of the latest kernel.
For this purpose we have frozen on 2.4.1-pre8, unless of course you are
telling us this is not a good kernel to run on.
(2) We will measure 1-4096 threads for vanilla, priority and MQ with two
tests (both provided by Bill Hartner in Austin):
 (a) sched_yield - although not a meaningful benchmark, it really
     exposes the raw overhead of scheduling. The problem is that it
     artificially generates lock contention at a rate we would not
     see in general applications.
 (b) chatroom - similar to VolanoBenchmark, but easier to use and
     measure. This gives a better idea of what the impact would be
     for real applications.

On the progress side: now that we have a good idea of what the MQ and the
table==priority based scheduler can do, we want to combine them and see how
that impacts performance. We also still have the open issue of whether
keeping queues in priority order makes sense or not. That exercise should
be done for both the MQ and the table based scheduler.

Next, we have started looking into breaking up the CPU set. Right now we
scan all CPUs to find an appropriate CPU to preempt. For a large number of
CPUs that can be costly, particularly with very few (1-4) threads.
We are currently experimenting with breaking the CPUs up into smaller sets
and scheduling only within each set, i.e. we don't look beyond the set to
balance (priorities etc.). Occasionally (at 1 HZ) we run a load balancing
mechanism to redistribute work.
We have a simple prototype running that demonstrates the idea. This could
also be useful for NUMA systems. We will post this patch on top of MQ on
the site soon.
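
As a rough illustration of the idea, here is a user-space sketch (the
names, the set size and the balancing policy are made up for illustration;
this is not the prototype's code):

#include <stdio.h>

#define NCPUS    8
#define SET_SIZE 4              /* cpus per scheduling set (assumed) */

static int runnable[NCPUS];     /* runnable tasks queued per cpu */

/* Scheduling decisions only look at the cpus in the caller's set. */
static int pick_cpu_in_set(int cpu)
{
        int start = (cpu / SET_SIZE) * SET_SIZE;
        int i, best = start;

        for (i = start; i < start + SET_SIZE; i++)
                if (runnable[i] < runnable[best])
                        best = i;
        return best;
}

/* Run at a coarse interval (e.g. 1 HZ): move one task from the most
 * loaded cpu to the least loaded one, across set boundaries. */
static void rebalance(void)
{
        int i, busiest = 0, idlest = 0;

        for (i = 1; i < NCPUS; i++) {
                if (runnable[i] > runnable[busiest])
                        busiest = i;
                if (runnable[i] < runnable[idlest])
                        idlest = i;
        }
        if (runnable[busiest] > runnable[idlest] + 1) {
                runnable[busiest]--;
                runnable[idlest]++;
        }
}

int main(void)
{
        runnable[0] = 3; runnable[1] = 2;       /* set 0 (cpus 0-3) loaded  */
        runnable[2] = 1; runnable[3] = 1;       /* set 1 (cpus 4-7) is idle */
        printf("wakeup near cpu 1 goes to cpu %d\n", pick_cpu_in_set(1));
        rebalance();
        printf("after rebalance: cpu0=%d cpu4=%d\n", runnable[0], runnable[4]);
        return 0;
}

The per-wakeup path only ever touches one set; the only cross-set work is
the occasional rebalance, which is also what would make the scheme
interesting for NUMA.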