subject:"Re\: a quest for a better scheduler"



I give you a concrete example:

Running DB2 on an SMP system.

In DB2 there is a processes/thread pool that is sized based
on memory and numcpus. People tell me that the size of this pool
is in the order of 100s for an 8-way system with reasonable
sized database. These  determine the number of agents
that can simultaneously execute an SQL statement.

Requests are flying in for transactions (e.g. driven by TPC-W like
applications). The agents are grepped from the pool and concurrently
fire the SQL transactions.
Assuming that there is enough concurrency in the database, there is
no reason to believe that the majority of those active agents is
not effectively running. TPC-W loads have observed 100 of active
transactions at a time.

Ofcourse limiting the number of agents would reduce concurrently
running tasks, but would limit the responsiveness of the system.
Implementing a database in the kernel ala TUX doesn't seem to be
the right approach either (complexity, fault containment, ...)

Hope that is one example people accept.

I can dig up some information on WebSphere Applications.

I'd love to hear from some other applications that fall into
a similar category as the above and substantiate a bit the need
for 100s of running processes, without claiming that the
application is broke.

Hubertus Franke
Enterprise Linux Group (Mgr),  Linux Technology Center (Member Scalability)
, OS-PIC (Chair)
email: [EMAIL PROTECTED]
(w) 914-945-2003(fax) 914-945-4425   TL: 862-2003



Mark Hahn <[EMAIL PROTECTED]> on 04/04/2001 02:28:42 PM

To:   Hubertus Franke/Watson/IBM@IBMUS
cc:
Subject:  Re: a quest for a better scheduler



> ok if the runqueue length is limited to a very small multiple of the
#cpus.
> But that is not what high end server systems encounter.

do you have some example of this in mind?  so far, noone has
actually produced an example of a "high end" server that has
long runqueues.




-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Lse-tech] Re: a quest for a better scheduler

On Wed, Apr 04, 2001 at 10:49:04AM -0700, Kanoj Sarcar wrote:
> Imagine that most of the program's memory is on node 1, it was scheduled
> on node 2 cpu 8 momentarily (maybe because kswapd ran on node 1, other
> higher priority processes took over other cpus on node 1, etc). 
> 
> Then, your patch will try to keep the process on node 2, which is not
> neccessarily the best solution. Of course, as I mentioned before, if
> you have a node local cache on node 2, that cache might have been warmed
> enough to make scheduling on node 2 a good option. 

Exactly it made it a good option, and more important this heuristic can
only improve performance if compared to the mainline scheduler.

Infact I tried to reschedule the task back to its original node and it dropped
performance, anyways I cannot say to have done an extensive research on that, I
believe if we take care of some more history of the node migration we may
understand it's right time to push it back to its original node but that was
not obvious and I wanted a simple solution to boost the performance under CPU
bound load to start with.

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Lse-tech] Re: a quest for a better scheduler


> 
> It helps by keeping the task in the same node if it cannot keep it in
> the same cpu anymore.
> 
> Assume task A is sleeping and it last run on cpu 8 node 2. It gets a wakeup
> and it gets running and for some reason cpu 8 is busy and there are other
> cpus idle in the system. Now with the current scheduler it can be moved in any
> cpu in the system, with the numa sched applied we will try to first reschedule
> it in the idles cpus of node 2 for example. The per-node runqueue are mainly
> necessary to implement the heuristic.
>

Yes. But this is not the best solution, if I can add on to the example
and make some assumptions.

Imagine that most of the program's memory is on node 1, it was scheduled
on node 2 cpu 8 momentarily (maybe because kswapd ran on node 1, other
higher priority processes took over other cpus on node 1, etc). 

Then, your patch will try to keep the process on node 2, which is not
neccessarily the best solution. Of course, as I mentioned before, if
you have a node local cache on node 2, that cache might have been warmed
enough to make scheduling on node 2 a good option. 

I am not saying there is a wrong or right answer, there are so many
possibilities, everything probably works and breaks under different
circumstances. 

Btw, while we are swapping patches, the patch at

http://oss.sgi.com/projects/numa/download/sched242.patch

tries to implement per-arch scheduling. The current scheduler behavior
is smp_arch_goodness() and smp_pick_cpu(), but the patch allows the
possibility for a specific platform to change that to something else. 

Linus has seen this patch, and agrees to it in principle. He does not
consider this 2.4 material though. Of course, I am completely open to
Ingo (or someone else) coming up with a different way of providing the
same freedom to arch specific code.

Kanoj
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Lse-tech] Re: a quest for a better scheduler

2001-04-04 Thread Paul McKenney



> Just a quick comment. Andrea, unless your machine has some hardware
> that imply pernode runqueues will help (nodelevel caches etc), I fail
> to understand how this is helping you ... here's a simple theory though.
> If your system is lightly loaded, your pernode queues are actually
> implementing some sort of affinity, making sure processes stick to
> cpus on nodes where they have allocated most of their memory on. I am
> not sure what the situation will be under huge loads though.

Exactly.  If a given task has run on a particular nodes for a while,
its memory will tend to be allocated on that node.  So preferentially
running it on another CPU on that same node should get you better
memory latency than would running it on some other node's CPUs.

In addition, continuing to run the task on a particular node means
that more of that task's memory is from that node, which again means
good memory latency.  In contrast, if you move a task back and forth
between nodes, it can end up with its memory spread over many nodes,
which means that it does not get good memory latency no matter where
you run it.

  Thanx, Paul

> As I have mentioned to some people before,
percpu/pernode/percpuset/global
> runqueues probably all have their advantages and disadvantages, and their
> own sweet spots. Wouldn't it be really neat if a system administrator
> or performance expert could pick and choose what scheduler behavior he
> wants, based on how the system is going to be used?
>
> Kanoj

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Lse-tech] Re: a quest for a better scheduler



Correct, that's true.

Our patch does various things.
(a) limit search for a task to a admin specified set of cpu's
during schedule()..
(b) limits search for a preemptable task to another set of cpu's
during reschedule_idle()
 
(c) loadbalancing, i.e. moving from queue to queue.
Currently we balance within a set and across sets.

Obviously in NUMA one could specify
 (a) such that multiple sets fall into the same node
 no node crossings.
 (b) specify this set to at least span a node
 (c) do some intelligent moving based on memory maps
  etc.

I guess (c) would be first instance on where to plug architecture
dependent information, e.g. how much memory footprint does a task
have on a particular node and how much would the moving cost.
The loadbalance we provide is a simple sceleton to tickle you mind,
not a solution. Nevertheless, one can see it can have some impact.

See for results for various combinations of poolsizes and balancings:
http://lse.sourceforge.net/scheduling/results012501/status.html#Load%20Balancing


Hubertus Franke
Enterprise Linux Group (Mgr),  Linux Technology Center (Member Scalability)

email: [EMAIL PROTECTED]
(w) 914-945-2003(fax) 914-945-4425   TL: 862-2003



Kanoj Sarcar <[EMAIL PROTECTED]> on 04/04/2001 01:14:28 PM

To:   Hubertus Franke/Watson/IBM@IBMUS
cc:   [EMAIL PROTECTED] (Linux Kernel List),
  [EMAIL PROTECTED]
Subject:  Re: [Lse-tech] Re: a quest for a better scheduler



>
>
>
> Kanoj, our cpu-pooling + loadbalancing allows you to do that.
> The system adminstrator can specify at runtime through a
> /proc filesystem interface the cpu-pool-size, whether loadbalacing
> should take place.

Yes, I think this approach can support the various requirements
put on the scheduler.

I think there are two degrees of freedom that are needed in the
scheduler. One, as you say, for the sysadmin to be able to specify
what overall scheduler behavior he wants.

Secondly, from the kernel standpoint, there needs to be perarch
hooks, to be able to utilize nodelevel/multilevel caches, NUMA
aspects etc.

Kanoj



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: a quest for a better scheduler

2001-04-04 Thread Mike Kravetz

On Tue, Apr 03, 2001 at 09:21:57PM -0700, Fabio Riccardi wrote:
> I was actually suspecting that the extra lines in your patch were there for a
> reason :)
> 
> A few questions:
> 
> What is the real impact of a (slight) change in scheduling semantics?
> 
> Under which situation one should notice a difference?

I assume your questions are directed at the difference between local
and global scheduling decisions.  As with most things the impact of
these differences depends on workload.  Most multi-queue scheduler
implementations make local scheduling decisions and depend on load
balancing for system wide fairness.  Schedulers which make global
decisions handle load balancing via their global policy.

The HP multi-queue implementation you are using performs no load
balancing.  Therefore, in a worst case situation you could have
low priority tasks sharing one CPU while high priority tasks are
sharing another.  To be fair, I have talked to the HP people and
this scheduler was never intended to be a general purpose solution.
Therefore, it makes little sense to comment its merits as such.

> As you state in your papers the global decision comes with a cost,
> is it worth it?

My primary motivation for attempting to perform the same global
decisions as the current scheduler was so that meaningful comparisons
could be made.  Also, by using the same global decision policy I
was able to avoid the issue of load balancing.  In most multi-queue
implementations, load balancing algorithms take considerable effort
to get working in a reasonable well performing manner.

> 
> Could you make a port of your thing on recent kernels?

There is a 2.4.2 patch on the web page.  I'll put out a 2.4.3 patch
as soon as I get some time.

-- 
Mike Kravetz [EMAIL PROTECTED]
IBM Linux Technology Center
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: a quest for a better scheduler

Well put, this how we can eliminate searching all bins or lists and that's
how we do it under.
http://lse.sourceforge.net/scheduling/2.4.1-pre8-prioSched.

If you have a list per priority level, then you can even pick the first one
you find if its on
the same level. That's what I tried in a more recent implementation. Also
combined that
with using a bitmask to represent non-empty tasks.

Hubertus Franke
Enterprise Linux Group (Mgr),  Linux Technology Center (Member Scalability)
, OS-PIC (Chair)
email: [EMAIL PROTECTED]
(w) 914-945-2003(fax) 914-945-4425   TL: 862-2003

Davide Libenzi <[EMAIL PROTECTED]>@ewok.dev.mycio.com on 04/04/2001
12:12:54 PM

Sent by:  [EMAIL PROTECTED]

To:   Ingo Molnar <[EMAIL PROTECTED]>
cc:   Linus Torvalds <[EMAIL PROTECTED]>, Alan Cox
  <[EMAIL PROTECTED]>, Linux Kernel List
  <[EMAIL PROTECTED]>, Hubertus Franke/Watson/IBM@IBMUS,
  Mike Kravetz <[EMAIL PROTECTED]>, Fabio Riccardi
  <[EMAIL PROTECTED]>
Subject:  Re: a quest for a better scheduler

On 04-Apr-2001 Ingo Molnar wrote:
>
> On Tue, 3 Apr 2001, Fabio Riccardi wrote:
>
>> I've spent my afternoon running some benchmarks to see if MQ patches
>> would degrade performance in the "normal case".
>
> no doubt priority-queue can run almost as fast as the current scheduler.
> What i'm worried about is the restriction of the 'priority' of processes,
> it cannot depend on previous processes (and other 'current state')
> anymore.
>
> to so we have two separate issues:
>
>#1: priority-queue: has the fundamental goodness() design limitation.
>
>#2: per-CPU-runqueues: changes semantics, makes scheduler less
> effective due to nonglobal decisions.
>
> about #1: while right now the prev->mm rule appears to be a tiny issue
(it
> might not affect performance significantly), but forbidding it by
> hardcoding the assumption into data structures is a limitation of
*future*
> goodness() functions. Eg. with the possible emergence of CPU-level
> threading and other, new multiprocessing technologies, this could be a
> *big* mistake.

This is not correct Ingo. I haven't seen the HP code but if You store
processes
in slots S :

S = FS( goodness(p, p->processor, p->mm) )

and You start scanning from the higher slots, as soon as you find a task
with a
goodness G' that is equal to the max goodness in slot You can choose that
process to run.
Again, if You haven't found such a goodness during the slot scan but You've
found a task with a goodness G' :

G' >= SG - DD

where :

SG = max slot goodness
DD = SG(i) - SG(i - 1)

You can select that task as the next to spin.
This was the behaviour that was implemented in my scheduler patch about 2
years
ago.
Beside this, I this that with such loads We've more serious problem to face
with inside the kernel.

- Davide

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Lse-tech] Re: a quest for a better scheduler

On Wed, Apr 04, 2001 at 09:50:58AM -0700, Kanoj Sarcar wrote:
> > 
> > I didn't seen anything from Kanoj but I did something myself for the wildfire:
> > 
> > 
>ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.3aa1/10_numa-sched-1
> > 
> > this is mostly an userspace issue, not really intended as a kernel optimization
> > (however it's also partly a kernel optimization). Basically it splits the load
> > of the numa machine into per-node load, there can be unbalanced load across the
> > nodes but fairness is guaranteed inside each node. It's not extremely well
> > tested but benchmarks were ok and it is at least certainly stable.
> >
> 
> Just a quick comment. Andrea, unless your machine has some hardware
> that imply pernode runqueues will help (nodelevel caches etc), I fail 
> to understand how this is helping you ... here's a simple theory though. 

It helps by keeping the task in the same node if it cannot keep it in
the same cpu anymore.

Assume task A is sleeping and it last run on cpu 8 node 2. It gets a wakeup
and it gets running and for some reason cpu 8 is busy and there are other
cpus idle in the system. Now with the current scheduler it can be moved in any
cpu in the system, with the numa sched applied we will try to first reschedule
it in the idles cpus of node 2 for example. The per-node runqueue are mainly
necessary to implement the heuristic.

> cpus on nodes where they have allocated most of their memory on. I am 
> not sure what the situation will be under huge loads though.

after all cpus are busy we try to reschedule only on the cpus of the local
node, that's why it can generate some unbalance yes, but it will tend to
rebalance over the time because some node will end with all tasks with
zero counter first if it's less loaded, and so then it will start
getting tasks with has_cpu 0 in the runqueues out of other nodes.

You may want to give it a try on your machines and see what difference it
makes, I'd be curious to know of course.

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Lse-tech] Re: a quest for a better scheduler


> 
> 
> 
> Kanoj, our cpu-pooling + loadbalancing allows you to do that.
> The system adminstrator can specify at runtime through a
> /proc filesystem interface the cpu-pool-size, whether loadbalacing
> should take place.

Yes, I think this approach can support the various requirements
put on the scheduler.

I think there are two degrees of freedom that are needed in the
scheduler. One, as you say, for the sysadmin to be able to specify
what overall scheduler behavior he wants. 

Secondly, from the kernel standpoint, there needs to be perarch
hooks, to be able to utilize nodelevel/multilevel caches, NUMA
aspects etc.

Kanoj
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Lse-tech] Re: a quest for a better scheduler




Kanoj, our cpu-pooling + loadbalancing allows you to do that.
The system adminstrator can specify at runtime through a
/proc filesystem interface the cpu-pool-size, whether loadbalacing
should take place.
We can put limiting to the local cpu-set during reschedule_idle
back into the code, to make it complete and compatible with
the approach that Andrea has taken.
This way, one can fully isolate or combine cpu-sets.

here is the code for the pooling.
http://lse.sourceforge.net/scheduling/LB/2.4.1-MQpool

loadbalancing and /proc system combined in this module.
http://lse.sourceforge.net/scheduling/LB/loadbalance.c

a writeup explaining this concept is available under
http://lse.sourceforge.net/scheduling/LB/poolMQ.html

Prerequisite is the MQ scheduler...
http://lse.sourceforge.net/scheduling/2.4.1.mq1-sched

We need to update these for 2.4.3  (coming)

Hubertus Franke
Enterprise Linux Group (Mgr),  Linux Technology Center (Member Scalability)
, OS-PIC (Chair)
email: [EMAIL PROTECTED]
(w) 914-945-2003(fax) 914-945-4425   TL: 862-2003



Kanoj Sarcar <[EMAIL PROTECTED]>@lists.sourceforge.net on
04/04/2001 12:50:58 PM

Sent by:  [EMAIL PROTECTED]


To:   [EMAIL PROTECTED] (Andrea Arcangeli)
cc:   [EMAIL PROTECTED] (Ingo Molnar), Hubertus Franke/Watson/IBM@IBMUS,
  [EMAIL PROTECTED] (Mike Kravetz), [EMAIL PROTECTED] (Fabio
  Riccardi), [EMAIL PROTECTED] (Linux Kernel List),
  [EMAIL PROTECTED]
Subject:  Re: [Lse-tech] Re: a quest for a better scheduler



>
> I didn't seen anything from Kanoj but I did something myself for the
wildfire:
>
>
ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.3aa1/10_numa-sched-1

>
> this is mostly an userspace issue, not really intended as a kernel
optimization
> (however it's also partly a kernel optimization). Basically it splits the
load
> of the numa machine into per-node load, there can be unbalanced load
across the
> nodes but fairness is guaranteed inside each node. It's not extremely
well
> tested but benchmarks were ok and it is at least certainly stable.
>

Just a quick comment. Andrea, unless your machine has some hardware
that imply pernode runqueues will help (nodelevel caches etc), I fail
to understand how this is helping you ... here's a simple theory though.
If your system is lightly loaded, your pernode queues are actually
implementing some sort of affinity, making sure processes stick to
cpus on nodes where they have allocated most of their memory on. I am
not sure what the situation will be under huge loads though.

As I have mentioned to some people before, percpu/pernode/percpuset/global
runqueues probably all have their advantages and disadvantages, and their
own sweet spots. Wouldn't it be really neat if a system administrator
or performance expert could pick and choose what scheduler behavior he
wants, based on how the system is going to be used?

Kanoj

___
Lse-tech mailing list
[EMAIL PROTECTED]
http://lists.sourceforge.net/lists/listinfo/lse-tech



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Lse-tech] Re: a quest for a better scheduler

On Wed, Apr 04, 2001 at 09:39:23AM -0700, Kanoj Sarcar wrote:
> example, for NUMA, we need to try hard to schedule a thread on the 
> node that has most of its memory (for no reason other than to decrease
> memory latency). Independently, some NUMA machines build in multilevel 
> caches and local snoops that also means that specific processors on
> the same node as the last_processor are also good candidates to run 
> the process next.

yes. That will probably need to be optional and choosen by the architecture
at compile time too. The probably most important factor to consider is the
penality of accessing remote memory, I think I can say on all recent and future
machines with a small difference between local and remote memory (and possibly
as you say with a decent cache protocol able to snoop cacheline data from the
other cpus even if they're not dirty) it's much better to always try to keep
the task in its last node. My patch is actually assuming recent machines and it
keeps the task in its last node if not in the last cpu and it keeps doing
memory allocation from there and it forgets about its original node where it
started allocating the memory from.  This provided the best performance during
userspace CPU bound load as far I can tell and it also better distribute the load.

Kanoj could you also have a look at the NUMA related common code MM fixes I did
in this patch? I'd like to get them integrated (just skip the arch/alpha/*
include/asm-alpha/* stuff while reading the patch, they're totally orthogonal).

ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.3aa1/00_alpha-numa-1

If you prefer I can extract them in a more finegrinded patch just dropping
the alpha stuff by hand.

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Lse-tech] Re: a quest for a better scheduler


> 
> I didn't seen anything from Kanoj but I did something myself for the wildfire:
> 
>   
>ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.3aa1/10_numa-sched-1
> 
> this is mostly an userspace issue, not really intended as a kernel optimization
> (however it's also partly a kernel optimization). Basically it splits the load
> of the numa machine into per-node load, there can be unbalanced load across the
> nodes but fairness is guaranteed inside each node. It's not extremely well
> tested but benchmarks were ok and it is at least certainly stable.
>

Just a quick comment. Andrea, unless your machine has some hardware
that imply pernode runqueues will help (nodelevel caches etc), I fail 
to understand how this is helping you ... here's a simple theory though. 
If your system is lightly loaded, your pernode queues are actually 
implementing some sort of affinity, making sure processes stick to 
cpus on nodes where they have allocated most of their memory on. I am 
not sure what the situation will be under huge loads though.

As I have mentioned to some people before, percpu/pernode/percpuset/global
runqueues probably all have their advantages and disadvantages, and their
own sweet spots. Wouldn't it be really neat if a system administrator
or performance expert could pick and choose what scheduler behavior he
wants, based on how the system is going to be used?

Kanoj
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Lse-tech] Re: a quest for a better scheduler

> 
> 
> On Wed, 4 Apr 2001, Hubertus Franke wrote:
> 
> > Another point to raise is that the current scheduler does a exhaustive
> > search for the "best" task to run. It touches every process in the
> > runqueue. this is ok if the runqueue length is limited to a very small
> > multiple of the #cpus. [...]
> 
> indeed. The current scheduler handles UP and SMP systems, up to 32
> (perhaps 64) CPUs efficiently. Agressively NUMA systems need a different
> approach anyway in many other subsystems too, Kanoj is doing some
> scheduler work in that area.

Actually, not _much_ work has been done in this area. Alongwith a bunch
of other people, I have some ideas about what needs to be done. For 
example, for NUMA, we need to try hard to schedule a thread on the 
node that has most of its memory (for no reason other than to decrease
memory latency). Independently, some NUMA machines build in multilevel 
caches and local snoops that also means that specific processors on
the same node as the last_processor are also good candidates to run 
the process next.

To handle a single layer of shared caches, I have tried certain simple
things, mostly as hacks, but am not pleased with the results yet. More
testing needed.

Kanoj

> 
> but the original claim was that the scheduling of thousands of runnable
> processes (which is not equal to having thousands of sleeping processes)
> must perform well - which is a completely different issue.
> 
>   Ingo
> 
> 
> ___
> Lse-tech mailing list
> [EMAIL PROTECTED]
> http://lists.sourceforge.net/lists/listinfo/lse-tech
> 

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: a quest for a better scheduler

2001-04-04 Thread Davide Libenzi



On 04-Apr-2001 Ingo Molnar wrote:
> 
> On Tue, 3 Apr 2001, Fabio Riccardi wrote:
> 
>> I've spent my afternoon running some benchmarks to see if MQ patches
>> would degrade performance in the "normal case".
> 
> no doubt priority-queue can run almost as fast as the current scheduler.
> What i'm worried about is the restriction of the 'priority' of processes,
> it cannot depend on previous processes (and other 'current state')
> anymore.
> 
> to so we have two separate issues:
> 
>#1: priority-queue: has the fundamental goodness() design limitation.
> 
>#2: per-CPU-runqueues: changes semantics, makes scheduler less
> effective due to nonglobal decisions.
> 
> about #1: while right now the prev->mm rule appears to be a tiny issue (it
> might not affect performance significantly), but forbidding it by
> hardcoding the assumption into data structures is a limitation of *future*
> goodness() functions. Eg. with the possible emergence of CPU-level
> threading and other, new multiprocessing technologies, this could be a
> *big* mistake.

This is not correct Ingo. I haven't seen the HP code but if You store processes
in slots S :

S = FS( goodness(p, p->processor, p->mm) )

and You start scanning from the higher slots, as soon as you find a task with a
goodness G' that is equal to the max goodness in slot You can choose that
process to run.
Again, if You haven't found such a goodness during the slot scan but You've
found a task with a goodness G' :

G' >= SG - DD

where :

SG = max slot goodness
DD = SG(i) - SG(i - 1)

You can select that task as the next to spin.
This was the behaviour that was implemented in my scheduler patch about 2 years
ago.
Beside this, I this that with such loads We've more serious problem to face
with inside the kernel.



- Davide

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Lse-tech] Re: a quest for a better scheduler

2001-04-04 Thread Christoph Hellwig


On Wed, Apr 04, 2001 at 09:44:22AM -0600, Khalid Aziz wrote:
> Let me stress that HP scheduler is not meant to be a replacement for the
> current scheduler. The HP scheduler patch allows the current scheduler
> to be replaced by another scheduler by loading a module in special
> cases.

HP also has a simple mq patch that is _not_ integrated into the pluggable
scheduler framework, I have used it myself.

Christoph

-- 
Of course it doesn't work. We've performed a software upgrade.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: a quest for a better scheduler

Andrea Arcangeli wrote:
> 
> On Wed, Apr 04, 2001 at 10:03:10AM -0400, Hubertus Franke wrote:
> > I understand the dilemma that the Linux scheduler is in, namely satisfy
> > the low end at all cost. [..]
> 
> We can satisfy the low end by making the numa scheduler at compile time (that's
> what I did in my patch at least).
> 
> Andrea

I fully agree with this approach. It would be very hard to design a
scheduler that performs equally well on a UP machine running couple of
processes and a NUMA machine. These two cases represent the two ends of
spectrum. The two schedulers should be separate IMO and one of them
should be selected at compile time.

--
Khalid

Khalid Aziz Linux Development Laboratory
(970)898-9214Hewlett-Packard
[EMAIL PROTECTED]Fort Collins, CO
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: a quest for a better scheduler

Hubertus Franke wrote:
> 
> This is an important point that Mike is raising and it also addresses a
> critique that Ingo issued yesterday, namely interactivity and fairness.
> The HP scheduler completely separates the per-CPU runqueues and does
> not take preemption goodness or alike into account. This can lead to
> unfair proportionment of CPU cycles, strong priority inversion and a
> potential
> lack of interactivity.
> 
> Our MQ scheduler does yield the same decision in most cases
> (other than defined by some race condition on locks and counter members)
> 

Let me stress that HP scheduler is not meant to be a replacement for the
current scheduler. The HP scheduler patch allows the current scheduler
to be replaced by another scheduler by loading a module in special
cases. HP is providing three different loadable scheduler modules -
Processor sets, Constant time scheduler, and Multi-runqueue scheduler.
Each one of these is geared towards a specific requirement. I would not
suggest using any of these for a generalized case. Processor sets
scheduler is designed to make scheduling decisions on a per-cpu basis
and not global basis. All we are trying to do is to make the current
scheduler modular so we CAN load an alternate scheduling policy module
in cases where the process mix requires a different scheduling policy or
the site policy require a different scheduling policy. An example of a
specific site processor allocation policy could be a site that runs a
database server on an MP machine along with a few other processes and
the administrator wants to guarantee that the database server process
always gets x percent of processing time or one CPU be dedicated to just
the database server. A policy like this is not meant to be fair and of
course, not a policy we want to impose upon others. The only HP changes
I would put in the kernel sources for general release would be the
changes to scheduler to allow such alternate (not necessarily fair or
the fastest for benchmarks, general process mix or 1000's of processes)
policies to be loaded. When a policy module is not loaded, scheduler
works exactly like the current scheduler even after HP patches. There
are people who could benefit from being able to load alternate policy
schedules. Fabio Ricardi happens to be one of them :-) Anyone who does
not want to even allow an alternate scheduler module to be loaded can
simply compile the alternate scheduler support out and that is how I
would expect most kernels to be compiled, especially the ones that ship
with various distributions. I would like the decision to include support
for alternate scheduler to be made by sys admins themselves for their
individual cases.

--
Khalid

Khalid Aziz Linux Development Laboratory
(970)898-9214Hewlett-Packard
[EMAIL PROTECTED]Fort Collins, CO
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: a quest for a better scheduler

You imply that high end means thousands of processes,
simply because we have shown that in our graphs as an
asymptotic end.
No, it could mean 5*#cpus and that is not all that absurd.
This could happen with a spike in demand.

TUX is not the greatest example to use, because it does
static webpage serving and is hence tied into the
file cache. If you move up the food chain, where middleware
is active and things are a bit more complicated than
sending stuff out of the filecache, having a bunch of threads
hanging around to deal with the spike in demand is common
practive, although you think its bad.

Now coming again to MQ (forget about priority list from
now on).

When we scan either the local or the realtime queues we do
use goodness value. So we have the same flexibility as the
current scheduler if it comes to goodness() flexibility and
future improvements.

For remote stealing we do use na_goodness to compare
for a better process to steal. Hence we would loose the "+1"
information here, nevertheless, once a decision
has been made, we still use preemption verification with goodness.
Eitherway, being off by "+1" for regular tasks once in a while
is no big deal, because this problem already exists today.
While walking the runqueue, another processor can either update
the counter value of task (ok that's covered by can_schedule)
or can run recalculate, which makes the decision that one is
about to make wrong from the point of always running the best.
But that's ok, because counter, nice etc. are approximations
anyway. Through in PROC_CHANGE_PENALTY and you have a few knobs
that are used to control interactivity and throughput.

If you look at some of the results with our reflex benchmark.
For the low thread count we basically show improved performance
on the 2,4,5,6,7,8 way system if #threads < #cpus, they all show
improvements. Look at the following numbers running the
reflex benchmark, left most columns are number of threads, with
typically 1/2 of them runnable.
You can clearly see that the priority list suffers from overhead,
but MQ is beating vanilla pretty much everywhere. Again, this
is due because of rapid scheduler invocation and resulting
lock contention.

  2-way
 2.4.1  2.4.1-mq1 2.4.1-prbit
46.725  4.691  7.387
86.326  4.766  6.421
12   6.838  5.233  6.431
16   7.13  5.415  7.29

  4-way
 2.4.1  2.4.1-mq1 2.4.1-prbit
49.42  7.873  10.592
88.143  3.799  8.691
12   7.877  3.537  8.101
16   7.688  2.953  7.144

  6-way
 2.4.1  2.4.1-mq1 2.4.1-prbit
49.595  7.88  10.358
89.703  7.278  10.523
12   10.0164.652  10.985
16   9.882  3.629  10.525

  8-way
 2.4.1  2.4.1-mq1 2.4.1-prbit
49.804  8.033  10.548
810.4365.783  11.475
12   10.9256.787  11.646
16   11.4265.048  11.877
20   11.4383.895  11.633
24   11.4573.304  11.347
28   11.4953.073  11.09
32   11.53  2.944  10.898

Hubertus Franke
Enterprise Linux Group (Mgr),  Linux Technology Center (Member Scalability)

email: [EMAIL PROTECTED]
(w) 914-945-2003(fax) 914-945-4425   TL: 862-2003

Ingo Molnar <[EMAIL PROTECTED]>@elte.hu> on 04/04/2001 09:23:34 AM

Please respond to <[EMAIL PROTECTED]>

Sent by:  <[EMAIL PROTECTED]>

To:   Hubertus Franke/Watson/IBM@IBMUS
cc:   Mike Kravetz <[EMAIL PROTECTED]>, Fabio Riccardi
  <[EMAIL PROTECTED]>, Linux Kernel List
  <[EMAIL PROTECTED]>
Subject:  Re: a quest for a better scheduler

On Wed, 4 Apr 2001, Hubertus Franke wrote:

> I understand the dilemma that the Linux scheduler is in, namely
> satisfy the low end at all cost. [...]

nope. The goal is to satisfy runnable processes in the range of NR_CPUS.
You are playing word games by suggesting that the current behavior prefers
'low end'. 'thousands of runnable processes' is not 'high end' at all,
it's 'broken end'. Thousands of runnable processes are the sign of a
broken application design, and 'fixing' the scheduler to perform better in
that case is just fixing the symptom. [changing the scheduler to perform
better in such situations is possible too, but all solutions proposed so
far had strings attached.]

 Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: a quest for a better scheduler

Yes, Andrea.

We actually already went a step further. We treat the scheduler
as a single entity, rather than splitting it up.

Based on the MQ scheduler we do the balancing across all nodes
at reschedule_idle time. We experimented to see whether only looking
for idle tasks remotely is a good idea, but it bloats the code.
Local scheduling decisions are limited to a set of cpus, which
could coincide with the cpus of one node, or if desirable on
smaller granularities.

In addition we implemented a global load balancing scheme that
ensures that load is equally distributed across all run queues.
This is made a loadable module, so you can plug in what ever
you want.

I grant in NUMA it might actually be desirable to separate
schedulers completely (we can do that trivially in reschedule_idle),
but you need loadbalancing at some point.

Here is the list of patches:

MultiQueue Scheduler: http://lse.sourceforge.net/scheduling/2.4.1.mq1-sched
Pooling Extension:http://lse.sourceforge.net/scheduling/LB/2.4.1-MQpool
LoadBalancing:
http://lse.sourceforge.net/scheduling/LB/loadbalance.c

Hubertus Franke
Enterprise Linux Group (Mgr),  Linux Technology Center (Member Scalability)
, OS-PIC (Chair)
email: [EMAIL PROTECTED]
(w) 914-945-2003(fax) 914-945-4425   TL: 862-2003

Andrea Arcangeli <[EMAIL PROTECTED]> on 04/04/2001 11:08:47 AM

To:   Ingo Molnar <[EMAIL PROTECTED]>
cc:   Hubertus Franke/Watson/IBM@IBMUS, Mike Kravetz
  <[EMAIL PROTECTED]>, Fabio Riccardi <[EMAIL PROTECTED]>, Linux
  Kernel List <[EMAIL PROTECTED]>,
  [EMAIL PROTECTED]
Subject:  Re: a quest for a better scheduler

On Wed, Apr 04, 2001 at 03:34:22PM +0200, Ingo Molnar wrote:
>
> On Wed, 4 Apr 2001, Hubertus Franke wrote:
>
> > Another point to raise is that the current scheduler does a exhaustive
> > search for the "best" task to run. It touches every process in the
> > runqueue. this is ok if the runqueue length is limited to a very small
> > multiple of the #cpus. [...]
>
> indeed. The current scheduler handles UP and SMP systems, up to 32
> (perhaps 64) CPUs efficiently. Agressively NUMA systems need a different
> approach anyway in many other subsystems too, Kanoj is doing some
> scheduler work in that area.

I didn't seen anything from Kanoj but I did something myself for the
wildfire:

ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.3aa1/10_numa-sched-1

this is mostly an userspace issue, not really intended as a kernel
optimization
(however it's also partly a kernel optimization). Basically it splits the
load
of the numa machine into per-node load, there can be unbalanced load across
the
nodes but fairness is guaranteed inside each node. It's not extremely well
tested but benchmarks were ok and it is at least certainly stable.

However Ingo consider that in a 32-way if you don't have at least 32 tasks
running all the time _always_ you're really stupid paying such big money
for
nothing ;). So the fact the scheduler is optimized for 1/2 tasks running
all
the time is not nearly enough for those machines (and of course also the
scheduling rate automatically increases linearly with the increase of the
number of cpus). Now it's perfectly fine that we don't ask the embedded and
desktop guys to pay for that, but a kernel configuration option to select
an
algorithm that scales would be a good idea IMHO. The above patch just adds
a
CONFIG_NUMA_SCHED. The scalable algorithm can fit into it and nobody will
be
hurted by that (CONFIG_NUMA_SCHED cannot even be selected by x86 compiles).

Andrea

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: a quest for a better scheduler


On Wed, Apr 04, 2001 at 10:03:10AM -0400, Hubertus Franke wrote:
> I understand the dilemma that the Linux scheduler is in, namely satisfy
> the low end at all cost. [..]

We can satisfy the low end by making the numa scheduler at compile time (that's
what I did in my patch at least).

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: a quest for a better scheduler

On Wed, Apr 04, 2001 at 03:34:22PM +0200, Ingo Molnar wrote:
> 
> On Wed, 4 Apr 2001, Hubertus Franke wrote:
> 
> > Another point to raise is that the current scheduler does a exhaustive
> > search for the "best" task to run. It touches every process in the
> > runqueue. this is ok if the runqueue length is limited to a very small
> > multiple of the #cpus. [...]
> 
> indeed. The current scheduler handles UP and SMP systems, up to 32
> (perhaps 64) CPUs efficiently. Agressively NUMA systems need a different
> approach anyway in many other subsystems too, Kanoj is doing some
> scheduler work in that area.

I didn't seen anything from Kanoj but I did something myself for the wildfire:

ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.3aa1/10_numa-sched-1

this is mostly an userspace issue, not really intended as a kernel optimization
(however it's also partly a kernel optimization). Basically it splits the load
of the numa machine into per-node load, there can be unbalanced load across the
nodes but fairness is guaranteed inside each node. It's not extremely well
tested but benchmarks were ok and it is at least certainly stable.

However Ingo consider that in a 32-way if you don't have at least 32 tasks
running all the time _always_ you're really stupid paying such big money for
nothing ;). So the fact the scheduler is optimized for 1/2 tasks running all
the time is not nearly enough for those machines (and of course also the
scheduling rate automatically increases linearly with the increase of the
number of cpus). Now it's perfectly fine that we don't ask the embedded and
desktop guys to pay for that, but a kernel configuration option to select an
algorithm that scales would be a good idea IMHO. The above patch just adds a
CONFIG_NUMA_SCHED. The scalable algorithm can fit into it and nobody will be
hurted by that (CONFIG_NUMA_SCHED cannot even be selected by x86 compiles).

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: a quest for a better scheduler

On Wed, 4 Apr 2001, Hubertus Franke wrote:

> It is not clear that yielding the same decision as the current
> scheduler is the ultimate goal to shoot for, but it allows
> comparision.

obviously the current scheduler is not cast into stone, it never was,
never will be.

but determining whether the current behavior is possible in a different
scheduler design is sure a good metric of how flexible that different
scheduler design is.

Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: a quest for a better scheduler

On Wed, 4 Apr 2001, Hubertus Franke wrote:

> I understand the dilemma that the Linux scheduler is in, namely
> satisfy the low end at all cost. [...]

nope. The goal is to satisfy runnable processes in the range of NR_CPUS.
You are playing word games by suggesting that the current behavior prefers
'low end'. 'thousands of runnable processes' is not 'high end' at all,
it's 'broken end'. Thousands of runnable processes are the sign of a
broken application design, and 'fixing' the scheduler to perform better in
that case is just fixing the symptom. [changing the scheduler to perform
better in such situations is possible too, but all solutions proposed so
far had strings attached.]

Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: a quest for a better scheduler

I grant you that the code is not as clean as the current scheduler,
so maybe you missed that part.

For the priority scheduler:
Yes the task_to_qid assumes a NON-affinity (no cpu, no mm) to determine
the list index to where the task has to be enqueued.
However, if you wonder down to the search_range section of the scheduler,
you see that we  do add the "+1" for equal mm. All I did here was take
the goodness() function a part for search_range in order to insert some
extra code that speeds up the code.

Again, I don't want to lean to hard on the priority scheduler, because
we only did this to have a reference point regarding lock contention and
lock hold. This is stuff that many operating systems did years ago, most
of which have moved on to add MQ to that. So that is what the LSE
scheduling team is pushing.

I understand the dilemma that the Linux scheduler is in, namely satisfy
the low end at all cost. But I believe that in order for Linux to push
into the high end we need to address the scalability of the current
scheduler.
Simply handwaving and declaring that lots of running tasks is a stupid
idea doesn't get us there.
If indeed you assume that there #running-tasks ~ #cpus then each task
should alot a reasonable amount of work and any small overhead incurred
should be amortizable. However as we contend if #running-tasks >> #cpus
is a situation we need to deal with, the living with the current lock
contention can really drag us down.

Hubertus Franke
Enterprise Linux Group (Mgr),  Linux Technology Center (Member Scalability)

email: [EMAIL PROTECTED]
(w) 914-945-2003(fax) 914-945-4425   TL: 862-2003

Ingo Molnar <[EMAIL PROTECTED]>@elte.hu> on 04/04/2001 02:28:31 AM

Please respond to <[EMAIL PROTECTED]>

Sent by:  <[EMAIL PROTECTED]>

To:   Mike Kravetz <[EMAIL PROTECTED]>
cc:   Hubertus Franke/Watson/IBM@IBMUS, Fabio Riccardi
  <[EMAIL PROTECTED]>, Linux Kernel List
  <[EMAIL PROTECTED]>
Subject:  Re: a quest for a better scheduler

On Tue, 3 Apr 2001, Mike Kravetz wrote:

> Our 'priority queue' implementation uses almost the same goodness
> function as the current scheduler.  The main difference between our
> 'priority queue' scheduler and the current scheduler is the structure
> of the runqueue.  We break up the single runqueue into a set of
> priority based runqueues. The 'goodness' of a task determines what
> sub-queue a task is placed in.  Tasks with higher goodness values are
> placed in higher priority queues than tasks with lower goodness
> values. [...]

we are talking about the same thing, re-read my mail. this design assumes
that 'goodness' is constant in the sense that it does not depend on the
previous process.

and no, your are not using the same goodness() function, you omitted the
prev->mm == next->mm component to goodness(), due to this design
limitation.

 Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: a quest for a better scheduler

This is an important point that Mike is raising and it also addresses a
critique that Ingo issued yesterday, namely interactivity and fairness.
The HP scheduler completely separates the per-CPU runqueues and does
not take preemption goodness or alike into account. This can lead to
unfair proportionment of CPU cycles, strong priority inversion and a
potential
lack of interactivity.

Our MQ scheduler does yield the same decision in most cases
(other than defined by some race condition on locks and counter members)

It is not clear that yielding the same decision as the current scheduler
is the ultimate goal to shoot for, but it allows comparision.

Another point to raise is that the current scheduler does a exhaustive
search
for the "best" task to run. It touches every process in the runqueue. this
is
ok if the runqueue length is limited to a very small multiple of the #cpus.
But that is not what high end server systems encounter.

With the rising number of processors, lock contention can quickly become
a bottleneck. If we assume that load (#running-task) increases somewhat
linear with #cpus, the problem gets even worth because not only have I
increased the number of clients but also the lock hold time.

Clinging on to the statement that #running-tasks ~ #cpus, ofcourse saves
you from that dilemma, but not everybody is signing on to this limitation.

MQ and priority-list help in 2 ways.

MQ reduces the average lock holdtime because on average it only inspects
#running-tasks/#cpus tasks to make a local decision. It then goes on to
inspect (#cpus-1) datastructures representing the next best to run tasks
on those remote cpus all without holding the lock, thus substantially
reducing lock contention. Note we still search the entire set of runnable
tasks, however we do it in a distributed collaborative manner.
The only time we deviate from the current scheduler decision is in the
case when two cpus have identified the same remote task as a target for
remote stealing. In that case one will win and the other cpu will continue
looking somewhere else, although there might have been another tasks
on that cpu to steal.

priority list schedulers (PRS) only helps in reducing lock hold time,
which can result in some relieve wrt lock contention, but not a whole lot.
PRS can limit the lists it has to search based on the PROC_CHANGE_PENALTY.
It also keeps 0-counter in a list that is never inspected. One can
even go further and put YIELD tasks in a separate list, given that the
sys_sched_yield already does some optimizations.
The older version (12/00) posted on LSE is functionally equivalent to the
current scheduler.
I will put up another version this week that is based on a bitmask and
which is a bit more agressive in that it ignores the MM goodness boost of 1
which in my books is merely a tie breaker between two task of equal
goodness.

Beyond that we have done some work on cpu pooling, which is to identify
a set of cpus that form a scheduling set. We still consider in
reschedule_idle all cpus for preemption but in schedule it is sufficient
to only schedule within the own set. That again can limit lock hold time
with MQ and we have seen some improvements. We also deploy load balacing.

To summarize, we have taken great care of trying to preserve the current
scheduler semantics and have laid out a path to relax some of the semantics
for further improvements.
I don't believe that the HP scheduler is sufficient since it is lacking
load balacing, which naturally occurs in our MQ scheduler, and it lacks
the interactivity requirements that Ingo pointed out.

Most of these things are discussed in great detail in the writeups under
lse.sourceforge.net/scheduling. I also believe we show there that the
MQ performance for low thread counts is also matching the vanilla case.

I further don't understand the obsession of keeping the scheduler simple.
If there are improvements and I don't believe the MQ is all that
complicated
and its well documented, why not put it in.

Hubertus Franke
Enterprise Linux Group (Mgr),  Linux Technology Center (Member Scalability)

email: [EMAIL PROTECTED]
(w) 914-945-2003(fax) 914-945-4425   TL: 862-2003

Mike Kravetz <[EMAIL PROTECTED]> on 04/03/2001 10:47:00 PM

To:   Fabio Riccardi <[EMAIL PROTECTED]>
cc:   Mike Kravetz <[EMAIL PROTECTED]>, Ingo Molnar <[EMAIL PROTECTED]>,
  Hubertus Franke/Watson/IBM@IBMUS, Linux Kernel List
  <[EMAIL PROTECTED]>, Alan Cox <[EMAIL PROTECTED]>
Subject:  Re: a quest for a better scheduler

On Tue, Apr 03, 2001 at 05:18:03PM -0700, Fabio Riccardi wrote:
>
> I have measured the HP and not the "scalability" patch because the two do
more
> or less the same thing and give me the same performance advantages, but
the
> former is a lot simpler and I could port it with no effort on any recent
> kernel.

Actually, there is a significant difference between the HP patch and
the one I de

Re: a quest for a better scheduler

On Wed, 4 Apr 2001, Alan Cox wrote:

> The problem has always been - alternative scheduler, crappier
> performance for 2 tasks running (which is most boxes). [...]

it's not only the 2-task case, but also less flexibility or lost
semantics.

> Indeed. I'd love to see you beat tux entirely in userspace. It proves
> the rest of the API for the kernel is right

well, until the cost of entry into the kernel is eliminated, this is not
possible - unless there are performance bugs in TUX :-)

but yes, getting a userspace solution that gets 'close enough' in eg.
SPECweb99 benchmarks (which is complex enough to be trusted as a generic
performance metric) would be a nice thing to have. There are existing
SIGIO based, multithreaded solutions (eg. phttpd), with varying success.

Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: a quest for a better scheduler

On Tue, 3 Apr 2001, Fabio Riccardi wrote:

> I agree that a better threading model would surely help in a web
> server, but to me this is not an excuse to live up with a broken
> scheduler.

believe me, there are many other parts of the kernel that are not
optimized for the nutcase. In this case it's the scheduler that punishes
you first. Do not shoot the messenger.

> The X15 server I'm working on now is a sort of user-space TUX, it uses
> only 8 threads per CPU and it achieves the same level of performance
> of the kernel space TUX. Even in this case the performance advantage
> of the multiqueue scheduler is remarkable, especially on a multi-CPU
> (> 2) architecture.

then your design is still bad. TUX has no problems with the scheduler, and
TUX doesnt have many runnable processes at once. And this has nothing to
do with TUX being within the kernel.

> To achieve some decent disk/CPU/network overlapping with the current
> linux blocking disk IO limitations there is no way to avoid a "bunch
> of server threads". [...]

false.

> [...] I've (experimentally) figured out that 8-16 threads per CPU can
> assure some reasonable overlapping, depending on the memory size and
> disk subsystem speed. On a 8-way machine this means 64-128 active
> tasks, [...]

if they are doing IO *only* then there wont be 64-128 active tasks. This
is how TUX does async IO: helper threads do the IO and *only* the IO. This
way cache locality is maximized. (the scheduler is irrelevant in this
case.) Again you are blaming the scheduler - the poor scheduler is simply
the first kernel subsystem that tells you: your design still sucks. (if
you have 64-128 active tasks then you already pay a very big cache
footprint price as well.) [I'm wondering how your solution was just as
fast as TUX, although you claim that the scheduler was already killing
things. Perhaps your test was done over localhost or was saturating
network bandwidth?]

> Unless we want to maintain the position tha the only way to achieve
> good performance is to embed server applications in the kernel, [...]

FYI, TUX related functionality is constantly being pushed out and made
accessible to userspace as well. Witness zerocopy, IRQ-affinity, and tons
of generic patches such as pagecache-scalability and timer-scalability.
(and soon the remaining, TUX-private improvements too.) TUX is intended to
be a testbed for kernel subsystems, where performance optimizations and
APIs can be tested without having to export them to userspace. (exporting
to userspace cannot be done lightly and makes things less flexible.)

Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: a quest for a better scheduler

On Tue, 3 Apr 2001, Fabio Riccardi wrote:

> I've spent my afternoon running some benchmarks to see if MQ patches
> would degrade performance in the "normal case".

no doubt priority-queue can run almost as fast as the current scheduler.
What i'm worried about is the restriction of the 'priority' of processes,
it cannot depend on previous processes (and other 'current state')
anymore.

to so we have two separate issues:

#1: priority-queue: has the fundamental goodness() design limitation.

#2: per-CPU-runqueues: changes semantics, makes scheduler less
effective due to nonglobal decisions.

about #1: while right now the prev->mm rule appears to be a tiny issue (it
might not affect performance significantly), but forbidding it by
hardcoding the assumption into data structures is a limitation of *future*
goodness() functions. Eg. with the possible emergence of CPU-level
threading and other, new multiprocessing technologies, this could be a
*big* mistake.

the Linux scheduler is not designed for the case of 1000 runnable
processes.

Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: a quest for a better scheduler

On Tue, 3 Apr 2001, Mike Kravetz wrote:

> Our 'priority queue' implementation uses almost the same goodness
> function as the current scheduler.  The main difference between our
> 'priority queue' scheduler and the current scheduler is the structure
> of the runqueue.  We break up the single runqueue into a set of
> priority based runqueues. The 'goodness' of a task determines what
> sub-queue a task is placed in.  Tasks with higher goodness values are
> placed in higher priority queues than tasks with lower goodness
> values. [...]

we are talking about the same thing, re-read my mail. this design assumes
that 'goodness' is constant in the sense that it does not depend on the
previous process.

and no, your are not using the same goodness() function, you omitted the
prev->mm == next->mm component to goodness(), due to this design
limitation.

Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: a quest for a better scheduler

2001-04-04 Thread alad

If we are facing these problems for "normal case" then hope the Solaris is
handling it !!

Amol

Fabio Riccardi <[EMAIL PROTECTED]> on 04/04/2001 07:03:57 AM

To:   Alan Cox <[EMAIL PROTECTED]>
cc:   [EMAIL PROTECTED] (bcc: Amol Lad/HSS)

Subject:  Re: a quest for a better scheduler

Alan,

for the "normal case" performance see my other message.

I agree that a better threading model would surely help in a web server, but to
me this is not an excuse to live up with a broken scheduler.

The X15 server I'm working on now is a sort of user-space TUX, it uses only 8
threads per CPU and it achieves the same level of performance of the kernel
space TUX. Even in this case the performance advantage of the multiqueue
scheduler is remarkable, especially on a multi-CPU (> 2) architecture.

To achieve some decent disk/CPU/network overlapping with the current linux
blocking disk IO limitations there is no way to avoid a "bunch of server
threads". I've (experimentally) figured out that 8-16 threads per CPU can
assure some reasonable overlapping, depending on the memory size and disk
subsystem speed. On a 8-way machine this means 64-128 active tasks, a total
disaster with the current scheduler.

Unless we want to maintain the position tha the only way to achieve good
performance is to embed server applications in the kernel, some minimal help
should be provided to goodwilling user applications :)

TIA, ciao,

 - Fabio

Alan Cox wrote:

> > Is there any special reason why any of those patches didn't make it to
> > the mainstream kernel code?
>
> All of them are worse for the normal case. Also 1500 running apache's isnt
> a remotely useful situation, you are thrashing the cache even if you are now
> not thrashing the scheduler. Use an httpd designed for that situation. Then
> you can also downgrade to a cheap pentium class box for the task ;)

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: a quest for a better scheduler

2001-04-04 Thread alad




If we are facing these problems for "normal case" then hope the Solaris is
handling it !!

Amol





Fabio Riccardi [EMAIL PROTECTED] on 04/04/2001 07:03:57 AM

To:   Alan Cox [EMAIL PROTECTED]
cc:   [EMAIL PROTECTED] (bcc: Amol Lad/HSS)

Subject:  Re: a quest for a better scheduler




Alan,

for the "normal case" performance see my other message.

I agree that a better threading model would surely help in a web server, but to
me this is not an excuse to live up with a broken scheduler.

The X15 server I'm working on now is a sort of user-space TUX, it uses only 8
threads per CPU and it achieves the same level of performance of the kernel
space TUX. Even in this case the performance advantage of the multiqueue
scheduler is remarkable, especially on a multi-CPU ( 2) architecture.

To achieve some decent disk/CPU/network overlapping with the current linux
blocking disk IO limitations there is no way to avoid a "bunch of server
threads". I've (experimentally) figured out that 8-16 threads per CPU can
assure some reasonable overlapping, depending on the memory size and disk
subsystem speed. On a 8-way machine this means 64-128 active tasks, a total
disaster with the current scheduler.

Unless we want to maintain the position tha the only way to achieve good
performance is to embed server applications in the kernel, some minimal help
should be provided to goodwilling user applications :)

TIA, ciao,

 - Fabio

Alan Cox wrote:

  Is there any special reason why any of those patches didn't make it to
  the mainstream kernel code?

 All of them are worse for the normal case. Also 1500 running apache's isnt
 a remotely useful situation, you are thrashing the cache even if you are now
 not thrashing the scheduler. Use an httpd designed for that situation. Then
 you can also downgrade to a cheap pentium class box for the task ;)

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/




-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: a quest for a better scheduler



On Tue, 3 Apr 2001, Fabio Riccardi wrote:

 I've spent my afternoon running some benchmarks to see if MQ patches
 would degrade performance in the "normal case".

no doubt priority-queue can run almost as fast as the current scheduler.
What i'm worried about is the restriction of the 'priority' of processes,
it cannot depend on previous processes (and other 'current state')
anymore.

to so we have two separate issues:

#1: priority-queue: has the fundamental goodness() design limitation.

#2: per-CPU-runqueues: changes semantics, makes scheduler less
effective due to nonglobal decisions.

about #1: while right now the prev-mm rule appears to be a tiny issue (it
might not affect performance significantly), but forbidding it by
hardcoding the assumption into data structures is a limitation of *future*
goodness() functions. Eg. with the possible emergence of CPU-level
threading and other, new multiprocessing technologies, this could be a
*big* mistake.

the Linux scheduler is not designed for the case of 1000 runnable
processes.

Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: a quest for a better scheduler



On Tue, 3 Apr 2001, Fabio Riccardi wrote:

 I agree that a better threading model would surely help in a web
 server, but to me this is not an excuse to live up with a broken
 scheduler.

believe me, there are many other parts of the kernel that are not
optimized for the nutcase. In this case it's the scheduler that punishes
you first. Do not shoot the messenger.

 The X15 server I'm working on now is a sort of user-space TUX, it uses
 only 8 threads per CPU and it achieves the same level of performance
 of the kernel space TUX. Even in this case the performance advantage
 of the multiqueue scheduler is remarkable, especially on a multi-CPU
 ( 2) architecture.

then your design is still bad. TUX has no problems with the scheduler, and
TUX doesnt have many runnable processes at once. And this has nothing to
do with TUX being within the kernel.

 To achieve some decent disk/CPU/network overlapping with the current
 linux blocking disk IO limitations there is no way to avoid a "bunch
 of server threads". [...]

false.

 [...] I've (experimentally) figured out that 8-16 threads per CPU can
 assure some reasonable overlapping, depending on the memory size and
 disk subsystem speed. On a 8-way machine this means 64-128 active
 tasks, [...]

if they are doing IO *only* then there wont be 64-128 active tasks. This
is how TUX does async IO: helper threads do the IO and *only* the IO. This
way cache locality is maximized. (the scheduler is irrelevant in this
case.) Again you are blaming the scheduler - the poor scheduler is simply
the first kernel subsystem that tells you: your design still sucks. (if
you have 64-128 active tasks then you already pay a very big cache
footprint price as well.) [I'm wondering how your solution was just as
fast as TUX, although you claim that the scheduler was already killing
things. Perhaps your test was done over localhost or was saturating
network bandwidth?]

 Unless we want to maintain the position tha the only way to achieve
 good performance is to embed server applications in the kernel, [...]

FYI, TUX related functionality is constantly being pushed out and made
accessible to userspace as well. Witness zerocopy, IRQ-affinity, and tons
of generic patches such as pagecache-scalability and timer-scalability.
(and soon the remaining, TUX-private improvements too.) TUX is intended to
be a testbed for kernel subsystems, where performance optimizations and
APIs can be tested without having to export them to userspace. (exporting
to userspace cannot be done lightly and makes things less flexible.)

Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: a quest for a better scheduler



On Wed, 4 Apr 2001, Alan Cox wrote:

 The problem has always been - alternative scheduler, crappier
 performance for 2 tasks running (which is most boxes). [...]

it's not only the 2-task case, but also less flexibility or lost
semantics.

 Indeed. I'd love to see you beat tux entirely in userspace. It proves
 the rest of the API for the kernel is right

well, until the cost of entry into the kernel is eliminated, this is not
possible - unless there are performance bugs in TUX :-)

but yes, getting a userspace solution that gets 'close enough' in eg.
SPECweb99 benchmarks (which is complex enough to be trusted as a generic
performance metric) would be a nice thing to have. There are existing
SIGIO based, multithreaded solutions (eg. phttpd), with varying success.

Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: a quest for a better scheduler




This is an important point that Mike is raising and it also addresses a
critique that Ingo issued yesterday, namely interactivity and fairness.
The HP scheduler completely separates the per-CPU runqueues and does
not take preemption goodness or alike into account. This can lead to
unfair proportionment of CPU cycles, strong priority inversion and a
potential
lack of interactivity.

Our MQ scheduler does yield the same decision in most cases
(other than defined by some race condition on locks and counter members)

It is not clear that yielding the same decision as the current scheduler
is the ultimate goal to shoot for, but it allows comparision.

Another point to raise is that the current scheduler does a exhaustive
search
for the "best" task to run. It touches every process in the runqueue. this
is
ok if the runqueue length is limited to a very small multiple of the #cpus.
But that is not what high end server systems encounter.

With the rising number of processors, lock contention can quickly become
a bottleneck. If we assume that load (#running-task) increases somewhat
linear with #cpus, the problem gets even worth because not only have I
increased the number of clients but also the lock hold time.

Clinging on to the statement that #running-tasks ~ #cpus, ofcourse saves
you from that dilemma, but not everybody is signing on to this limitation.

MQ and priority-list help in 2 ways.

MQ reduces the average lock holdtime because on average it only inspects
#running-tasks/#cpus tasks to make a local decision. It then goes on to
inspect (#cpus-1) datastructures representing the next best to run tasks
on those remote cpus all without holding the lock, thus substantially
reducing lock contention. Note we still search the entire set of runnable
tasks, however we do it in a distributed collaborative manner.
The only time we deviate from the current scheduler decision is in the
case when two cpus have identified the same remote task as a target for
remote stealing. In that case one will win and the other cpu will continue
looking somewhere else, although there might have been another tasks
on that cpu to steal.

priority list schedulers (PRS) only helps in reducing lock hold time,
which can result in some relieve wrt lock contention, but not a whole lot.
PRS can limit the lists it has to search based on the PROC_CHANGE_PENALTY.
It also keeps 0-counter in a list that is never inspected. One can
even go further and put YIELD tasks in a separate list, given that the
sys_sched_yield already does some optimizations.
The older version (12/00) posted on LSE is functionally equivalent to the
current scheduler.
I will put up another version this week that is based on a bitmask and
which is a bit more agressive in that it ignores the MM goodness boost of 1
which in my books is merely a tie breaker between two task of equal
goodness.

Beyond that we have done some work on cpu pooling, which is to identify
a set of cpus that form a scheduling set. We still consider in
reschedule_idle all cpus for preemption but in schedule it is sufficient
to only schedule within the own set. That again can limit lock hold time
with MQ and we have seen some improvements. We also deploy load balacing.

To summarize, we have taken great care of trying to preserve the current
scheduler semantics and have laid out a path to relax some of the semantics
for further improvements.
I don't believe that the HP scheduler is sufficient since it is lacking
load balacing, which naturally occurs in our MQ scheduler, and it lacks
the interactivity requirements that Ingo pointed out.

Most of these things are discussed in great detail in the writeups under
lse.sourceforge.net/scheduling. I also believe we show there that the
MQ performance for low thread counts is also matching the vanilla case.

I further don't understand the obsession of keeping the scheduler simple.
If there are improvements and I don't believe the MQ is all that
complicated
and its well documented, why not put it in.


Hubertus Franke
Enterprise Linux Group (Mgr),  Linux Technology Center (Member Scalability)

email: [EMAIL PROTECTED]
(w) 914-945-2003(fax) 914-945-4425   TL: 862-2003



Mike Kravetz [EMAIL PROTECTED] on 04/03/2001 10:47:00 PM

To:   Fabio Riccardi [EMAIL PROTECTED]
cc:   Mike Kravetz [EMAIL PROTECTED], Ingo Molnar [EMAIL PROTECTED],
  Hubertus Franke/Watson/IBM@IBMUS, Linux Kernel List
  [EMAIL PROTECTED], Alan Cox [EMAIL PROTECTED]
Subject:  Re: a quest for a better scheduler



On Tue, Apr 03, 2001 at 05:18:03PM -0700, Fabio Riccardi wrote:

 I have measured the HP and not the "scalability" patch because the two do
more
 or less the same thing and give me the same performance advantages, but
the
 former is a lot simpler and I could port it with no effort on any recent
 kernel.

Actually, there is a significant difference between the HP patch and
the one I developed.  In the HP patch, if there is a schedulable task
on the 'local' (c

Re: a quest for a better scheduler



I grant you that the code is not as clean as the current scheduler,
so maybe you missed that part.

For the priority scheduler:
Yes the task_to_qid assumes a NON-affinity (no cpu, no mm) to determine
the list index to where the task has to be enqueued.
However, if you wonder down to the search_range section of the scheduler,
you see that we  do add the "+1" for equal mm. All I did here was take
the goodness() function a part for search_range in order to insert some
extra code that speeds up the code.

Again, I don't want to lean to hard on the priority scheduler, because
we only did this to have a reference point regarding lock contention and
lock hold. This is stuff that many operating systems did years ago, most
of which have moved on to add MQ to that. So that is what the LSE
scheduling team is pushing.

I understand the dilemma that the Linux scheduler is in, namely satisfy
the low end at all cost. But I believe that in order for Linux to push
into the high end we need to address the scalability of the current
scheduler.
Simply handwaving and declaring that lots of running tasks is a stupid
idea doesn't get us there.
If indeed you assume that there #running-tasks ~ #cpus then each task
should alot a reasonable amount of work and any small overhead incurred
should be amortizable. However as we contend if #running-tasks  #cpus
is a situation we need to deal with, the living with the current lock
contention can really drag us down.


Hubertus Franke
Enterprise Linux Group (Mgr),  Linux Technology Center (Member Scalability)

email: [EMAIL PROTECTED]
(w) 914-945-2003(fax) 914-945-4425   TL: 862-2003



Ingo Molnar [EMAIL PROTECTED]@elte.hu on 04/04/2001 02:28:31 AM

Please respond to [EMAIL PROTECTED]

Sent by:  [EMAIL PROTECTED]


To:   Mike Kravetz [EMAIL PROTECTED]
cc:   Hubertus Franke/Watson/IBM@IBMUS, Fabio Riccardi
  [EMAIL PROTECTED], Linux Kernel List
  [EMAIL PROTECTED]
Subject:  Re: a quest for a better scheduler




On Tue, 3 Apr 2001, Mike Kravetz wrote:

 Our 'priority queue' implementation uses almost the same goodness
 function as the current scheduler.  The main difference between our
 'priority queue' scheduler and the current scheduler is the structure
 of the runqueue.  We break up the single runqueue into a set of
 priority based runqueues. The 'goodness' of a task determines what
 sub-queue a task is placed in.  Tasks with higher goodness values are
 placed in higher priority queues than tasks with lower goodness
 values. [...]

we are talking about the same thing, re-read my mail. this design assumes
that 'goodness' is constant in the sense that it does not depend on the
previous process.

and no, your are not using the same goodness() function, you omitted the
prev-mm == next-mm component to goodness(), due to this design
limitation.

 Ingo





-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: a quest for a better scheduler



On Wed, 4 Apr 2001, Hubertus Franke wrote:

 I understand the dilemma that the Linux scheduler is in, namely
 satisfy the low end at all cost. [...]

nope. The goal is to satisfy runnable processes in the range of NR_CPUS.
You are playing word games by suggesting that the current behavior prefers
'low end'. 'thousands of runnable processes' is not 'high end' at all,
it's 'broken end'. Thousands of runnable processes are the sign of a
broken application design, and 'fixing' the scheduler to perform better in
that case is just fixing the symptom. [changing the scheduler to perform
better in such situations is possible too, but all solutions proposed so
far had strings attached.]

Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: a quest for a better scheduler



On Wed, 4 Apr 2001, Hubertus Franke wrote:

 It is not clear that yielding the same decision as the current
 scheduler is the ultimate goal to shoot for, but it allows
 comparision.

obviously the current scheduler is not cast into stone, it never was,
never will be.

but determining whether the current behavior is possible in a different
scheduler design is sure a good metric of how flexible that different
scheduler design is.

Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: a quest for a better scheduler

On Wed, Apr 04, 2001 at 03:34:22PM +0200, Ingo Molnar wrote:

On Wed, 4 Apr 2001, Hubertus Franke wrote:

Another point to raise is that the current scheduler does a exhaustive
search for the "best" task to run. It touches every process in the
runqueue. this is ok if the runqueue length is limited to a very small
multiple of the #cpus. [...]

indeed. The current scheduler handles UP and SMP systems, up to 32
(perhaps 64) CPUs efficiently. Agressively NUMA systems need a different
approach anyway in many other subsystems too, Kanoj is doing some
scheduler work in that area.

I didn't seen anything from Kanoj but I did something myself for the wildfire:

ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.3aa1/10_numa-sched-1

this is mostly an userspace issue, not really intended as a kernel optimization
(however it's also partly a kernel optimization). Basically it splits the load
of the numa machine into per-node load, there can be unbalanced load across the
nodes but fairness is guaranteed inside each node. It's not extremely well
tested but benchmarks were ok and it is at least certainly stable.

However Ingo consider that in a 32-way if you don't have at least 32 tasks
running all the time _always_ you're really stupid paying such big money for
nothing ;). So the fact the scheduler is optimized for 1/2 tasks running all
the time is not nearly enough for those machines (and of course also the
scheduling rate automatically increases linearly with the increase of the
number of cpus). Now it's perfectly fine that we don't ask the embedded and
desktop guys to pay for that, but a kernel configuration option to select an
algorithm that scales would be a good idea IMHO. The above patch just adds a
CONFIG_NUMA_SCHED. The scalable algorithm can fit into it and nobody will be
hurted by that (CONFIG_NUMA_SCHED cannot even be selected by x86 compiles).

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Re: a quest for a better scheduler


On Wed, Apr 04, 2001 at 10:03:10AM -0400, Hubertus Franke wrote:
 I understand the dilemma that the Linux scheduler is in, namely satisfy
 the low end at all cost. [..]

We can satisfy the low end by making the numa scheduler at compile time (that's
what I did in my patch at least).

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: a quest for a better scheduler



Yes, Andrea.

We actually already went a step further. We treat the scheduler
as a single entity, rather than splitting it up.

Based on the MQ scheduler we do the balancing across all nodes
at reschedule_idle time. We experimented to see whether only looking
for idle tasks remotely is a good idea, but it bloats the code.
Local scheduling decisions are limited to a set of cpus, which
could coincide with the cpus of one node, or if desirable on
smaller granularities.

In addition we implemented a global load balancing scheme that
ensures that load is equally distributed across all run queues.
This is made a loadable module, so you can plug in what ever
you want.

I grant in NUMA it might actually be desirable to separate
schedulers completely (we can do that trivially in reschedule_idle),
but you need loadbalancing at some point.

Here is the list of patches:

MultiQueue Scheduler: http://lse.sourceforge.net/scheduling/2.4.1.mq1-sched
Pooling Extension:http://lse.sourceforge.net/scheduling/LB/2.4.1-MQpool
LoadBalancing:
http://lse.sourceforge.net/scheduling/LB/loadbalance.c


Hubertus Franke
Enterprise Linux Group (Mgr),  Linux Technology Center (Member Scalability)
, OS-PIC (Chair)
email: [EMAIL PROTECTED]
(w) 914-945-2003(fax) 914-945-4425   TL: 862-2003



Andrea Arcangeli [EMAIL PROTECTED] on 04/04/2001 11:08:47 AM

To:   Ingo Molnar [EMAIL PROTECTED]
cc:   Hubertus Franke/Watson/IBM@IBMUS, Mike Kravetz
  [EMAIL PROTECTED], Fabio Riccardi [EMAIL PROTECTED], Linux
  Kernel List [EMAIL PROTECTED],
  [EMAIL PROTECTED]
Subject:  Re: a quest for a better scheduler



On Wed, Apr 04, 2001 at 03:34:22PM +0200, Ingo Molnar wrote:

 On Wed, 4 Apr 2001, Hubertus Franke wrote:

  Another point to raise is that the current scheduler does a exhaustive
  search for the "best" task to run. It touches every process in the
  runqueue. this is ok if the runqueue length is limited to a very small
  multiple of the #cpus. [...]

 indeed. The current scheduler handles UP and SMP systems, up to 32
 (perhaps 64) CPUs efficiently. Agressively NUMA systems need a different
 approach anyway in many other subsystems too, Kanoj is doing some
 scheduler work in that area.

I didn't seen anything from Kanoj but I did something myself for the
wildfire:


ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.3aa1/10_numa-sched-1


this is mostly an userspace issue, not really intended as a kernel
optimization
(however it's also partly a kernel optimization). Basically it splits the
load
of the numa machine into per-node load, there can be unbalanced load across
the
nodes but fairness is guaranteed inside each node. It's not extremely well
tested but benchmarks were ok and it is at least certainly stable.

However Ingo consider that in a 32-way if you don't have at least 32 tasks
running all the time _always_ you're really stupid paying such big money
for
nothing ;). So the fact the scheduler is optimized for 1/2 tasks running
all
the time is not nearly enough for those machines (and of course also the
scheduling rate automatically increases linearly with the increase of the
number of cpus). Now it's perfectly fine that we don't ask the embedded and
desktop guys to pay for that, but a kernel configuration option to select
an
algorithm that scales would be a good idea IMHO. The above patch just adds
a
CONFIG_NUMA_SCHED. The scalable algorithm can fit into it and nobody will
be
hurted by that (CONFIG_NUMA_SCHED cannot even be selected by x86 compiles).

Andrea



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: a quest for a better scheduler




You imply that high end means thousands of processes,
simply because we have shown that in our graphs as an
asymptotic end.
No, it could mean 5*#cpus and that is not all that absurd.
This could happen with a spike in demand.

TUX is not the greatest example to use, because it does
static webpage serving and is hence tied into the
file cache. If you move up the food chain, where middleware
is active and things are a bit more complicated than
sending stuff out of the filecache, having a bunch of threads
hanging around to deal with the spike in demand is common
practive, although you think its bad.

Now coming again to MQ (forget about priority list from
now on).

When we scan either the local or the realtime queues we do
use goodness value. So we have the same flexibility as the
current scheduler if it comes to goodness() flexibility and
future improvements.

For remote stealing we do use na_goodness to compare
for a better process to steal. Hence we would loose the "+1"
information here, nevertheless, once a decision
has been made, we still use preemption verification with goodness.
Eitherway, being off by "+1" for regular tasks once in a while
is no big deal, because this problem already exists today.
While walking the runqueue, another processor can either update
the counter value of task (ok that's covered by can_schedule)
or can run recalculate, which makes the decision that one is
about to make wrong from the point of always running the best.
But that's ok, because counter, nice etc. are approximations
anyway. Through in PROC_CHANGE_PENALTY and you have a few knobs
that are used to control interactivity and throughput.

If you look at some of the results with our reflex benchmark.
For the low thread count we basically show improved performance
on the 2,4,5,6,7,8 way system if #threads  #cpus, they all show
improvements. Look at the following numbers running the
reflex benchmark, left most columns are number of threads, with
typically 1/2 of them runnable.
You can clearly see that the priority list suffers from overhead,
but MQ is beating vanilla pretty much everywhere. Again, this
is due because of rapid scheduler invocation and resulting
lock contention.

  2-way
 2.4.1  2.4.1-mq1 2.4.1-prbit
46.725  4.691  7.387
86.326  4.766  6.421
12   6.838  5.233  6.431
16   7.13  5.415  7.29

  4-way
 2.4.1  2.4.1-mq1 2.4.1-prbit
49.42  7.873  10.592
88.143  3.799  8.691
12   7.877  3.537  8.101
16   7.688  2.953  7.144

  6-way
 2.4.1  2.4.1-mq1 2.4.1-prbit
49.595  7.88  10.358
89.703  7.278  10.523
12   10.0164.652  10.985
16   9.882  3.629  10.525

  8-way
 2.4.1  2.4.1-mq1 2.4.1-prbit
49.804  8.033  10.548
810.4365.783  11.475
12   10.9256.787  11.646
16   11.4265.048  11.877
20   11.4383.895  11.633
24   11.4573.304  11.347
28   11.4953.073  11.09
32   11.53  2.944  10.898


Hubertus Franke
Enterprise Linux Group (Mgr),  Linux Technology Center (Member Scalability)

email: [EMAIL PROTECTED]
(w) 914-945-2003(fax) 914-945-4425   TL: 862-2003



Ingo Molnar [EMAIL PROTECTED]@elte.hu on 04/04/2001 09:23:34 AM

Please respond to [EMAIL PROTECTED]

Sent by:  [EMAIL PROTECTED]


To:   Hubertus Franke/Watson/IBM@IBMUS
cc:   Mike Kravetz [EMAIL PROTECTED], Fabio Riccardi
  [EMAIL PROTECTED], Linux Kernel List
  [EMAIL PROTECTED]
Subject:  Re: a quest for a better scheduler




On Wed, 4 Apr 2001, Hubertus Franke wrote:

 I understand the dilemma that the Linux scheduler is in, namely
 satisfy the low end at all cost. [...]

nope. The goal is to satisfy runnable processes in the range of NR_CPUS.
You are playing word games by suggesting that the current behavior prefers
'low end'. 'thousands of runnable processes' is not 'high end' at all,
it's 'broken end'. Thousands of runnable processes are the sign of a
broken application design, and 'fixing' the scheduler to perform better in
that case is just fixing the symptom. [changing the scheduler to perform
better in such situations is possible too, but all solutions proposed so
far had strings attached.]

 Ingo




-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: a quest for a better scheduler


Hubertus Franke wrote:
 
 This is an important point that Mike is raising and it also addresses a
 critique that Ingo issued yesterday, namely interactivity and fairness.
 The HP scheduler completely separates the per-CPU runqueues and does
 not take preemption goodness or alike into account. This can lead to
 unfair proportionment of CPU cycles, strong priority inversion and a
 potential
 lack of interactivity.
 
 Our MQ scheduler does yield the same decision in most cases
 (other than defined by some race condition on locks and counter members)
 

Let me stress that HP scheduler is not meant to be a replacement for the
current scheduler. The HP scheduler patch allows the current scheduler
to be replaced by another scheduler by loading a module in special
cases. HP is providing three different loadable scheduler modules -
Processor sets, Constant time scheduler, and Multi-runqueue scheduler.
Each one of these is geared towards a specific requirement. I would not
suggest using any of these for a generalized case. Processor sets
scheduler is designed to make scheduling decisions on a per-cpu basis
and not global basis. All we are trying to do is to make the current
scheduler modular so we CAN load an alternate scheduling policy module
in cases where the process mix requires a different scheduling policy or
the site policy require a different scheduling policy. An example of a
specific site processor allocation policy could be a site that runs a
database server on an MP machine along with a few other processes and
the administrator wants to guarantee that the database server process
always gets x percent of processing time or one CPU be dedicated to just
the database server. A policy like this is not meant to be fair and of
course, not a policy we want to impose upon others. The only HP changes
I would put in the kernel sources for general release would be the
changes to scheduler to allow such alternate (not necessarily fair or
the fastest for benchmarks, general process mix or 1000's of processes)
policies to be loaded. When a policy module is not loaded, scheduler
works exactly like the current scheduler even after HP patches. There
are people who could benefit from being able to load alternate policy
schedules. Fabio Ricardi happens to be one of them :-) Anyone who does
not want to even allow an alternate scheduler module to be loaded can
simply compile the alternate scheduler support out and that is how I
would expect most kernels to be compiled, especially the ones that ship
with various distributions. I would like the decision to include support
for alternate scheduler to be made by sys admins themselves for their
individual cases.

--
Khalid
 

Khalid Aziz Linux Development Laboratory
(970)898-9214Hewlett-Packard
[EMAIL PROTECTED]Fort Collins, CO
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: a quest for a better scheduler


Andrea Arcangeli wrote:
 
 On Wed, Apr 04, 2001 at 10:03:10AM -0400, Hubertus Franke wrote:
  I understand the dilemma that the Linux scheduler is in, namely satisfy
  the low end at all cost. [..]
 
 We can satisfy the low end by making the numa scheduler at compile time (that's
 what I did in my patch at least).
 
 Andrea

I fully agree with this approach. It would be very hard to design a
scheduler that performs equally well on a UP machine running couple of
processes and a NUMA machine. These two cases represent the two ends of
spectrum. The two schedulers should be separate IMO and one of them
should be selected at compile time.

--
Khalid
 

Khalid Aziz Linux Development Laboratory
(970)898-9214Hewlett-Packard
[EMAIL PROTECTED]Fort Collins, CO
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Lse-tech] Re: a quest for a better scheduler

2001-04-04 Thread Christoph Hellwig


On Wed, Apr 04, 2001 at 09:44:22AM -0600, Khalid Aziz wrote:
 Let me stress that HP scheduler is not meant to be a replacement for the
 current scheduler. The HP scheduler patch allows the current scheduler
 to be replaced by another scheduler by loading a module in special
 cases.

HP also has a simple mq patch that is _not_ integrated into the pluggable
scheduler framework, I have used it myself.

Christoph

-- 
Of course it doesn't work. We've performed a software upgrade.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: a quest for a better scheduler

2001-04-04 Thread Davide Libenzi



On 04-Apr-2001 Ingo Molnar wrote:
 
 On Tue, 3 Apr 2001, Fabio Riccardi wrote:
 
 I've spent my afternoon running some benchmarks to see if MQ patches
 would degrade performance in the "normal case".
 
 no doubt priority-queue can run almost as fast as the current scheduler.
 What i'm worried about is the restriction of the 'priority' of processes,
 it cannot depend on previous processes (and other 'current state')
 anymore.
 
 to so we have two separate issues:
 
#1: priority-queue: has the fundamental goodness() design limitation.
 
#2: per-CPU-runqueues: changes semantics, makes scheduler less
 effective due to nonglobal decisions.
 
 about #1: while right now the prev-mm rule appears to be a tiny issue (it
 might not affect performance significantly), but forbidding it by
 hardcoding the assumption into data structures is a limitation of *future*
 goodness() functions. Eg. with the possible emergence of CPU-level
 threading and other, new multiprocessing technologies, this could be a
 *big* mistake.

This is not correct Ingo. I haven't seen the HP code but if You store processes
in slots S :

S = FS( goodness(p, p-processor, p-mm) )

and You start scanning from the higher slots, as soon as you find a task with a
goodness G' that is equal to the max goodness in slot You can choose that
process to run.
Again, if You haven't found such a goodness during the slot scan but You've
found a task with a goodness G' :

G' = SG - DD

where :

SG = max slot goodness
DD = SG(i) - SG(i - 1)

You can select that task as the next to spin.
This was the behaviour that was implemented in my scheduler patch about 2 years
ago.
Beside this, I this that with such loads We've more serious problem to face
with inside the kernel.



- Davide

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Lse-tech] Re: a quest for a better scheduler


 
 
 On Wed, 4 Apr 2001, Hubertus Franke wrote:
 
  Another point to raise is that the current scheduler does a exhaustive
  search for the "best" task to run. It touches every process in the
  runqueue. this is ok if the runqueue length is limited to a very small
  multiple of the #cpus. [...]
 
 indeed. The current scheduler handles UP and SMP systems, up to 32
 (perhaps 64) CPUs efficiently. Agressively NUMA systems need a different
 approach anyway in many other subsystems too, Kanoj is doing some
 scheduler work in that area.

Actually, not _much_ work has been done in this area. Alongwith a bunch
of other people, I have some ideas about what needs to be done. For 
example, for NUMA, we need to try hard to schedule a thread on the 
node that has most of its memory (for no reason other than to decrease
memory latency). Independently, some NUMA machines build in multilevel 
caches and local snoops that also means that specific processors on
the same node as the last_processor are also good candidates to run 
the process next.

To handle a single layer of shared caches, I have tried certain simple
things, mostly as hacks, but am not pleased with the results yet. More
testing needed.

Kanoj

 
 but the original claim was that the scheduling of thousands of runnable
 processes (which is not equal to having thousands of sleeping processes)
 must perform well - which is a completely different issue.
 
   Ingo
 
 
 ___
 Lse-tech mailing list
 [EMAIL PROTECTED]
 http://lists.sourceforge.net/lists/listinfo/lse-tech
 

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Lse-tech] Re: a quest for a better scheduler

I didn't seen anything from Kanoj but I did something myself for the wildfire:

ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.3aa1/10_numa-sched-1

Just a quick comment. Andrea, unless your machine has some hardware
that imply pernode runqueues will help (nodelevel caches etc), I fail
to understand how this is helping you ... here's a simple theory though.
If your system is lightly loaded, your pernode queues are actually
implementing some sort of affinity, making sure processes stick to
cpus on nodes where they have allocated most of their memory on. I am
not sure what the situation will be under huge loads though.

As I have mentioned to some people before, percpu/pernode/percpuset/global
runqueues probably all have their advantages and disadvantages, and their
own sweet spots. Wouldn't it be really neat if a system administrator
or performance expert could pick and choose what scheduler behavior he
wants, based on how the system is going to be used?

Kanoj
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Re: [Lse-tech] Re: a quest for a better scheduler

On Wed, Apr 04, 2001 at 09:39:23AM -0700, Kanoj Sarcar wrote:
example, for NUMA, we need to try hard to schedule a thread on the
node that has most of its memory (for no reason other than to decrease
memory latency). Independently, some NUMA machines build in multilevel
caches and local snoops that also means that specific processors on
the same node as the last_processor are also good candidates to run
the process next.

yes. That will probably need to be optional and choosen by the architecture
at compile time too. The probably most important factor to consider is the
penality of accessing remote memory, I think I can say on all recent and future
machines with a small difference between local and remote memory (and possibly
as you say with a decent cache protocol able to snoop cacheline data from the
other cpus even if they're not dirty) it's much better to always try to keep
the task in its last node. My patch is actually assuming recent machines and it
keeps the task in its last node if not in the last cpu and it keeps doing
memory allocation from there and it forgets about its original node where it
started allocating the memory from. This provided the best performance during
userspace CPU bound load as far I can tell and it also better distribute the load.

Kanoj could you also have a look at the NUMA related common code MM fixes I did
in this patch? I'd like to get them integrated (just skip the arch/alpha/*
include/asm-alpha/* stuff while reading the patch, they're totally orthogonal).

ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.3aa1/00_alpha-numa-1

If you prefer I can extract them in a more finegrinded patch just dropping
the alpha stuff by hand.

Re: [Lse-tech] Re: a quest for a better scheduler




Kanoj, our cpu-pooling + loadbalancing allows you to do that.
The system adminstrator can specify at runtime through a
/proc filesystem interface the cpu-pool-size, whether loadbalacing
should take place.
We can put limiting to the local cpu-set during reschedule_idle
back into the code, to make it complete and compatible with
the approach that Andrea has taken.
This way, one can fully isolate or combine cpu-sets.

here is the code for the pooling.
http://lse.sourceforge.net/scheduling/LB/2.4.1-MQpool

loadbalancing and /proc system combined in this module.
http://lse.sourceforge.net/scheduling/LB/loadbalance.c

a writeup explaining this concept is available under
http://lse.sourceforge.net/scheduling/LB/poolMQ.html

Prerequisite is the MQ scheduler...
http://lse.sourceforge.net/scheduling/2.4.1.mq1-sched

We need to update these for 2.4.3  (coming)

Hubertus Franke
Enterprise Linux Group (Mgr),  Linux Technology Center (Member Scalability)
, OS-PIC (Chair)
email: [EMAIL PROTECTED]
(w) 914-945-2003(fax) 914-945-4425   TL: 862-2003



Kanoj Sarcar [EMAIL PROTECTED]@lists.sourceforge.net on
04/04/2001 12:50:58 PM

Sent by:  [EMAIL PROTECTED]


To:   [EMAIL PROTECTED] (Andrea Arcangeli)
cc:   [EMAIL PROTECTED] (Ingo Molnar), Hubertus Franke/Watson/IBM@IBMUS,
  [EMAIL PROTECTED] (Mike Kravetz), [EMAIL PROTECTED] (Fabio
  Riccardi), [EMAIL PROTECTED] (Linux Kernel List),
  [EMAIL PROTECTED]
Subject:  Re: [Lse-tech] Re: a quest for a better scheduler




 I didn't seen anything from Kanoj but I did something myself for the
wildfire:


ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.3aa1/10_numa-sched-1


 this is mostly an userspace issue, not really intended as a kernel
optimization
 (however it's also partly a kernel optimization). Basically it splits the
load
 of the numa machine into per-node load, there can be unbalanced load
across the
 nodes but fairness is guaranteed inside each node. It's not extremely
well
 tested but benchmarks were ok and it is at least certainly stable.


Just a quick comment. Andrea, unless your machine has some hardware
that imply pernode runqueues will help (nodelevel caches etc), I fail
to understand how this is helping you ... here's a simple theory though.
If your system is lightly loaded, your pernode queues are actually
implementing some sort of affinity, making sure processes stick to
cpus on nodes where they have allocated most of their memory on. I am
not sure what the situation will be under huge loads though.

As I have mentioned to some people before, percpu/pernode/percpuset/global
runqueues probably all have their advantages and disadvantages, and their
own sweet spots. Wouldn't it be really neat if a system administrator
or performance expert could pick and choose what scheduler behavior he
wants, based on how the system is going to be used?

Kanoj

___
Lse-tech mailing list
[EMAIL PROTECTED]
http://lists.sourceforge.net/lists/listinfo/lse-tech



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Lse-tech] Re: a quest for a better scheduler


 
 
 
 Kanoj, our cpu-pooling + loadbalancing allows you to do that.
 The system adminstrator can specify at runtime through a
 /proc filesystem interface the cpu-pool-size, whether loadbalacing
 should take place.

Yes, I think this approach can support the various requirements
put on the scheduler.

I think there are two degrees of freedom that are needed in the
scheduler. One, as you say, for the sysadmin to be able to specify
what overall scheduler behavior he wants. 

Secondly, from the kernel standpoint, there needs to be perarch
hooks, to be able to utilize nodelevel/multilevel caches, NUMA
aspects etc.

Kanoj
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Lse-tech] Re: a quest for a better scheduler

On Wed, Apr 04, 2001 at 09:50:58AM -0700, Kanoj Sarcar wrote:

I didn't seen anything from Kanoj but I did something myself for the wildfire:

ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.3aa1/10_numa-sched-1

It helps by keeping the task in the same node if it cannot keep it in
the same cpu anymore.

Assume task A is sleeping and it last run on cpu 8 node 2. It gets a wakeup
and it gets running and for some reason cpu 8 is busy and there are other
cpus idle in the system. Now with the current scheduler it can be moved in any
cpu in the system, with the numa sched applied we will try to first reschedule
it in the idles cpus of node 2 for example. The per-node runqueue are mainly
necessary to implement the heuristic.

cpus on nodes where they have allocated most of their memory on. I am
not sure what the situation will be under huge loads though.

after all cpus are busy we try to reschedule only on the cpus of the local
node, that's why it can generate some unbalance yes, but it will tend to
rebalance over the time because some node will end with all tasks with
zero counter first if it's less loaded, and so then it will start
getting tasks with has_cpu 0 in the runqueues out of other nodes.

You may want to give it a try on your machines and see what difference it
makes, I'd be curious to know of course.

Re: a quest for a better scheduler



Well put, this how we can eliminate searching all bins or lists and that's
how we do it under.
http://lse.sourceforge.net/scheduling/2.4.1-pre8-prioSched.

If you have a list per priority level, then you can even pick the first one
you find if its on
the same level. That's what I tried in a more recent implementation. Also
combined that
with using a bitmask to represent non-empty tasks.

Hubertus Franke
Enterprise Linux Group (Mgr),  Linux Technology Center (Member Scalability)
, OS-PIC (Chair)
email: [EMAIL PROTECTED]
(w) 914-945-2003(fax) 914-945-4425   TL: 862-2003



Davide Libenzi [EMAIL PROTECTED]@ewok.dev.mycio.com on 04/04/2001
12:12:54 PM

Sent by:  [EMAIL PROTECTED]


To:   Ingo Molnar [EMAIL PROTECTED]
cc:   Linus Torvalds [EMAIL PROTECTED], Alan Cox
  [EMAIL PROTECTED], Linux Kernel List
  [EMAIL PROTECTED], Hubertus Franke/Watson/IBM@IBMUS,
  Mike Kravetz [EMAIL PROTECTED], Fabio Riccardi
  [EMAIL PROTECTED]
Subject:  Re: a quest for a better scheduler




On 04-Apr-2001 Ingo Molnar wrote:

 On Tue, 3 Apr 2001, Fabio Riccardi wrote:

 I've spent my afternoon running some benchmarks to see if MQ patches
 would degrade performance in the "normal case".

 no doubt priority-queue can run almost as fast as the current scheduler.
 What i'm worried about is the restriction of the 'priority' of processes,
 it cannot depend on previous processes (and other 'current state')
 anymore.

 to so we have two separate issues:

#1: priority-queue: has the fundamental goodness() design limitation.

#2: per-CPU-runqueues: changes semantics, makes scheduler less
 effective due to nonglobal decisions.

 about #1: while right now the prev-mm rule appears to be a tiny issue
(it
 might not affect performance significantly), but forbidding it by
 hardcoding the assumption into data structures is a limitation of
*future*
 goodness() functions. Eg. with the possible emergence of CPU-level
 threading and other, new multiprocessing technologies, this could be a
 *big* mistake.

This is not correct Ingo. I haven't seen the HP code but if You store
processes
in slots S :

S = FS( goodness(p, p-processor, p-mm) )

and You start scanning from the higher slots, as soon as you find a task
with a
goodness G' that is equal to the max goodness in slot You can choose that
process to run.
Again, if You haven't found such a goodness during the slot scan but You've
found a task with a goodness G' :

G' = SG - DD

where :

SG = max slot goodness
DD = SG(i) - SG(i - 1)

You can select that task as the next to spin.
This was the behaviour that was implemented in my scheduler patch about 2
years
ago.
Beside this, I this that with such loads We've more serious problem to face
with inside the kernel.



- Davide




-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: a quest for a better scheduler

2001-04-04 Thread Mike Kravetz


On Tue, Apr 03, 2001 at 09:21:57PM -0700, Fabio Riccardi wrote:
 I was actually suspecting that the extra lines in your patch were there for a
 reason :)
 
 A few questions:
 
 What is the real impact of a (slight) change in scheduling semantics?
 
 Under which situation one should notice a difference?

I assume your questions are directed at the difference between local
and global scheduling decisions.  As with most things the impact of
these differences depends on workload.  Most multi-queue scheduler
implementations make local scheduling decisions and depend on load
balancing for system wide fairness.  Schedulers which make global
decisions handle load balancing via their global policy.

The HP multi-queue implementation you are using performs no load
balancing.  Therefore, in a worst case situation you could have
low priority tasks sharing one CPU while high priority tasks are
sharing another.  To be fair, I have talked to the HP people and
this scheduler was never intended to be a general purpose solution.
Therefore, it makes little sense to comment its merits as such.

 As you state in your papers the global decision comes with a cost,
 is it worth it?

My primary motivation for attempting to perform the same global
decisions as the current scheduler was so that meaningful comparisons
could be made.  Also, by using the same global decision policy I
was able to avoid the issue of load balancing.  In most multi-queue
implementations, load balancing algorithms take considerable effort
to get working in a reasonable well performing manner.

 
 Could you make a port of your thing on recent kernels?

There is a 2.4.2 patch on the web page.  I'll put out a 2.4.3 patch
as soon as I get some time.

-- 
Mike Kravetz [EMAIL PROTECTED]
IBM Linux Technology Center
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Lse-tech] Re: a quest for a better scheduler



Correct, that's true.

Our patch does various things.
(a) limit search for a task to a admin specified set of cpu's
during schedule()..
(b) limits search for a preemptable task to another set of cpu's
during reschedule_idle()
 need to reactivate this functionality 10 lines of code
(c) loadbalancing, i.e. moving from queue to queue.
Currently we balance within a set and across sets.

Obviously in NUMA one could specify
 (a) such that multiple sets fall into the same node
 no node crossings.
 (b) specify this set to at least span a node
 (c) do some intelligent moving based on memory maps
  etc.

I guess (c) would be first instance on where to plug architecture
dependent information, e.g. how much memory footprint does a task
have on a particular node and how much would the moving cost.
The loadbalance we provide is a simple sceleton to tickle you mind,
not a solution. Nevertheless, one can see it can have some impact.

See for results for various combinations of poolsizes and balancings:
http://lse.sourceforge.net/scheduling/results012501/status.html#Load%20Balancing


Hubertus Franke
Enterprise Linux Group (Mgr),  Linux Technology Center (Member Scalability)

email: [EMAIL PROTECTED]
(w) 914-945-2003(fax) 914-945-4425   TL: 862-2003



Kanoj Sarcar [EMAIL PROTECTED] on 04/04/2001 01:14:28 PM

To:   Hubertus Franke/Watson/IBM@IBMUS
cc:   [EMAIL PROTECTED] (Linux Kernel List),
  [EMAIL PROTECTED]
Subject:  Re: [Lse-tech] Re: a quest for a better scheduler






 Kanoj, our cpu-pooling + loadbalancing allows you to do that.
 The system adminstrator can specify at runtime through a
 /proc filesystem interface the cpu-pool-size, whether loadbalacing
 should take place.

Yes, I think this approach can support the various requirements
put on the scheduler.

I think there are two degrees of freedom that are needed in the
scheduler. One, as you say, for the sysadmin to be able to specify
what overall scheduler behavior he wants.

Secondly, from the kernel standpoint, there needs to be perarch
hooks, to be able to utilize nodelevel/multilevel caches, NUMA
aspects etc.

Kanoj



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Lse-tech] Re: a quest for a better scheduler

2001-04-04 Thread Paul McKenney



 Just a quick comment. Andrea, unless your machine has some hardware
 that imply pernode runqueues will help (nodelevel caches etc), I fail
 to understand how this is helping you ... here's a simple theory though.
 If your system is lightly loaded, your pernode queues are actually
 implementing some sort of affinity, making sure processes stick to
 cpus on nodes where they have allocated most of their memory on. I am
 not sure what the situation will be under huge loads though.

Exactly.  If a given task has run on a particular nodes for a while,
its memory will tend to be allocated on that node.  So preferentially
running it on another CPU on that same node should get you better
memory latency than would running it on some other node's CPUs.

In addition, continuing to run the task on a particular node means
that more of that task's memory is from that node, which again means
good memory latency.  In contrast, if you move a task back and forth
between nodes, it can end up with its memory spread over many nodes,
which means that it does not get good memory latency no matter where
you run it.

  Thanx, Paul

 As I have mentioned to some people before,
percpu/pernode/percpuset/global
 runqueues probably all have their advantages and disadvantages, and their
 own sweet spots. Wouldn't it be really neat if a system administrator
 or performance expert could pick and choose what scheduler behavior he
 wants, based on how the system is going to be used?

 Kanoj

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Lse-tech] Re: a quest for a better scheduler


 
 It helps by keeping the task in the same node if it cannot keep it in
 the same cpu anymore.
 
 Assume task A is sleeping and it last run on cpu 8 node 2. It gets a wakeup
 and it gets running and for some reason cpu 8 is busy and there are other
 cpus idle in the system. Now with the current scheduler it can be moved in any
 cpu in the system, with the numa sched applied we will try to first reschedule
 it in the idles cpus of node 2 for example. The per-node runqueue are mainly
 necessary to implement the heuristic.


Yes. But this is not the best solution, if I can add on to the example
and make some assumptions.

Imagine that most of the program's memory is on node 1, it was scheduled
on node 2 cpu 8 momentarily (maybe because kswapd ran on node 1, other
higher priority processes took over other cpus on node 1, etc). 

Then, your patch will try to keep the process on node 2, which is not
neccessarily the best solution. Of course, as I mentioned before, if
you have a node local cache on node 2, that cache might have been warmed
enough to make scheduling on node 2 a good option. 

I am not saying there is a wrong or right answer, there are so many
possibilities, everything probably works and breaks under different
circumstances. 

Btw, while we are swapping patches, the patch at

http://oss.sgi.com/projects/numa/download/sched242.patch

tries to implement per-arch scheduling. The current scheduler behavior
is smp_arch_goodness() and smp_pick_cpu(), but the patch allows the
possibility for a specific platform to change that to something else. 

Linus has seen this patch, and agrees to it in principle. He does not
consider this 2.4 material though. Of course, I am completely open to
Ingo (or someone else) coming up with a different way of providing the
same freedom to arch specific code.

Kanoj
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Lse-tech] Re: a quest for a better scheduler


On Wed, Apr 04, 2001 at 10:49:04AM -0700, Kanoj Sarcar wrote:
 Imagine that most of the program's memory is on node 1, it was scheduled
 on node 2 cpu 8 momentarily (maybe because kswapd ran on node 1, other
 higher priority processes took over other cpus on node 1, etc). 
 
 Then, your patch will try to keep the process on node 2, which is not
 neccessarily the best solution. Of course, as I mentioned before, if
 you have a node local cache on node 2, that cache might have been warmed
 enough to make scheduling on node 2 a good option. 

Exactly it made it a good option, and more important this heuristic can
only improve performance if compared to the mainline scheduler.

Infact I tried to reschedule the task back to its original node and it dropped
performance, anyways I cannot say to have done an extensive research on that, I
believe if we take care of some more history of the node migration we may
understand it's right time to push it back to its original node but that was
not obvious and I wanted a simple solution to boost the performance under CPU
bound load to start with.

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: a quest for a better scheduler



I give you a concrete example:

Running DB2 on an SMP system.

In DB2 there is a processes/thread pool that is sized based
on memory and numcpus. People tell me that the size of this pool
is in the order of 100s for an 8-way system with reasonable
sized database. These maxagents determine the number of agents
that can simultaneously execute an SQL statement.

Requests are flying in for transactions (e.g. driven by TPC-W like
applications). The agents are grepped from the pool and concurrently
fire the SQL transactions.
Assuming that there is enough concurrency in the database, there is
no reason to believe that the majority of those active agents is
not effectively running. TPC-W loads have observed 100 of active
transactions at a time.

Ofcourse limiting the number of agents would reduce concurrently
running tasks, but would limit the responsiveness of the system.
Implementing a database in the kernel ala TUX doesn't seem to be
the right approach either (complexity, fault containment, ...)

Hope that is one example people accept.

I can dig up some information on WebSphere Applications.

I'd love to hear from some other applications that fall into
a similar category as the above and substantiate a bit the need
for 100s of running processes, without claiming that the
application is broke.

Hubertus Franke
Enterprise Linux Group (Mgr),  Linux Technology Center (Member Scalability)
, OS-PIC (Chair)
email: [EMAIL PROTECTED]
(w) 914-945-2003(fax) 914-945-4425   TL: 862-2003



Mark Hahn [EMAIL PROTECTED] on 04/04/2001 02:28:42 PM

To:   Hubertus Franke/Watson/IBM@IBMUS
cc:
Subject:  Re: a quest for a better scheduler



 ok if the runqueue length is limited to a very small multiple of the
#cpus.
 But that is not what high end server systems encounter.

do you have some example of this in mind?  so far, noone has
actually produced an example of a "high end" server that has
long runqueues.




-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: a quest for a better scheduler

2001-04-04 Thread Tim Wright


On Wed, Apr 04, 2001 at 03:23:34PM +0200, Ingo Molnar wrote:
 
 On Wed, 4 Apr 2001, Hubertus Franke wrote:
 
  I understand the dilemma that the Linux scheduler is in, namely
  satisfy the low end at all cost. [...]
 
 nope. The goal is to satisfy runnable processes in the range of NR_CPUS.
 You are playing word games by suggesting that the current behavior prefers
 'low end'. 'thousands of runnable processes' is not 'high end' at all,
 it's 'broken end'. Thousands of runnable processes are the sign of a
 broken application design, and 'fixing' the scheduler to perform better in
 that case is just fixing the symptom. [changing the scheduler to perform
 better in such situations is possible too, but all solutions proposed so
 far had strings attached.]
 


Ingo, you continue to assert this without giving much evidence to back it up.
All the world is not a web server. If I'm running a large OLTP database with
thousands of clients, it's not at all unreasonable to expect periods where
several hundred (forget the thousands) want to be serviced by the database
engine. That sounds like hundreds of schedulable entities be they processes
or threads or whatever. This sort of load is regularly run on machine with
16-64 CPUs.

Now I will admit that it is conceivable that you can design an application that
finds out how many CPUs are available, creates threads to match that number
and tries to divvy up the work between them using some combination of polling
and asynchronous I/O etc. There are, however a number of problems with this
approach:
1) It assumes that this is the only workload on the machine. If not it quickly
becomes sub-optimal due to interactions between the workloads. This is a
problem that the kernel scheduler does not suffer from.
2) It requires *every* application designer to design an effective scheduler
into their application. I would submit that an effective scheduler is better
situated in the operating system.
3) It is not a familiar programming paradigm to many Unix/Linux/POSIX
programmers, so you have a sort of impedance mismatch going on.

Since the proposed scheduler changes being talked about here have been shown
to not hurt the "low end" and to dramatically improve the "high end", I fail
to understand the hostility to the changes. I can understand that you do not
feel that this is the correct way to architect an application, but if the
changes don't hurt you, why sabotage changes that also allow a different
method to work. There isn't one true way to do anything in computing.

Tim

-- 
Tim Wright - [EMAIL PROTECTED] or [EMAIL PROTECTED] or [EMAIL PROTECTED]
IBM Linux Technology Center, Beaverton, Oregon
Interested in Linux scalability ? Look at http://lse.sourceforge.net/
"Nobody ever said I was charming, they said "Rimmer, you're a git!"" RD VI
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: a quest for a better scheduler

2001-04-04 Thread Christopher Smith


--On Wednesday, April 04, 2001 15:16:32 -0700 Tim Wright [EMAIL PROTECTED] 
wrote:
 On Wed, Apr 04, 2001 at 03:23:34PM +0200, Ingo Molnar wrote:
 nope. The goal is to satisfy runnable processes in the range of NR_CPUS.
 You are playing word games by suggesting that the current behavior
 prefers 'low end'. 'thousands of runnable processes' is not 'high end'
 at all, it's 'broken end'. Thousands of runnable processes are the sign
 of a broken application design, and 'fixing' the scheduler to perform
 better in that case is just fixing the symptom. [changing the scheduler
 to perform better in such situations is possible too, but all solutions
 proposed so far had strings attached.]

 Ingo, you continue to assert this without giving much evidence to back it
 up. All the world is not a web server. If I'm running a large OLTP
 database with thousands of clients, it's not at all unreasonable to
 expect periods where several hundred (forget the thousands) want to be
 serviced by the database engine. That sounds like hundreds of schedulable
 entities be they processes or threads or whatever. This sort of load is
 regularly run on machine with 16-64 CPUs.

Actually, it's not just OLTP, anytime you are doing time sharing between 
hundreds of users (something POSIX systems are supposed to be good at) this 
will happen.

 Now I will admit that it is conceivable that you can design an
 application that finds out how many CPUs are available, creates threads
 to match that number and tries to divvy up the work between them using
 some combination of polling and asynchronous I/O etc. There are, however
 a number of problems with this approach:

Actually, one way to semi-support this approach is to implement 
many-to-many threads as per the Solaris approach. This also requires 
significant hacking of both the kernel and the runtime, and certainly is 
significantly more error prone than trying to write a flexible scheduler.

One problem you didn't highlight that even the above case does not happily 
identify is that for security reasons you may very well need each user's 
requests to take place in a different process. If you don't, then you have 
to implement a very well tested and secure user-level security mechanism to 
ensure things like privacy (above and beyond the time-sharing).

The world is filled with a wide variety of types of applications, and 
unless you know two programming approaches are functionaly equivalent (and 
event driven/polling I/O vs. tons of running processes are NOT), you 
shouldn't say one approach is "broken". You could say it's a "broken" 
approach to building web servers. Unfortunately, things like kernels and 
standard libraries should work well in the general case.

--Chris
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: a quest for a better scheduler


I was actually suspecting that the extra lines in your patch were there for a
reason :)

A few questions:

What is the real impact of a (slight) change in scheduling semantics?

Under which situation one should notice a difference?

As you state in your papers the global decision comes with a cost, is it worth it?

Could you make a port of your thing on recent kernels?
I tried and I failed and I don't have enough time to figure out why, that should be
trivial for you though.

TIA, ciao,

 - Fabio

Mike Kravetz wrote:

> On Tue, Apr 03, 2001 at 05:18:03PM -0700, Fabio Riccardi wrote:
> >
> > I have measured the HP and not the "scalability" patch because the two do more
> > or less the same thing and give me the same performance advantages, but the
> > former is a lot simpler and I could port it with no effort on any recent
> > kernel.
>
> Actually, there is a significant difference between the HP patch and
> the one I developed.  In the HP patch, if there is a schedulable task
> on the 'local' (current CPU) runqueue it will ignore runnable tasks on
> other (remote) runqueues.  In the multi-queue patch I developed, the
> scheduler always attempts to make the same global scheduling decisions
> as the current scheduler.
>
> --
> Mike Kravetz [EMAIL PROTECTED]
> IBM Linux Technology Center

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: a quest for a better scheduler

On Tue, Apr 03, 2001 at 05:18:03PM -0700, Fabio Riccardi wrote:
> 
> I have measured the HP and not the "scalability" patch because the two do more
> or less the same thing and give me the same performance advantages, but the
> former is a lot simpler and I could port it with no effort on any recent
> kernel.

Actually, there is a significant difference between the HP patch and
the one I developed.  In the HP patch, if there is a schedulable task
on the 'local' (current CPU) runqueue it will ignore runnable tasks on
other (remote) runqueues.  In the multi-queue patch I developed, the
scheduler always attempts to make the same global scheduling decisions
as the current scheduler.

-- 
Mike Kravetz [EMAIL PROTECTED]
IBM Linux Technology Center
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: a quest for a better scheduler

2001-04-03 Thread Christopher Smith

--On Tuesday, April 03, 2001 18:17:30 -0700 Fabio Riccardi 
<[EMAIL PROTECTED]> wrote:
> Alan Cox wrote:
> Indeed, I'm using RT sigio/sigwait event scheduling, bare clone threads
> and zero-copy io.

Fabio, I'm working on a similar solution, although I'm experimenting with 
SGI's KAIO patch to see what it can do. I've had to patch the kernel to 
implement POSIX style signal dispatch symantics (so that the thread which 
posted an I/O request doesn't have to be the one which catches the signal). 
Are you taking a similar approach, or is the lack of this behavior the 
reason you are using so many threads?

--Chris
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: a quest for a better scheduler

Alan Cox wrote:

> > for the "normal case" performance see my other message.
>
> I did - and with a lot of interest

thanks! :)

> > I agree that a better threading model would surely help in a web server, but to
> > me this is not an excuse to live up with a broken scheduler.
>
> The problem has always been - alternative scheduler, crappier performance for
> 2 tasks running (which is most boxes). If your numbers are right then the
> HP patch is working as well for 1 or 2 tasks too

Please verify them if you have a couple of spare hours.

BTW: I measured similar results for the "scalability" patches on a 2.4.1 kernel, it
would be worth the effort to seriously compare them from an architectural point of
view, but I don't have the time right now...

> > Unless we want to maintain the position tha the only way to achieve good
> > performance is to embed server applications in the kernel, some minimal help
> > should be provided to goodwilling user applications :)
>
> Indeed. I'd love to see you beat tux entirely in userspace.  It proves the
> rest of the API for the kernel is right

Indeed, I'm using RT sigio/sigwait event scheduling, bare clone threads and
zero-copy io.

If only I had a really asynchronous sendfile, or a smarter madvise that wouldn't
require to map files :)

My server cannot execute dynamic stuff yet, it relies on Apache for that.

Running X15 and TUX in the same conditions (i.e. dynamic code in Apache) I get
exactly the same score in both cases.

I'm adding a TUX-like dynamic interface, I hope to get it to work by next week, then
I'll make a real confrontation.

Regards, ciao,

 - Fabio

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: a quest for a better scheduler

2001-04-03 Thread Alan Cox


> for the "normal case" performance see my other message.

I did - and with a lot of interest

> I agree that a better threading model would surely help in a web server, but to
> me this is not an excuse to live up with a broken scheduler.

The problem has always been - alternative scheduler, crappier performance for
2 tasks running (which is most boxes). If your numbers are right then the
HP patch is working as well for 1 or 2 tasks too

> Unless we want to maintain the position tha the only way to achieve good
> performance is to embed server applications in the kernel, some minimal help
> should be provided to goodwilling user applications :)

Indeed. I'd love to see you beat tux entirely in userspace.  It proves the
rest of the API for the kernel is right


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: a quest for a better scheduler

Alan,

for the "normal case" performance see my other message.

I agree that a better threading model would surely help in a web server, but to
me this is not an excuse to live up with a broken scheduler.

The X15 server I'm working on now is a sort of user-space TUX, it uses only 8
threads per CPU and it achieves the same level of performance of the kernel
space TUX. Even in this case the performance advantage of the multiqueue
scheduler is remarkable, especially on a multi-CPU (> 2) architecture.

To achieve some decent disk/CPU/network overlapping with the current linux
blocking disk IO limitations there is no way to avoid a "bunch of server
threads". I've (experimentally) figured out that 8-16 threads per CPU can
assure some reasonable overlapping, depending on the memory size and disk
subsystem speed. On a 8-way machine this means 64-128 active tasks, a total
disaster with the current scheduler.

Unless we want to maintain the position tha the only way to achieve good
performance is to embed server applications in the kernel, some minimal help
should be provided to goodwilling user applications :)

TIA, ciao,

 - Fabio

Alan Cox wrote:

> > Is there any special reason why any of those patches didn't make it to
> > the mainstream kernel code?
>
> All of them are worse for the normal case. Also 1500 running apache's isnt
> a remotely useful situation, you are thrashing the cache even if you are now
> not thrashing the scheduler. Use an httpd designed for that situation. Then
> you can also downgrade to a cheap pentium class box for the task ;)

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: a quest for a better scheduler


Dear all,

I've spent my afternoon running some benchmarks to see if MQ patches would
degrade performance in the "normal case".

To measure performance I've used the latest lmbench and I have mesured the
kernel compile times on a dual pentium III box runing at 1GHz with an 133MHz
bus.

Results (attached) show that there is no measurable difference in performace
between a vanilla scheduler and a multiqueue scheduler when running only few
processes (the compilation benchmark runs essentially two processes, one per
CPU).

I have measured the HP and not the "scalability" patch because the two do more
or less the same thing and give me the same performance advantages, but the
former is a lot simpler and I could port it with no effort on any recent
kernel. It is indeed interesting to see that this patch was originally designed
for a 2.4.0-test10 kernel, and still works fine on the latest kernels, only a
minor change (one line) was required. A version of the patch for the 2.4.2-ac26
kernel is attached to this message.

Given the zero impact on "normal case" performance and the huge positive impact
(> 20%) in the heavy load case (70-80 runnable processess on a load of about
1400 total) I don't see why such a thing shouldn't be accepted in the
mainstream scheduler.

 - Fabio



--- sched.c.origTue Mar 27 17:30:58 2001
+++ sched.c Tue Apr  3 16:45:21 2001
@@ -34,6 +34,7 @@
 extern void timer_bh(void);
 extern void tqueue_bh(void);
 extern void immediate_bh(void);
+static inline void hop_queues(struct task_struct *, int);
 
 /*
  * scheduler variables
@@ -90,7 +91,8 @@
 spinlock_t runqueue_lock __cacheline_aligned = SPIN_LOCK_UNLOCKED;  /* inner */
 rwlock_t tasklist_lock __cacheline_aligned = RW_LOCK_UNLOCKED; /* outer */
 
-static LIST_HEAD(runqueue_head);
+static struct list_head runqueue_head[NR_CPUS] = { 
+LIST_HEAD_INIT((runqueue_head[0]))};
+static LIST_HEAD(rt_queue_head);
 
 /*
  * We align per-CPU scheduling data on cacheline boundaries,
@@ -100,12 +102,15 @@
struct schedule_data {
struct task_struct * curr;
cycles_t last_schedule;
+   struct list_head runqueue_head;
} schedule_data;
char __pad [SMP_CACHE_BYTES];
-} aligned_data [NR_CPUS] __cacheline_aligned = { {{_task,0}}};
+} aligned_data [NR_CPUS] __cacheline_aligned = { {{_task,0,
+   LIST_HEAD_INIT((aligned_data[0].schedule_data.runqueue_head))}}};
 
 #define cpu_curr(cpu) aligned_data[(cpu)].schedule_data.curr
 #define last_schedule(cpu) aligned_data[(cpu)].schedule_data.last_schedule
+#define cpu_rq(cpu) (aligned_data[(cpu)].schedule_data.runqueue_head)
 
 struct kernel_stat kstat;
 
@@ -199,6 +204,33 @@
return goodness(p, cpu, prev->active_mm) - goodness(prev, cpu, 
prev->active_mm);
 }
 
+
+static inline int other_goodness(struct task_struct * p, int this_cpu, struct 
+mm_struct *this_mm)
+{
+   int weight;
+
+   /*
+* select the current process after every other
+* runnable process, but before the idle thread.
+* Also, dont trigger a counter recalculation.
+* 
+* Give the process a first-approximation goodness value
+* according to the number of clock-ticks it has left.
+*
+* Don't do any other calculations if the time slice is
+* over..
+*/
+   weight = p->counter;
+   if (!weight)
+   goto out2;
+   
+   /* .. and a slight advantage to the current MM */
+   if (p->mm == this_mm || !p->mm)
+   weight += 1;
+   weight += 20 - p->nice;
+out2:
+   return weight;
+}
 /*
  * This is ugly, but reschedule_idle() is very timing-critical.
  * We are called with the runqueue spinlock held and we must
@@ -266,6 +298,10 @@
} else {
if (oldest_idle == -1ULL) {
int prio = preemption_goodness(tsk, p, cpu);
+   /* 
+* this will never be true for < 400 HZ non
+* realtime. optimize this?  SAR
+*/
 
if (prio > max_prio) {
max_prio = prio;
@@ -277,6 +313,10 @@
tsk = target_tsk;
if (tsk) {
if (oldest_idle != -1ULL) {
+   /* push onto best queue */
+   if (p->policy == SCHED_OTHER){
+   hop_queues(p, tsk->processor);
+   }
best_cpu = tsk->processor;
goto send_now_idle;
}
@@ -306,20 +346,28 @@
  */
 static inline void add_to_runqueue(struct task_struct * p)
 {
-   list_add(>run_list, _head);
+   if (p->policy == SCHED_OTHER){
+   list_add(>run_list, _rq(p->processor));
+   } else  list_add(>run_list, _queue_head);
nr_running++;
 }
 
-static inline

Re: a quest for a better scheduler

On Tue, Apr 03, 2001 at 08:47:52PM +0200, Ingo Molnar wrote:
> 
> this restriction (independence of the priority from the previous process)
> is a fundamentally bad property, scheduling is nonlinear and affinity
> decisions depend on the previous context. [i had a priority-queue SMP
> scheduler running 2-3 years ago but dropped the idea due to this issue.]
> IMO it's more important to have a generic and flexible scheduler, and
> arbitrary, nonnatural restrictions tend to bite us later on.

It seems like we may be talking about two different things.

Our 'priority queue' implementation uses almost the same
goodness function as the current scheduler.  The main
difference between our 'priority queue' scheduler and the
current scheduler is the structure of the runqueue.  We
break up the single runqueue into a set of priority based
runqueues.  The 'goodness' of a task determines what sub-queue
a task is placed in.  Tasks with higher goodness values
are placed in higher priority queues than tasks with lower
goodness values.  This allows us to limit the scan of runnable
tasks to the highest priority sub-queue, as opposed to the
entire runquue.  When scanning the highest priority sub-queue
we use almost the same goodness function as that which is
used today (it does take CPU affinity into account).  In fact,
if we used the same goodness function I'm pretty sure we
would exactly match the behavior of the existing scheduler.

Hopefully, Hubertus Franke will speak up and provide more
details, as he is much more familiar with this implementation
than I am.

In any case, I think we have almost reached the conclusion
that our priority queue implementation may not be acceptable
due to the extra overhead in low thread counts.

> one issue regarding multiqueues is the ability of interactive processes to
> preempt other, lower priority processes on other CPUs. These sort of
> things tend to happen while using X. In a system where process priorities
> are different, scheduling decisions cannot be localized per-CPU
> (unfortunately), and 'the right to run' as such is a global thing.

Agreed.  This is why our multi-queue scheduler always attempts
to make global decisions.  It will preempt lower priority tasks
on other CPUs.  This implementation tends to make 'more
localized' scheduling decisions as contention on the runqueue
locks increase.  However, at this point one could argue that
we have moved away from a 'realistic' low task count system load.

> lmbench's lat_ctx for example, and other tools in lmbench trigger various
> scheduler workloads as well.

Thanks, I'll add these to our list.

-- 
Mike Kravetz [EMAIL PROTECTED]
IBM Linux Technology Center
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: a quest for a better scheduler

On Tue, 3 Apr 2001, Mike Kravetz wrote:

> [...] Currently, in this implementation we only deviate from the
> current scheduler in a small number of cases where tasks get a boost
> due to having the same memory map.

thread-thread-affinity pretty much makes it impossible to use a priority
queue. This 'small number of cases' is actually a fundamental issue: can
the 'goodness()' function be an arbitrary function of:

goodness(process_prev,process_next) := f(process_prev,process_next)

, or is the queue design restricting the choice of goodness() functions?
Priority queues for example restrict the choice of the goodness() function
to subset of functions:

goodness(process_prev,process_next) := f(process_next);

this restriction (independence of the priority from the previous process)
is a fundamentally bad property, scheduling is nonlinear and affinity
decisions depend on the previous context. [i had a priority-queue SMP
scheduler running 2-3 years ago but dropped the idea due to this issue.]
IMO it's more important to have a generic and flexible scheduler, and
arbitrary, nonnatural restrictions tend to bite us later on.

one issue regarding multiqueues is the ability of interactive processes to
preempt other, lower priority processes on other CPUs. These sort of
things tend to happen while using X. In a system where process priorities
are different, scheduling decisions cannot be localized per-CPU
(unfortunately), and 'the right to run' as such is a global thing.

> Can someone tell me what a good workload/benchmark would be to examine
> 'low thread count' performance? [...]

lmbench's lat_ctx for example, and other tools in lmbench trigger various
scheduler workloads as well.

Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: a quest for a better scheduler

On Tue, Apr 03, 2001 at 10:55:12AM +0200, Ingo Molnar wrote:
> 
> you can easily make the scheduler fast in the many-processes case by
> sacrificing functionality in the much more realistic, few-processes case.
> None of the patch i've seen so far maintained the current scheduler's
> few-processes logic. But i invite you to improve the current scheduler's
> many-processes behavior, without hurting its behavior in the few-processes
> case.
> 

Maintaining the current scheduler's logic is exactly what we are trying
to do in the projects at:

http://lse.sourceforge.net/scheduling/

A common design goal for the the two alternative scheduler
implementations at this site is to maintain the current scheduler's
behavior/scheduling decisions.  In the case of the priority queue
scheduler, we have actually used a copy of the existing scheduler
running in parallel (in the same kernel) to determine if we are
making the same scheduling decisions.  Currently, in this implementation
we only deviate from the current scheduler in a small number of cases
where tasks get a boost due to having the same memory map.

The multi-queue implementation is more interesting.  It is also
designed to maintain the behavior of the current scheduler.  However,
as the runqueues get longer (and we start getting contention on the
runqueue locks) it starts to deviate from existing scheduler behavior
and make more local scheduling decisions.  Ideally, this implementation
will exhibit the behavior of the current scheduler at low thread
counts and make more localized decisions as pressure on the scheduler
is increased.

Neither of these implementations are at a point where I would advocate
their adoption; yet.

Can someone tell me what a good workload/benchmark would be to
examine 'low thread count' performance?  In the past people have
used the 'spinning on sched_yield' benchmark.  However, this now
makes little sense with the sched_yield optimizations introduced
in 2.4.  In addition, such a benchmark mostly ignores the
'reschedule_idle' component of the scheduler.  We have developed
a 'token passing' benchmark which attempts to address these issues
(called reflex at the above site).  However, I would really like
to get a pointer to a community acceptable workload/benchmark for
these low thread cases.

-- 
Mike Kravetz [EMAIL PROTECTED]
IBM Linux Technology Center
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: a quest for a better scheduler

2001-04-03 Thread Alan Cox


> Is there any special reason why any of those patches didn't make it to
> the mainstream kernel code?

All of them are worse for the normal case. Also 1500 running apache's isnt
a remotely useful situation, you are thrashing the cache even if you are now
not thrashing the scheduler. Use an httpd designed for that situation. Then
you can also downgrade to a cheap pentium class box for the task ;)

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: a quest for a better scheduler

On Mon, 2 Apr 2001, Fabio Riccardi wrote:

> I sent a message a few days ago about some limitations I found in the
> linux scheduler.
>
> In servers like Apache where a large (> 1000) number of processes can
> be running at the same time and where many of them are runnable at the
> same time, the default Linux scheduler just starts trashing and the
> machine becomes very rapidly unusable.

it is *not* a scheduler problem. This is an application design problem.
Do not expect > 1000 runnable processes to perform well. Apache's
one-process-per-parallel-client-connection application model is especially
bad for such kind of loads. The scheduler has more important priorities to
worry about than to fix application design problems.

> Indeed I've tried the patches available on the sites for the
> multi-queue scheduler and I was amazed by the performance improvement
> that I got. Both patches allow me to get to a 100% real CPU
> utilization on a two way machine running ~1500 processes.

There were many promises along the years, and none as of now met all the
requirements the scheduler has to fulfill. 'many runnable processes' is
not on the top of the list - if we are scheduling like mad then we have
lost lots of performance already.

even with such 'improvements', the many-parallel-clients performance of
one-process-per-request HTTP server designs is an order slower than what
is possible.

having said that, improving this workload is still a useful task and we
added improvements to the scheduler to improve this case. But those
patches sacrifice other functionality just to get better
many-runnable-processes performance, which is unacceptable.

> What those patches do is quite simple, instead of having the single
> global process queue present in the normal Linux scheduler, they add
> multiple queues (one per CPU). In this way the scheduling decision can
> be greatly simplified and almost made local to each CPU. No hotspots,
> no global locks (well, almost).

the problem is that the *real* scheduling complexity (and needed
functionality) shows when the number of runnable processes is in the
neighborhood of the number of processors. Running many processes just
masks eg. load-balancing deficiencies in schedulers, so obviously the
simplest and most localized scheduler 'wins'. So comparing schedulers
based on many-runnable-processes load is like comparing cars solely based
on the size of their fuel tanks.

> Although some scalability problems are still there (there still is a
> global decision to make), the performance improvement obtained and the
> simplicity of the solution are remarkable.

you can easily make the scheduler fast in the many-processes case by
sacrificing functionality in the much more realistic, few-processes case.
None of the patch i've seen so far maintained the current scheduler's
few-processes logic. But i invite you to improve the current scheduler's
many-processes behavior, without hurting its behavior in the few-processes
case.

Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: a quest for a better scheduler



On Mon, 2 Apr 2001, Fabio Riccardi wrote:

 I sent a message a few days ago about some limitations I found in the
 linux scheduler.

 In servers like Apache where a large ( 1000) number of processes can
 be running at the same time and where many of them are runnable at the
 same time, the default Linux scheduler just starts trashing and the
 machine becomes very rapidly unusable.

it is *not* a scheduler problem. This is an application design problem.
Do not expect  1000 runnable processes to perform well. Apache's
one-process-per-parallel-client-connection application model is especially
bad for such kind of loads. The scheduler has more important priorities to
worry about than to fix application design problems.

 Indeed I've tried the patches available on the sites for the
 multi-queue scheduler and I was amazed by the performance improvement
 that I got. Both patches allow me to get to a 100% real CPU
 utilization on a two way machine running ~1500 processes.

There were many promises along the years, and none as of now met all the
requirements the scheduler has to fulfill. 'many runnable processes' is
not on the top of the list - if we are scheduling like mad then we have
lost lots of performance already.

even with such 'improvements', the many-parallel-clients performance of
one-process-per-request HTTP server designs is an order slower than what
is possible.

having said that, improving this workload is still a useful task and we
added improvements to the scheduler to improve this case. But those
patches sacrifice other functionality just to get better
many-runnable-processes performance, which is unacceptable.

 What those patches do is quite simple, instead of having the single
 global process queue present in the normal Linux scheduler, they add
 multiple queues (one per CPU). In this way the scheduling decision can
 be greatly simplified and almost made local to each CPU. No hotspots,
 no global locks (well, almost).

the problem is that the *real* scheduling complexity (and needed
functionality) shows when the number of runnable processes is in the
neighborhood of the number of processors. Running many processes just
masks eg. load-balancing deficiencies in schedulers, so obviously the
simplest and most localized scheduler 'wins'. So comparing schedulers
based on many-runnable-processes load is like comparing cars solely based
on the size of their fuel tanks.

 Although some scalability problems are still there (there still is a
 global decision to make), the performance improvement obtained and the
 simplicity of the solution are remarkable.

you can easily make the scheduler fast in the many-processes case by
sacrificing functionality in the much more realistic, few-processes case.
None of the patch i've seen so far maintained the current scheduler's
few-processes logic. But i invite you to improve the current scheduler's
many-processes behavior, without hurting its behavior in the few-processes
case.

Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: a quest for a better scheduler

2001-04-03 Thread Alan Cox


 Is there any special reason why any of those patches didn't make it to
 the mainstream kernel code?

All of them are worse for the normal case. Also 1500 running apache's isnt
a remotely useful situation, you are thrashing the cache even if you are now
not thrashing the scheduler. Use an httpd designed for that situation. Then
you can also downgrade to a cheap pentium class box for the task ;)

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: a quest for a better scheduler


On Tue, Apr 03, 2001 at 10:55:12AM +0200, Ingo Molnar wrote:
 
 you can easily make the scheduler fast in the many-processes case by
 sacrificing functionality in the much more realistic, few-processes case.
 None of the patch i've seen so far maintained the current scheduler's
 few-processes logic. But i invite you to improve the current scheduler's
 many-processes behavior, without hurting its behavior in the few-processes
 case.
 

Maintaining the current scheduler's logic is exactly what we are trying
to do in the projects at:

http://lse.sourceforge.net/scheduling/

A common design goal for the the two alternative scheduler
implementations at this site is to maintain the current scheduler's
behavior/scheduling decisions.  In the case of the priority queue
scheduler, we have actually used a copy of the existing scheduler
running in parallel (in the same kernel) to determine if we are
making the same scheduling decisions.  Currently, in this implementation
we only deviate from the current scheduler in a small number of cases
where tasks get a boost due to having the same memory map.

The multi-queue implementation is more interesting.  It is also
designed to maintain the behavior of the current scheduler.  However,
as the runqueues get longer (and we start getting contention on the
runqueue locks) it starts to deviate from existing scheduler behavior
and make more local scheduling decisions.  Ideally, this implementation
will exhibit the behavior of the current scheduler at low thread
counts and make more localized decisions as pressure on the scheduler
is increased.

Neither of these implementations are at a point where I would advocate
their adoption; yet.

Can someone tell me what a good workload/benchmark would be to
examine 'low thread count' performance?  In the past people have
used the 'spinning on sched_yield' benchmark.  However, this now
makes little sense with the sched_yield optimizations introduced
in 2.4.  In addition, such a benchmark mostly ignores the
'reschedule_idle' component of the scheduler.  We have developed
a 'token passing' benchmark which attempts to address these issues
(called reflex at the above site).  However, I would really like
to get a pointer to a community acceptable workload/benchmark for
these low thread cases.

-- 
Mike Kravetz [EMAIL PROTECTED]
IBM Linux Technology Center
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: a quest for a better scheduler