Re: wake_up vs. wake_up_sync
Manfred, Calling this a BUG is misleading. It is ok to be occasionally wrong regarding the preemption priority as long as RT tasks are not involved. This is due to the fact that PROC_CHANGE_PENALTIES are used, which already provide for some priority inversion. Hubertus Franke email: [EMAIL PROTECTED] (w) 914-945-2003(fax) 914-945-4425 TL: 862-2003 Manfred Spraul <[EMAIL PROTECTED]>@vger.kernel.org on 06/27/2001 06:41:29 PM Sent by: [EMAIL PROTECTED] To: Mike Kravetz <[EMAIL PROTECTED]> cc: Scott Long <[EMAIL PROTECTED]>, [EMAIL PROTECTED] Subject: Re: wake_up vs. wake_up_sync Mike Kravetz wrote: > > On Wed, Jun 27, 2001 at 11:22:19PM +0200, Manfred Spraul wrote: > > > Why would you want to prevent > > > reschedule_idle()? > > > > > If one process runs, wakes up another process and _knows_ that it's > > going to sleep immediately after the wake_up it doesn't need the > > reschedule_idle: the current cpu will be idle soon, the scheduler > > doesn't need to find another cpu for the woken up thread. > > I'm curious. How does the caller of wake_up_sync know that the > current cpu will soon be idle. Does it assume that there are no > other tasks on the runqueue waiting for a CPU? If there are other > tasks on the runqueue, isn't it possible that another task has a > higher goodness value than the task being awakened. In such a case, > isn't is possible that the awakened task could sit on the runqueue > (waiting for a CPU) while tasks with a lower goodness value are > allowed to run? > I found one combination where that could happen: process.thread A.1: highest priority, runs on cpu0 B.1: lowest priority, runs on cpu1 A.2: another thread of process A, priority PROC_CHANGE_PENALTY+PRIORITY(B.1)+1, sleeping. B.2: same priority as A.2, sleeping, same process as B.1 A.1: { wake_up("A.2"); /* nothing happens: preemption_goodness is 0 since B.1 has both PROC_CHANGE_PENALTY and the += 1 from 'same mm' */ wake_up_sync("B.2"); schedule(); /* schedule selects A.2 instead of B.2 due to the += 1 from 'same mm'. BUG: B.2 should replace B.1 on cpu1. The preemption_goodness is 1. */ IMHO obscure and very rare. But I just found a bigger problem: If wake_up_sync wakes up more than 1 process then cpus could remain in cpu_idle() although processes are on the runqueue without cpus. -- Manfred - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
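As an aside for readers of this thread: the usage pattern under discussion is roughly the following, shown here as a minimal, hypothetical 2.4-style sketch. The queue and function names are made up; only the pairing of a _sync wakeup with an immediately following schedule() is the point.

	#include <linux/sched.h>
	#include <linux/wait.h>

	/* Hypothetical hand-off path: the waker knows it blocks right after the
	 * wakeup, so the _sync variant skips reschedule_idle() and the woken
	 * task is simply picked up by this CPU at the schedule() call below. */
	static DECLARE_WAIT_QUEUE_HEAD(consumer_wq);
	static DECLARE_WAIT_QUEUE_HEAD(producer_wq);

	static void producer_hand_off(void)
	{
		wake_up_interruptible_sync(&consumer_wq);  /* no cross-CPU reschedule */

		/* ...we now go to sleep ourselves... */
		interruptible_sleep_on(&producer_wq);      /* this CPU frees up here */
	}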
Re: threading question
>I got that response too. When I pressed kernel people for details it turns >out that they think having hundreds of runnable threads/processes (mostly >the same thing under Linux) is wasteful. The scheduler is just not optimised >for that. Try out the http://lse.sourceforge.net/scheduling patches. The MQ kernel scheduler sure can handle this kind of load :-) Hubertus Franke Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability) , OS-PIC (Chair) email: [EMAIL PROTECTED] (w) 914-945-2003(fax) 914-945-4425 TL: 862-2003 bert hubert <[EMAIL PROTECTED]>@vger.kernel.org on 06/13/2001 01:31:39 PM Sent by: [EMAIL PROTECTED] To: [EMAIL PROTECTED] cc: Subject: Re: threading question On Tue, Jun 12, 2001 at 12:06:40PM -0700, Kip Macy wrote: > This may sound like flamebait, but its not. Linux threads are basically > just processes that share the same address space. Their performance is > measurably worse than it is on most commercial Unixes and FreeBSD. Thread creation may be a bit slow. But the kludges to provide posix threads completely from userspace also hurt. Notably, they do not scale over multiple CPUs. > They are not, or at least two years ago, were not POSIX compliant > (they behaved badly with respect to signals). The impoverished POSIX threads are silly with respect to signals. I do almost all my programming these days with pthreads and I find that I really do not miss signals at all. > from Larry McVoy's home page attributed to Alan Cox illustrates this > reasonably well: "A computer is a state machine. Threads are for people > who can't program state machines." Sorry for not being more helpful. I got that response too. When I pressed kernel people for details it turns out that they think having hundreds of runnable threads/processes (mostly the same thing under Linux) is wasteful. The scheduler is just not optimised for that. Regards, bert -- http://www.PowerDNS.com Versatile DNS Services Trilab The Technology People 'SYN! .. SYN|ACK! .. ACK!' - the mating call of the internet - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.4 sluggish under fork load
I pointed that out to the person who proposed this and gave him a fix that ensures that the child ends up with a counter value at least 2 higher than the parent's. Giving the child everything and the parent nothing is TOTAL BOGUS: the parent essentially has to wait for a recalculate. This so-called fix has to go into the next release.

Hubertus Franke
Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability), OS-PIC (Chair)
email: [EMAIL PROTECTED]
(w) 914-945-2003 (fax) 914-945-4425 TL: 862-2003

"Adam J. Richter" <[EMAIL PROTECTED]>@vger.kernel.org on 05/01/2001 12:18:10 AM
Sent by: [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
cc: [EMAIL PROTECTED]
Subject: Re: 2.4.4 sluggish under fork load

>The fact that 2.4.4 gives the whole timeslice to the child
>is just bogus to begin with.

I only did that because I could not find another way to make the child run first that worked in practice. I tried other things before that. Since Peter Osterlund's SCHED_YIELD thing works, we no longer have to give all of the CPU to the child.

The scheduler time slices are currently enormous, so as long as the child gets even one clock tick before the parent runs, it should reach the exec() if that is its plan. 1 tick = 10ms = 10 million cycles on a 1GHz CPU, which should be enough time to encrypt my /boot/vmlinux in twofish if it's in RAM.

Adam J. Richter __ __ 4880 Stevens Creek Blvd, Suite 104
[EMAIL PROTECTED] \ / San Jose, California 95129-1034
+1 408 261-6630 | g g d r a s i l United States of America
fax +1 408 261-6631 "Free Software For The Rest Of Us."

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: PATCH(?): linux-2.4.4-pre2: fork should run child first
First, you are wrong in assuming that setting current->counter = 0 will guarantee that the child runs first. On SMP it only means that this might initiate another recalculate, and the parent could run from there in parallel with the child.

Actually, quickly looking at the source code here, you don't have to call schedule() at all. A bit further down wake_up_process(p) is called, which in turn calls reschedule_idle(p). Hence we don't have to call schedule(). If one satisfies the condition

	(preemption_goodness(current, p, p->processor) > 1)

then the child should run [ with (child == p) and (parent == current) ]. This is for the uniprocessor system; on SMP both could continue to run.

Looking at the goodness computation, since p->mm == current->mm, p->nice == current->nice and p->processor == current->processor, all that matters is the difference in the counter values. My proposed patch always yields a difference of 1, which of course doesn't have the desired effect. Here is a patch that always yields a difference of 2; however, for an odd current->counter it loses a token between the two:

	{
		long parcnt = current->counter;

		p->counter = (parcnt + ((parcnt & 1) ? 1 : 2)) >> 1;
		parcnt >>= 1;
		if (parcnt <= 0) {
			current->counter = 0;
			current->need_resched = 1;
		} else {
			current->counter = parcnt - 1;
		}
	}

There is the other view that I should not lose a token. In that case the following code will add the token back in the odd counter case. I think this is preferable over the first solution:

	p->counter = (current->counter + 3) >> 1;
	current->counter = (current->counter >> 1) - 1;
	if (current->counter <= 0) {
		current->counter = 0;
		current->need_resched = 1;
	}

Hubertus Franke
Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability), OS-PIC (Chair)
email: [EMAIL PROTECTED]
(w) 914-945-2003 (fax) 914-945-4425 TL: 862-2003

"Adam J. Richter" <[EMAIL PROTECTED]> on 04/12/2001 08:42:12 PM
To: Hubertus Franke/Watson/IBM@IBMUS
cc:
Subject: Re: PATCH(?): linux-2.4.4-pre2: fork should run child first

>p->counter = (current->counter + 1) >> 1;
>current->counter = (current->counter - 1) >> 1;
>schedule();

I don't have time to try this right now and I'm not sure what locks are held at that point in the code or whether schedule() will actually schedule a different process if the current one has current->counter > 0 (even if current->need_resched is set and even if another process has a higher proc->counter value).

Even if it did work, your code is more complex and makes it less likely that the child will reach exec() before the parent runs again. So, I am not sure I would see the advantage if it did work.

Adam J. Richter __ __ 4880 Stevens Creek Blvd, Suite 104
[EMAIL PROTECTED] \ / San Jose, California 95129-1034
+1 408 261-6630 | g g d r a s i l United States of America
fax +1 408 261-6631 "Free Software For The Rest Of Us."

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
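As a quick worked example of the two splits above: for current->counter = 8, both variants give the child 5 ticks and the parent 3, a difference of 2 with all 8 ticks preserved; for current->counter = 7, the first variant gives 4 and 2 (one tick lost), while the second gives 5 and 2, keeping all 7 ticks by handing the odd tick to the child.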
Re: PATCH(?): linux-2.4.4-pre2: fork should run child first
First the 2.4.3 tries to prefer the child. Only does it in half of the cases though (odd counter numbers). Your patch seems a bit radical for what you want to do. Taking away all tokens from the parents will require that it has to wait until the next recalculate loop. Since (p) and (current) share the same and the same why not simply make sure that the (p->counter) > (current->counter). If you are concerned that a tick is not enough to fire off the exec then make it a predefined constant. Try this ... this will guarantee that (p->counter) > (current->counter) and it seems not as radical p->counter = (current->counter + 1) >> 1; current->counter = (current->counter - 1) >> 1; if (!current->counter) current->need_resched = 1; instead of this - p->counter = (current->counter + 1) >> 1; - current->counter >>= 1; - if (!current->counter) - current->need_resched = 1; + p->counter = current->counter; + current->counter = 0; + current->need_resched = 1; Hubertus Franke Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability) , OS-PIC (Chair) email: [EMAIL PROTECTED] (w) 914-945-2003(fax) 914-945-4425 TL: 862-2003 "Adam J. Richter" <[EMAIL PROTECTED]>@vger.kernel.org on 04/12/2001 04:55:16 AM Sent by: [EMAIL PROTECTED] To: [EMAIL PROTECTED], [EMAIL PROTECTED] cc: Subject: PATCH(?): linux-2.4.4-pre2: fork should run child first I remember sometime in the late 80's a fellow at UniSoft named Don whose last name escapes me just now told me about a paper presented at a Usenix symposium that had some measurements that purported that copy-on-write was a performance lose and better performance would be achieve by having fork() just copy all of the writable pages of the parent process. It turned out that the particular unix-like system on which these benchmarks were taken had a version of fork that did not run the child first. As it was explained to me then, most of the time, the child process from a fork will do just a few things and then do an exec(), releasing its copy-on-write references to the parent's pages, and that is the big win of copy-on-write for fork() in practice. This oversight was considered a big embarassment for the operating system in question, so I won't name it here. Guess why you're seeing this email. That's right. Linux-2.4.3's fork() does not run the child first. Consequently, the parent will probably generate unnecessary copy-on-write page copies until it burns through its remaining clock ticks (any COW's that the child causes will basically happen no matter what the order of execution is) or calls wait() (and while the wait is blocking, the parent's CPU priority will decay as the scheduler periodically recalculates process priorities, so that bit of dynamic priority has probably not been allocated where the user will be able to use it, if we want to look at "fairness" in such detail). I suppose that running the child first also has a minor advantage for clone() in that it should make programs that spawn lots of threads to do little bits of work behave better on machines with a small number of processors, since the threads that do so little work that they accomplish they finish within their time slice will not pile up before they have a chance to run. So, rather than give the parent's CPU priority to the child only if CLONE_VFORK is not set, I have decided to do a bit of machete surgery and have the child always inherit all of the parent's CPU priority all of the time. 
It simplifies the code and probably saves a few clock cycles (and before you say that this will cost a context switch, consider that the child will almost always run at least one time slice anyhow). I have attached the patch below. I have also adjusted the comment describing the code. Please let me know if this hand waving explanation is sufficient. I'm trying to be lazy on not do a measurement project to justify this relatively simple change. However, I do know, from a simple test program ("printf ("%d", fork());"), that this patch has the intended effect of running the child first. -- Adam J. Richter __ __ 4880 Stevens Creek Blvd, Suite 104 [EMAIL PROTECTED] \ / San Jose, California 95129-1034 +1 408 261-6630 | g g d r a s i l United States of America fax +1 408 261-6631 "Free Software For The Rest Of Us." - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
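The one-liner mentioned above is easy to expand into a self-contained test; this is an illustrative sketch (not Adam's exact program). With a child-runs-first fork, the child's line should normally appear first on a terminal, where stdout is line-buffered:

	#include <stdio.h>
	#include <unistd.h>
	#include <sys/types.h>
	#include <sys/wait.h>

	int main(void)
	{
		pid_t pid = fork();

		if (pid == 0) {
			printf("child\n");	/* should print first if the child really runs first */
			_exit(0);
		}
		printf("parent\n");
		waitpid(pid, NULL, 0);
		return 0;
	}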
Bug in sys_sched_yield
In the recent optimizations to sys_sched_yield a bug was introduced. In the current implementation of sys_sched_yield() the aligned_data and idle_tasks are indexed by logical cpu-#. They should however be indexed by physical cpu-#. Since logical==physical on the x86 platform it doesn't matter there; for other platforms where this is not true it will matter. Below is the fix.

diff -uwrbBN linux-2.4.3/kernel/sched.c linux-2.4.3-fix/kernel/sched.c
--- linux-2.4.3/kernel/sched.c	Thu Mar 22 12:20:45 2001
+++ linux-2.4.3-fix/kernel/sched.c	Wed Apr 11 11:27:16 2001
@@ -1024,9 +1024,11 @@
 	int i;
 
 	// Substract non-idle processes running on other CPUs.
-	for (i = 0; i < smp_num_cpus; i++)
-		if (aligned_data[i].schedule_data.curr != idle_task(i))
+	for (i = 0; i < smp_num_cpus; i++) {
+		int cpu = cpu_logical_map(i);
+		if (aligned_data[cpu].schedule_data.curr != idle_task(cpu))
 			nr_pending--;
+	}
 #else
 	// on UP this process is on the runqueue as well
 	nr_pending--;

Hubertus Franke
Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability)
email: [EMAIL PROTECTED]
(w) 914-945-2003 (fax) 914-945-4425 TL: 862-2003

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: a quest for a better scheduler
Torrey, nothing to worry about (or at least not yet). The new scheduler only replaces the current SMP scheduler, not the single cpu scheduler. So the devices that you refer to are not affected at all by these changes. The desire for interactivity etc, lead us to stick to the current proposed MQ scheduler semantics, namely not to completely isolate the runqueues from each other (e.g. the HP-MQ does so), but do some cross checking to ensure that high priority tasks are run when they need to. First you get the same (similar good or bad scheduler semantics as the current one) but at a significantly lower cost for medium to high end loads (defined as #runnable tasks > #cpus) Here is a simple approximate arithmetic. Assume the following: - Let X be the fixed overhead to get through the scheduler (cur or MQ) once. - Let Y the overhead of touching a task to inspect its goodness - Let N be the number of tasks - Let C be the number of cpus. The cost for the current scheduler is: X(cur) + N*Y. The cost for the MQ scheduler is: X(MQ) + (N/C)*Y + C*Y I assumed here equal runqueue length. You can see that this ballpark math shows that if X(cur) ~ X(MQ) MQ is expected to be better when (N/C) + C < N or if ( N > C*C/(C-1) ) C=2 N>4 C=3 N>4 C=4 N>5 C=5 N>6 C=6 N>7 C=7 N>8 C=8 N>9 Turns out that for C>2 this amounts to N > C+1 MQ will do better. Now this is in the face of lack of runqueue lock contention. When runqueue lock contention surfaces, then this will shift in favor of MQ. Hubertus Franke Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability) , OS-PIC (Chair) email: [EMAIL PROTECTED] (w) 914-945-2003(fax) 914-945-4425 TL: 862-2003 Torrey Hoffman <[EMAIL PROTECTED]>@vger.kernel.org on 04/05/2001 07:53:27 PM Sent by: [EMAIL PROTECTED] To: "'Timothy D. Witham'" <[EMAIL PROTECTED]>, Linux Kernel List <[EMAIL PROTECTED]> cc: Subject: RE: a quest for a better scheduler Timothy D. Witham wrote : [...] > I propose that we work on setting up a straight forward test harness > that allows developers to quickly test a kernel patch against > various performance yardsticks. [... (proposed large server testbeds) ...] I like this idea, but could the testbeds also include some resource-constrained "embedded" or appliance-style systems, and include performance yardsticks for latency and responsiveness? It would be unfortunate if work on a revised scheduler resulted in Linux being a monster web server on 4-way systems, but having lousy interactive performance on web pads, hand helds, and set top boxes. How about a 120Mhz Pentium with 32MB of RAM and a flash disk, a 200 Mhz PowerPC with 64 MB? Maybe a Transmeta web pad? For the process load, something that tests responsiveness and latency - how about a set of processes doing multicast network transfers, CPU-intensive MPEG video and audio encode / decode, and a test app that measures "user response", i.e. if X Windows was running, would the mouse pointer move smoothly or jerkily? The "better" scheduler for these applications would make sure that multicast packets were never dropped, the MPEG decode never dropped frames, and the "user interface" stayed responsive, etc. Also, I would not mind a bit if the kernel had tuning options, either in configuration or through /proc to adjust for these very different situations. 
Torrey Hoffman [EMAIL PROTECTED] - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
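To make Hubertus's break-even arithmetic above easy to check, here is a small throwaway user-space program (an illustration added here, not part of the original thread) that finds the smallest N for which the MQ cost model wins, reproducing the C=2..8 table:

	#include <stdio.h>

	/* Cost model from the message above, with X(cur) ~= X(MQ) cancelled out:
	 * current scheduler ~ N*Y, MQ ~ (N/C)*Y + C*Y.  MQ wins once N/C + C < N,
	 * i.e. N > C*C/(C-1).  Print the smallest integer N per CPU count C. */
	int main(void)
	{
		int c, n;

		for (c = 2; c <= 8; c++) {
			n = c;
			while ((double)n / c + c >= n)
				n++;
			printf("C=%d: MQ cheaper for N >= %d\n", c, n);
		}
		return 0;
	}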
Re: a quest for a better scheduler
Exellent idea Assuming that you have set up these environments already, what would be a real treat is to get the average runqueue length at a given time, for instance every second or so, while running some of these more sophisticated server oriented applications that you mention. >From that we can see which of the various assumption regarding runqueue length is holding up, when the runqueue is not empty. This would help the current discussion trememdously as we don't seem to be able to even agree on this. Simple bincount could do. If you want a kernel patch that counts every scheduling cycle I'll write it. Hubertus Franke Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability) , OS-PIC (Chair) email: [EMAIL PROTECTED] (w) 914-945-2003(fax) 914-945-4425 TL: 862-2003 "Timothy D. Witham" <[EMAIL PROTECTED]>@vger.kernel.org on 04/05/2001 06:38:41 PM Sent by: [EMAIL PROTECTED] To: Linux Kernel List <[EMAIL PROTECTED]> cc: [EMAIL PROTECTED] Subject: Re: a quest for a better scheduler I have been following this thread and thinking that everybody has some truth in what they are saying but with the absence of a repeatable test environment there really isn't a way of arriving at a data driven decision. Given the following conditions. 1)The diversity of the problem sets that developers are working on results in a real concern that another developers solution to their performance issue might result in a worsening of my performance situation. 2)Most of the developers have a given set of tests that they use when they make changes to determine if they change did what they want. 3)The Open Source Development Lab has the faculties for providing a large scale testing environment on several different configurations that remain the same over time. I propose that we work on setting up a straight forward test harness that allows developers to quickly test a kernel patch against various performance yardsticks. If we use the same set of hardware for the generation of the performance data from patch to patch there will be a correlation between the runs allowing for a real comparison of the different changes. The following should be taken only as a starting point. As for the base hardware configurations I propose: 2 way with 1 GB main memory and 2 IDE drives 4 way with 4 GB main memory and 16 disk drives 8 way with 8 GB main memory and 24 disk drives The types of applications that I have heard people express concern are: Web infrastructure: Apache TUX Jabber Corporate infrastructure: NFS raw TCP/IP performance Samba Database performance: Raw storage I/O performance OLTP workload General usage: compile speed (usually measured by kernel compile) The OSDL also has the ability to make some of the "for fee" benchmarks available for use for testing that is done onsite (network access to OSDL machines counts as onsite) so that people don't have to purchase individual copies. Things like SECMAIL2001, SPECSFS2.0 and SEPCWEB99 come to mind. Comments, suggestions, volunteers? -- Timothy D. Witham - Lab Director - [EMAIL PROTECTED] Open Source Development Lab Inc - A non-profit corporation 15275 SW Koll Parkway - Suite H - Beaverton OR, 97006 (503)-626-2455 x11 (office)(503)-702-2871 (cell) (503)-626-2455 (fax) On Wed, Apr 04, 2001 at 03:54:54PM -0700, Christopher Smith wrote: > --On Wednesday, April 04, 2001 15:16:32 -0700 Tim Wright <[EMAIL PROTECTED]> > wrote: > > On Wed, Apr 04, 2001 at 03:23:34PM +0200, Ingo Molnar wrote: > >> nope. The goal is to satisfy runnable processes in the range of NR_CPUS. 
> >> You are playing word games by suggesting that the current behavior > >> prefers 'low end'. 'thousands of runnable processes' is not 'high end' > >> at all, it's 'broken end'. Thousands of runnable processes are the sign > >> of a broken application design, and 'fixing' the scheduler to perform > >> better in that case is just fixing the symptom. [changing the scheduler > >> to perform better in such situations is possible too, but all solutions > >> proposed so far had strings attached.] > > > > Ingo, you continue to assert this without giving much evidence to back it > > up. All the world is not a web server. If I'm running a large OLTP > > database with thousands of clients, it's not at all unreasonable to > > expect periods where several hundred (forget the thousands) want to be > > serviced by the database engine. That sounds like hundreds of schedulable > > entities be they processes or threads or whatever. This sort of load is > > regularly run on machi
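Relatedly, the "simple bincount" that Hubertus offers a few messages up could look roughly like the sketch below. This is only an illustration against the 2.4 scheduler's globals (nr_running); the names and the /proc hookup are hypothetical.

	/* Histogram of runqueue length, sampled once per schedule() call.
	 * rqlen_hist[] would be dumped through a /proc entry (not shown). */
	#define RQLEN_BINS 128

	static unsigned long rqlen_hist[RQLEN_BINS];

	static inline void account_rqlen(void)
	{
		unsigned long len = nr_running;	/* 2.4 global count of runnable tasks */

		if (len >= RQLEN_BINS)
			len = RQLEN_BINS - 1;
		rqlen_hist[len]++;
	}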
Re: a quest for a better scheduler
I give you a concrete example: Running DB2 on an SMP system. In DB2 there is a processes/thread pool that is sized based on memory and numcpus. People tell me that the size of this pool is in the order of 100s for an 8-way system with reasonable sized database. These determine the number of agents that can simultaneously execute an SQL statement. Requests are flying in for transactions (e.g. driven by TPC-W like applications). The agents are grepped from the pool and concurrently fire the SQL transactions. Assuming that there is enough concurrency in the database, there is no reason to believe that the majority of those active agents is not effectively running. TPC-W loads have observed 100 of active transactions at a time. Ofcourse limiting the number of agents would reduce concurrently running tasks, but would limit the responsiveness of the system. Implementing a database in the kernel ala TUX doesn't seem to be the right approach either (complexity, fault containment, ...) Hope that is one example people accept. I can dig up some information on WebSphere Applications. I'd love to hear from some other applications that fall into a similar category as the above and substantiate a bit the need for 100s of running processes, without claiming that the application is broke. Hubertus Franke Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability) , OS-PIC (Chair) email: [EMAIL PROTECTED] (w) 914-945-2003(fax) 914-945-4425 TL: 862-2003 Mark Hahn <[EMAIL PROTECTED]> on 04/04/2001 02:28:42 PM To: Hubertus Franke/Watson/IBM@IBMUS cc: Subject: Re: a quest for a better scheduler > ok if the runqueue length is limited to a very small multiple of the #cpus. > But that is not what high end server systems encounter. do you have some example of this in mind? so far, noone has actually produced an example of a "high end" server that has long runqueues. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Lse-tech] Re: a quest for a better scheduler
Correct, that's true. Our patch does various things. (a) limit search for a task to a admin specified set of cpu's during schedule().. (b) limits search for a preemptable task to another set of cpu's during reschedule_idle() (c) loadbalancing, i.e. moving from queue to queue. Currently we balance within a set and across sets. Obviously in NUMA one could specify (a) such that multiple sets fall into the same node no node crossings. (b) specify this set to at least span a node (c) do some intelligent moving based on memory maps etc. I guess (c) would be first instance on where to plug architecture dependent information, e.g. how much memory footprint does a task have on a particular node and how much would the moving cost. The loadbalance we provide is a simple sceleton to tickle you mind, not a solution. Nevertheless, one can see it can have some impact. See for results for various combinations of poolsizes and balancings: http://lse.sourceforge.net/scheduling/results012501/status.html#Load%20Balancing Hubertus Franke Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability) email: [EMAIL PROTECTED] (w) 914-945-2003(fax) 914-945-4425 TL: 862-2003 Kanoj Sarcar <[EMAIL PROTECTED]> on 04/04/2001 01:14:28 PM To: Hubertus Franke/Watson/IBM@IBMUS cc: [EMAIL PROTECTED] (Linux Kernel List), [EMAIL PROTECTED] Subject: Re: [Lse-tech] Re: a quest for a better scheduler > > > > Kanoj, our cpu-pooling + loadbalancing allows you to do that. > The system adminstrator can specify at runtime through a > /proc filesystem interface the cpu-pool-size, whether loadbalacing > should take place. Yes, I think this approach can support the various requirements put on the scheduler. I think there are two degrees of freedom that are needed in the scheduler. One, as you say, for the sysadmin to be able to specify what overall scheduler behavior he wants. Secondly, from the kernel standpoint, there needs to be perarch hooks, to be able to utilize nodelevel/multilevel caches, NUMA aspects etc. Kanoj - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
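As a purely illustrative data-structure sketch (not the actual code behind the URLs above), the pooling idea boils down to partitioning CPUs into admin-defined sets and restricting the schedule()/reschedule_idle() scans to the local set:

	#define MAX_POOLS	8

	struct cpu_pool {
		unsigned long	cpu_mask;	/* member CPUs of this pool */
		int		nr_cpus;
	};

	static struct cpu_pool pools[MAX_POOLS];	/* sized via /proc at runtime */
	static int pool_of_cpu[NR_CPUS];		/* pool index per CPU */

	/* schedule() on this_cpu would only walk runqueues of CPUs for which
	 * this returns true; the load balancer moves tasks across pools. */
	static inline int same_pool(int this_cpu, int other_cpu)
	{
		return pool_of_cpu[this_cpu] == pool_of_cpu[other_cpu];
	}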
Re: a quest for a better scheduler
Well put, this how we can eliminate searching all bins or lists and that's how we do it under. http://lse.sourceforge.net/scheduling/2.4.1-pre8-prioSched. If you have a list per priority level, then you can even pick the first one you find if its on the same level. That's what I tried in a more recent implementation. Also combined that with using a bitmask to represent non-empty tasks. Hubertus Franke Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability) , OS-PIC (Chair) email: [EMAIL PROTECTED] (w) 914-945-2003(fax) 914-945-4425 TL: 862-2003 Davide Libenzi <[EMAIL PROTECTED]>@ewok.dev.mycio.com on 04/04/2001 12:12:54 PM Sent by: [EMAIL PROTECTED] To: Ingo Molnar <[EMAIL PROTECTED]> cc: Linus Torvalds <[EMAIL PROTECTED]>, Alan Cox <[EMAIL PROTECTED]>, Linux Kernel List <[EMAIL PROTECTED]>, Hubertus Franke/Watson/IBM@IBMUS, Mike Kravetz <[EMAIL PROTECTED]>, Fabio Riccardi <[EMAIL PROTECTED]> Subject: Re: a quest for a better scheduler On 04-Apr-2001 Ingo Molnar wrote: > > On Tue, 3 Apr 2001, Fabio Riccardi wrote: > >> I've spent my afternoon running some benchmarks to see if MQ patches >> would degrade performance in the "normal case". > > no doubt priority-queue can run almost as fast as the current scheduler. > What i'm worried about is the restriction of the 'priority' of processes, > it cannot depend on previous processes (and other 'current state') > anymore. > > to so we have two separate issues: > >#1: priority-queue: has the fundamental goodness() design limitation. > >#2: per-CPU-runqueues: changes semantics, makes scheduler less > effective due to nonglobal decisions. > > about #1: while right now the prev->mm rule appears to be a tiny issue (it > might not affect performance significantly), but forbidding it by > hardcoding the assumption into data structures is a limitation of *future* > goodness() functions. Eg. with the possible emergence of CPU-level > threading and other, new multiprocessing technologies, this could be a > *big* mistake. This is not correct Ingo. I haven't seen the HP code but if You store processes in slots S : S = FS( goodness(p, p->processor, p->mm) ) and You start scanning from the higher slots, as soon as you find a task with a goodness G' that is equal to the max goodness in slot You can choose that process to run. Again, if You haven't found such a goodness during the slot scan but You've found a task with a goodness G' : G' >= SG - DD where : SG = max slot goodness DD = SG(i) - SG(i - 1) You can select that task as the next to spin. This was the behaviour that was implemented in my scheduler patch about 2 years ago. Beside this, I this that with such loads We've more serious problem to face with inside the kernel. - Davide - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
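A minimal sketch of the per-priority-list-plus-bitmask idea mentioned above (illustration only, not the lse.sourceforge.net patch itself): one list per goodness level and one bit per non-empty list, so picking a runnable task means finding the highest set bit instead of walking every task.

	#include <linux/list.h>

	#define NR_LEVELS	32	/* one goodness/priority level per bit */

	struct prio_runqueue {
		unsigned long	 nonempty;		/* bit i set => queue[i] has tasks */
		struct list_head queue[NR_LEVELS];
	};

	/* Highest non-empty level, or -1 if there is no runnable task at all. */
	static inline int highest_level(struct prio_runqueue *rq)
	{
		int i;

		for (i = NR_LEVELS - 1; i >= 0; i--)
			if (rq->nonempty & (1UL << i))
				return i;
		return -1;
	}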
Re: [Lse-tech] Re: a quest for a better scheduler
Kanoj, our cpu-pooling + loadbalancing allows you to do that. The system adminstrator can specify at runtime through a /proc filesystem interface the cpu-pool-size, whether loadbalacing should take place. We can put limiting to the local cpu-set during reschedule_idle back into the code, to make it complete and compatible with the approach that Andrea has taken. This way, one can fully isolate or combine cpu-sets. here is the code for the pooling. http://lse.sourceforge.net/scheduling/LB/2.4.1-MQpool loadbalancing and /proc system combined in this module. http://lse.sourceforge.net/scheduling/LB/loadbalance.c a writeup explaining this concept is available under http://lse.sourceforge.net/scheduling/LB/poolMQ.html Prerequisite is the MQ scheduler... http://lse.sourceforge.net/scheduling/2.4.1.mq1-sched We need to update these for 2.4.3 (coming) Hubertus Franke Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability) , OS-PIC (Chair) email: [EMAIL PROTECTED] (w) 914-945-2003(fax) 914-945-4425 TL: 862-2003 Kanoj Sarcar <[EMAIL PROTECTED]>@lists.sourceforge.net on 04/04/2001 12:50:58 PM Sent by: [EMAIL PROTECTED] To: [EMAIL PROTECTED] (Andrea Arcangeli) cc: [EMAIL PROTECTED] (Ingo Molnar), Hubertus Franke/Watson/IBM@IBMUS, [EMAIL PROTECTED] (Mike Kravetz), [EMAIL PROTECTED] (Fabio Riccardi), [EMAIL PROTECTED] (Linux Kernel List), [EMAIL PROTECTED] Subject: Re: [Lse-tech] Re: a quest for a better scheduler > > I didn't seen anything from Kanoj but I did something myself for the wildfire: > > ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.3aa1/10_numa-sched-1 > > this is mostly an userspace issue, not really intended as a kernel optimization > (however it's also partly a kernel optimization). Basically it splits the load > of the numa machine into per-node load, there can be unbalanced load across the > nodes but fairness is guaranteed inside each node. It's not extremely well > tested but benchmarks were ok and it is at least certainly stable. > Just a quick comment. Andrea, unless your machine has some hardware that imply pernode runqueues will help (nodelevel caches etc), I fail to understand how this is helping you ... here's a simple theory though. If your system is lightly loaded, your pernode queues are actually implementing some sort of affinity, making sure processes stick to cpus on nodes where they have allocated most of their memory on. I am not sure what the situation will be under huge loads though. As I have mentioned to some people before, percpu/pernode/percpuset/global runqueues probably all have their advantages and disadvantages, and their own sweet spots. Wouldn't it be really neat if a system administrator or performance expert could pick and choose what scheduler behavior he wants, based on how the system is going to be used? Kanoj ___ Lse-tech mailing list [EMAIL PROTECTED] http://lists.sourceforge.net/lists/listinfo/lse-tech - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: a quest for a better scheduler
You imply that high end means thousands of processes, simply because we have shown that in our graphs as an asymptotic end. No, it could mean 5*#cpus, and that is not all that absurd; this could happen with a spike in demand.

TUX is not the greatest example to use, because it does static webpage serving and is hence tied into the file cache. If you move up the food chain, where middleware is active and things are a bit more complicated than sending stuff out of the filecache, having a bunch of threads hanging around to deal with the spike in demand is common practice, although you think it's bad.

Now coming again to MQ (forget about the priority list from now on). When we scan either the local or the realtime queues we do use the goodness value, so we have the same flexibility as the current scheduler when it comes to goodness() flexibility and future improvements. For remote stealing we use na_goodness to compare for a better process to steal. Hence we would lose the "+1" information here; nevertheless, once a decision has been made, we still use preemption verification with goodness. Either way, being off by "+1" for regular tasks once in a while is no big deal, because this problem already exists today. While walking the runqueue, another processor can either update the counter value of a task (ok, that's covered by can_schedule) or can run recalculate, which makes the decision one is about to make wrong from the point of always running the best. But that's ok, because counter, nice etc. are approximations anyway. Throw in PROC_CHANGE_PENALTY and you have a few knobs that are used to control interactivity and throughput.

If you look at some of the results with our reflex benchmark for low thread counts, we basically show improved performance on the 2, 4, 5, 6, 7 and 8 way systems; if #threads < #cpus, they all show improvements. Look at the following numbers from the reflex benchmark; the leftmost column is the number of threads, with typically 1/2 of them runnable. You can clearly see that the priority list suffers from overhead, but MQ is beating vanilla pretty much everywhere. Again, this is because of rapid scheduler invocation and the resulting lock contention.

2-way   2.4.1   2.4.1-mq1   2.4.1-prbit
 4      6.725   4.691       7.387
 8      6.326   4.766       6.421
12      6.838   5.233       6.431
16      7.13    5.415       7.29

4-way   2.4.1   2.4.1-mq1   2.4.1-prbit
 4      9.42    7.873       10.592
 8      8.143   3.799       8.691
12      7.877   3.537       8.101
16      7.688   2.953       7.144

6-way   2.4.1   2.4.1-mq1   2.4.1-prbit
 4      9.595   7.88        10.358
 8      9.703   7.278       10.523
12      10.016  4.652       10.985
16      9.882   3.629       10.525

8-way   2.4.1   2.4.1-mq1   2.4.1-prbit
 4      9.804   8.033       10.548
 8      10.436  5.783       11.475
12      10.925  6.787       11.646
16      11.426  5.048       11.877
20      11.438  3.895       11.633
24      11.457  3.304       11.347
28      11.495  3.073       11.09
32      11.53   2.944       10.898

Hubertus Franke
Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability)
email: [EMAIL PROTECTED]
(w) 914-945-2003 (fax) 914-945-4425 TL: 862-2003

Ingo Molnar <[EMAIL PROTECTED]>@elte.hu> on 04/04/2001 09:23:34 AM
Please respond to <[EMAIL PROTECTED]>
Sent by: <[EMAIL PROTECTED]>
To: Hubertus Franke/Watson/IBM@IBMUS
cc: Mike Kravetz <[EMAIL PROTECTED]>, Fabio Riccardi <[EMAIL PROTECTED]>, Linux Kernel List <[EMAIL PROTECTED]>
Subject: Re: a quest for a better scheduler

On Wed, 4 Apr 2001, Hubertus Franke wrote:

> I understand the dilemma that the Linux scheduler is in, namely
> satisfy the low end at all cost. [...]

nope. The goal is to satisfy runnable processes in the range of NR_CPUS. You are playing word games by suggesting that the current behavior prefers 'low end'.
'thousands of runnable processes' is not 'high end' at all, it's 'broken end'. Thousands of runnable processes are the sign of a broken application design, and 'fixing' the scheduler to perform better in that case is just fixing the symptom. [changing the scheduler to perform better in such situations is possible too, but all solutions proposed so far had strings attached.] Ingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
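For readers following the na_goodness discussion above, here is a rough, purely illustrative sketch of the two-step MQ decision; the real patch lives at the lse.sourceforge.net URLs posted elsewhere in this thread, and the names below are made up.

	struct mq_runqueue {
		spinlock_t	 lock;
		struct list_head tasks;
		int		 max_na_goodness;	/* best "stealable" goodness, a hint only */
	};

	extern struct mq_runqueue mq_rq[NR_CPUS];

	/* Phase 1 (not shown): pick the best local task under this CPU's lock,
	 * using the normal goodness() rules.  Phase 2: peek lock-free at the
	 * other CPUs' advertised hints; a better hint is re-verified under that
	 * CPU's lock before stealing, so a stale hint costs a retry, not
	 * correctness. */
	static int best_remote_hint(int this_cpu, int local_best_goodness)
	{
		int cpu, best_cpu = -1, best = local_best_goodness;

		for (cpu = 0; cpu < smp_num_cpus; cpu++) {
			if (cpu == this_cpu)
				continue;
			if (mq_rq[cpu].max_na_goodness > best) {
				best = mq_rq[cpu].max_na_goodness;
				best_cpu = cpu;
			}
		}
		return best_cpu;
	}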
Re: a quest for a better scheduler
Yes, Andrea. We actually already went a step further. We treat the scheduler as a single entity, rather than splitting it up. Based on the MQ scheduler we do the balancing across all nodes at reschedule_idle time. We experimented to see whether only looking for idle tasks remotely is a good idea, but it bloats the code. Local scheduling decisions are limited to a set of cpus, which could coincide with the cpus of one node, or if desirable on smaller granularities. In addition we implemented a global load balancing scheme that ensures that load is equally distributed across all run queues. This is made a loadable module, so you can plug in what ever you want. I grant in NUMA it might actually be desirable to separate schedulers completely (we can do that trivially in reschedule_idle), but you need loadbalancing at some point. Here is the list of patches: MultiQueue Scheduler: http://lse.sourceforge.net/scheduling/2.4.1.mq1-sched Pooling Extension:http://lse.sourceforge.net/scheduling/LB/2.4.1-MQpool LoadBalancing: http://lse.sourceforge.net/scheduling/LB/loadbalance.c Hubertus Franke Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability) , OS-PIC (Chair) email: [EMAIL PROTECTED] (w) 914-945-2003(fax) 914-945-4425 TL: 862-2003 Andrea Arcangeli <[EMAIL PROTECTED]> on 04/04/2001 11:08:47 AM To: Ingo Molnar <[EMAIL PROTECTED]> cc: Hubertus Franke/Watson/IBM@IBMUS, Mike Kravetz <[EMAIL PROTECTED]>, Fabio Riccardi <[EMAIL PROTECTED]>, Linux Kernel List <[EMAIL PROTECTED]>, [EMAIL PROTECTED] Subject: Re: a quest for a better scheduler On Wed, Apr 04, 2001 at 03:34:22PM +0200, Ingo Molnar wrote: > > On Wed, 4 Apr 2001, Hubertus Franke wrote: > > > Another point to raise is that the current scheduler does a exhaustive > > search for the "best" task to run. It touches every process in the > > runqueue. this is ok if the runqueue length is limited to a very small > > multiple of the #cpus. [...] > > indeed. The current scheduler handles UP and SMP systems, up to 32 > (perhaps 64) CPUs efficiently. Agressively NUMA systems need a different > approach anyway in many other subsystems too, Kanoj is doing some > scheduler work in that area. I didn't seen anything from Kanoj but I did something myself for the wildfire: ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.3aa1/10_numa-sched-1 this is mostly an userspace issue, not really intended as a kernel optimization (however it's also partly a kernel optimization). Basically it splits the load of the numa machine into per-node load, there can be unbalanced load across the nodes but fairness is guaranteed inside each node. It's not extremely well tested but benchmarks were ok and it is at least certainly stable. However Ingo consider that in a 32-way if you don't have at least 32 tasks running all the time _always_ you're really stupid paying such big money for nothing ;). So the fact the scheduler is optimized for 1/2 tasks running all the time is not nearly enough for those machines (and of course also the scheduling rate automatically increases linearly with the increase of the number of cpus). Now it's perfectly fine that we don't ask the embedded and desktop guys to pay for that, but a kernel configuration option to select an algorithm that scales would be a good idea IMHO. The above patch just adds a CONFIG_NUMA_SCHED. The scalable algorithm can fit into it and nobody will be hurted by that (CONFIG_NUMA_SCHED cannot even be selected by x86 compiles). 
Andrea - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: a quest for a better scheduler
I grant you that the code is not as clean as the current scheduler, so maybe you missed that part. For the priority scheduler: Yes the task_to_qid assumes a NON-affinity (no cpu, no mm) to determine the list index to where the task has to be enqueued. However, if you wonder down to the search_range section of the scheduler, you see that we do add the "+1" for equal mm. All I did here was take the goodness() function a part for search_range in order to insert some extra code that speeds up the code. Again, I don't want to lean to hard on the priority scheduler, because we only did this to have a reference point regarding lock contention and lock hold. This is stuff that many operating systems did years ago, most of which have moved on to add MQ to that. So that is what the LSE scheduling team is pushing. I understand the dilemma that the Linux scheduler is in, namely satisfy the low end at all cost. But I believe that in order for Linux to push into the high end we need to address the scalability of the current scheduler. Simply handwaving and declaring that lots of running tasks is a stupid idea doesn't get us there. If indeed you assume that there #running-tasks ~ #cpus then each task should alot a reasonable amount of work and any small overhead incurred should be amortizable. However as we contend if #running-tasks >> #cpus is a situation we need to deal with, the living with the current lock contention can really drag us down. Hubertus Franke Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability) email: [EMAIL PROTECTED] (w) 914-945-2003(fax) 914-945-4425 TL: 862-2003 Ingo Molnar <[EMAIL PROTECTED]>@elte.hu> on 04/04/2001 02:28:31 AM Please respond to <[EMAIL PROTECTED]> Sent by: <[EMAIL PROTECTED]> To: Mike Kravetz <[EMAIL PROTECTED]> cc: Hubertus Franke/Watson/IBM@IBMUS, Fabio Riccardi <[EMAIL PROTECTED]>, Linux Kernel List <[EMAIL PROTECTED]> Subject: Re: a quest for a better scheduler On Tue, 3 Apr 2001, Mike Kravetz wrote: > Our 'priority queue' implementation uses almost the same goodness > function as the current scheduler. The main difference between our > 'priority queue' scheduler and the current scheduler is the structure > of the runqueue. We break up the single runqueue into a set of > priority based runqueues. The 'goodness' of a task determines what > sub-queue a task is placed in. Tasks with higher goodness values are > placed in higher priority queues than tasks with lower goodness > values. [...] we are talking about the same thing, re-read my mail. this design assumes that 'goodness' is constant in the sense that it does not depend on the previous process. and no, your are not using the same goodness() function, you omitted the prev->mm == next->mm component to goodness(), due to this design limitation. Ingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
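Since several messages in this thread hinge on exactly which ingredients go into goodness(), here is a simplified paraphrase of the 2.4 calculation for non-realtime tasks (a sketch from memory, not the verbatim kernel source):

	static inline int goodness_sketch(struct task_struct *p, int this_cpu,
					  struct mm_struct *this_mm)
	{
		int weight = p->counter;		/* remaining timeslice */

		if (!weight)
			return 0;			/* exhausted: wait for recalculate */
	#ifdef CONFIG_SMP
		if (p->processor == this_cpu)
			weight += PROC_CHANGE_PENALTY;	/* CPU affinity bonus */
	#endif
		if (p->mm == this_mm || !p->mm)
			weight += 1;			/* the "+1 same mm" tie-breaker */
		weight += 20 - p->nice;
		return weight;
	}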
Re: a quest for a better scheduler
This is an important point that Mike is raising and it also addresses a critique that Ingo issued yesterday, namely interactivity and fairness. The HP scheduler completely separates the per-CPU runqueues and does not take preemption goodness or alike into account. This can lead to unfair proportionment of CPU cycles, strong priority inversion and a potential lack of interactivity. Our MQ scheduler does yield the same decision in most cases (other than defined by some race condition on locks and counter members) It is not clear that yielding the same decision as the current scheduler is the ultimate goal to shoot for, but it allows comparision. Another point to raise is that the current scheduler does a exhaustive search for the "best" task to run. It touches every process in the runqueue. this is ok if the runqueue length is limited to a very small multiple of the #cpus. But that is not what high end server systems encounter. With the rising number of processors, lock contention can quickly become a bottleneck. If we assume that load (#running-task) increases somewhat linear with #cpus, the problem gets even worth because not only have I increased the number of clients but also the lock hold time. Clinging on to the statement that #running-tasks ~ #cpus, ofcourse saves you from that dilemma, but not everybody is signing on to this limitation. MQ and priority-list help in 2 ways. MQ reduces the average lock holdtime because on average it only inspects #running-tasks/#cpus tasks to make a local decision. It then goes on to inspect (#cpus-1) datastructures representing the next best to run tasks on those remote cpus all without holding the lock, thus substantially reducing lock contention. Note we still search the entire set of runnable tasks, however we do it in a distributed collaborative manner. The only time we deviate from the current scheduler decision is in the case when two cpus have identified the same remote task as a target for remote stealing. In that case one will win and the other cpu will continue looking somewhere else, although there might have been another tasks on that cpu to steal. priority list schedulers (PRS) only helps in reducing lock hold time, which can result in some relieve wrt lock contention, but not a whole lot. PRS can limit the lists it has to search based on the PROC_CHANGE_PENALTY. It also keeps 0-counter in a list that is never inspected. One can even go further and put YIELD tasks in a separate list, given that the sys_sched_yield already does some optimizations. The older version (12/00) posted on LSE is functionally equivalent to the current scheduler. I will put up another version this week that is based on a bitmask and which is a bit more agressive in that it ignores the MM goodness boost of 1 which in my books is merely a tie breaker between two task of equal goodness. Beyond that we have done some work on cpu pooling, which is to identify a set of cpus that form a scheduling set. We still consider in reschedule_idle all cpus for preemption but in schedule it is sufficient to only schedule within the own set. That again can limit lock hold time with MQ and we have seen some improvements. We also deploy load balacing. To summarize, we have taken great care of trying to preserve the current scheduler semantics and have laid out a path to relax some of the semantics for further improvements. 
I don't believe that the HP scheduler is sufficient since it is lacking load balacing, which naturally occurs in our MQ scheduler, and it lacks the interactivity requirements that Ingo pointed out. Most of these things are discussed in great detail in the writeups under lse.sourceforge.net/scheduling. I also believe we show there that the MQ performance for low thread counts is also matching the vanilla case. I further don't understand the obsession of keeping the scheduler simple. If there are improvements and I don't believe the MQ is all that complicated and its well documented, why not put it in. Hubertus Franke Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability) email: [EMAIL PROTECTED] (w) 914-945-2003(fax) 914-945-4425 TL: 862-2003 Mike Kravetz <[EMAIL PROTECTED]> on 04/03/2001 10:47:00 PM To: Fabio Riccardi <[EMAIL PROTECTED]> cc: Mike Kravetz <[EMAIL PROTECTED]>, Ingo Molnar <[EMAIL PROTECTED]>, Hubertus Franke/Watson/IBM@IBMUS, Linux Kernel List <[EMAIL PROTECTED]>, Alan Cox <[EMAIL PROTECTED]> Subject: Re: a quest for a better scheduler On Tue, Apr 03, 2001 at 05:18:03PM -0700, Fabio Riccardi wrote: > > I have measured the HP and not the "scalability" patch because the two do more > or less the same thing and give me the same performance advantages, but the > former is a lot simpler and I could port it with no effort on any recent > kernel. Actually, there is a significant difference between the HP patch a
Re: [Lse-tech] more on scheduler benchmarks
Mike, Deactivating that optimization is a good idea. What we are interested in is what the general latency of the scheduler code is. This should help to determine that. The only problem I have with sched_yield like benchmarks is that it creates artificial lock contention as we basically spent most of the time other then context switching + syscall under the scheduler lock. This we won't see in real apps, that's why I think the chatroom numbers are probably better indicators. Hubertus Franke Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability) , OS-PIC (Chair) email: [EMAIL PROTECTED] (w) 914-945-2003(fax) 914-945-4425 TL: 862-2003 Mike Kravetz <[EMAIL PROTECTED]>@lists.sourceforge.net on 01/22/2001 01:17:38 PM Sent by: [EMAIL PROTECTED] To: [EMAIL PROTECTED] cc: [EMAIL PROTECTED], Ingo Molnar <[EMAIL PROTECTED]> Subject: [Lse-tech] more on scheduler benchmarks Last week while discussing scheduler benchmarks, Bill Hartner made a comment something like the following "the benchmark may not even be invoking the scheduler as you expect". This comment did not fully sink in until this weekend when I started thinking about changes made to sched_yield() in 2.4.0. (I'm cc'ing Ingo Molnar because I think he was involved in the changes). If you haven't taken a look at sys_sched_yield() in 2.4.0, I suggest that you do that now. A result of new optimizations made to sys_sched_yield() is that calling sched_yield() does not result in a 'reschedule' if there are no tasks waiting for CPU resources. Therefore, I would claim that running 'scheduler benchmarks' which loop doing sched_yield() seem to have little meaning/value for runs where the number of looping tasks is less than then number of CPUs in the system. Is that an accurate statement? If the above is accurate, then I am wondering what would be a good scheduler benchmark for these low task count situations. I could undo the optimizations in sys_sched_yield() (for testing purposes only!), and run the existing benchmarks. Can anyone suggest a better solution? Thanks, -- Mike Kravetz [EMAIL PROTECTED] IBM Linux Technology Center 15450 SW Koll Parkway Beaverton, OR 97006-6063 (503)578-3494 ___ Lse-tech mailing list [EMAIL PROTECTED] http://lists.sourceforge.net/lists/listinfo/lse-tech - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [Lse-tech] Re: multi-queue scheduler update
Per popular demand. Here are a few numbers for small thread counts running the sched_yield_test benchmark on a 2-way SMP with the following characteristics.

model name : Pentium III (Katmai)
stepping   : 3
cpu MHz    : 551.266
cache size : 512 KB

I compare 2.4.1-pre8 kernels (vanilla, table/prio scheduler and multiqueue).

#T :  van    prio   MQ
-------------------------
 1 :  0.591  0.582  0.750
 2 :  0.295  0.293  0.377
 3 :  2.091  2.373  1.010
 4 :  1.894  1.783  1.558
 5 :  1.949  1.794  1.591
 6 :  2.003  1.803  1.605
 7 :  2.050  1.805  1.654
 8 :  2.118  1.816  1.676
 9 :  2.174  1.811  1.708
10 :  2.235  1.821  1.744
11 :  2.304  1.823  1.780
12 :  2.365  1.831  1.863
13 :  2.427  1.829  1.870
14 :  2.494  1.841  1.950
15 :  2.578  1.839  1.959
16 :  2.691  1.865  2.043
17 :  2.804  1.855  2.041
18 :  2.893  1.873  2.127
19 :  3.001  1.851  2.079
20 :  3.098  1.878  2.182
21 :  3.191  1.851  2.178
22 :  3.263  1.884  2.233
23 :  3.332  1.850  2.231
24 :  3.403  1.901  2.272
25 :  3.472  1.865  2.251
26 :  3.540  1.923  2.305
27 :  3.604  1.872  2.295
28 :  3.680  1.900  2.333
29 :  4.204  1.883  2.329
30 :  4.256  1.944  2.358
31 :  3.875  1.936  2.325
32 :  4.476  1.953  2.339

Hubertus Franke
Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability), OS-PIC (Chair)
email: [EMAIL PROTECTED]
(w) 914-945-2003 (fax) 914-945-4425 TL: 862-2003

Andrea Arcangeli <[EMAIL PROTECTED]>@lists.sourceforge.net on 01/18/2001 08:30:41 PM
Sent by: [EMAIL PROTECTED]
To: Mike Kravetz <[EMAIL PROTECTED]>
cc: [EMAIL PROTECTED], [EMAIL PROTECTED]
Subject: Re: [Lse-tech] Re: multi-queue scheduler update

On Thu, Jan 18, 2001 at 04:52:25PM -0800, Mike Kravetz wrote:
> was less than the number of processors. I'll give the tests a try
> with a smaller number of threads. I'm also open to suggestions for

OK!

> what benchmarks/test methods I could use for scheduler testing. If
> you remember what people have used in the past, please let me know.

It was this one IIRC (it spawns threads calling sched_yield() in loop).

/* Tester for the kernel's speed in scheduling.
   (C) 1999 / Willy Tarreau <[EMAIL PROTECTED]>
   Modified by Davide Libenzi <[EMAIL PROTECTED]>

   You can do whatever you want with this program, but I'm not responsible
   for any misuse. Be aware that it can heavily load a host. As it is
   multithreaded, it might take advantages of SMP.

   It basically creates a growing amount of threads and measures their
   cumulative work (i.e. loop iterations/second). The output is easily
   useable by gnuplot.
   To compile, you need libpthread :

   gcc -O2 -fomit-frame-pointer -o threads threads.c -lpthread

   Output on stdout is :
*/

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <time.h>
#include <pthread.h>

#define MAXTHREADS	450
#define MEASURE_TIME	60

pthread_t	thr[MAXTHREADS];
int		nbthreads = MAXTHREADS;
int		measure_time = MEASURE_TIME;
volatile int	actthreads = 0;

long long int	totalwork[MAXTHREADS];
volatile int	stop = 0, start = 0, count = 0;

void oneatwork(int thr)
{
	int i;

	while (!start)		/* don't disturb pthread_create() */
		usleep(1);

	actthreads++;
	while (!stop) {
		if (count)
			totalwork[thr]++;
		syscall(158);	/* sys_sched_yield() */
	}
	actthreads--;
	pthread_exit(0);
}

main(int argc, char **argv)
{
	int i, err, avgwork, thrzero;
	long long int value, avgvalue;
	double sqrdev;
	time_t ts, te;

	if (argc < 3) {
		printf("usage: %s threads time\n", argv[0]);
		exit(1);
	}

	nbthreads = atoi(argv[1]);
	measure_time = atoi(argv[2]);

	start = 0;
	count = 0;
	stop = 0;
	actthreads = 0;
	thrzero = 0;
	value = 0;
	sqrdev = 0.0;

	fprintf(stderr, "\nCreating %d threads ...", nbthreads);
	for (i = 0; i < nbthreads; i++) {
		if ((err = pthread_create(&thr[i], NULL, (void *) &oneatwork,
					  (void *) i)) != 0) {
			fprintf(stderr, "thread %d pthread_create=%d -> ", i, err);
			perror("");
			exit(1);
		}
		pthread_detach(thr[i]);
	}

	for (i = 0; i < nbthreads; i++)
		totalwork[i] = 0;

	fprintf(stderr, " OK !\nWaiting for all threads to start ...");

	start = 1;
	while (actthreads != nbthreads)
		usleep(1);	/* waiting for a bit of stability */

	fprintf(stderr, "Go !\n");

	count = 1;
	time(&ts);
	sleep(measure_time);
	count = 0;
	stop = 1;
	time(&te);

	for (i = 0; i < nbthreads; i++) {
		value += totalwork[i];
		if (totalwork[i] == 0)
			++thrzero;
	}
	avgvalue = value / nbthreads
Re: [Lse-tech] Re: multi-queue scheduler update
Mike, sounds good. We will do all our measurements from now on with thread counts over the entire range from 1 to 16, then in powers of two up to 2048, and for maxcpus=1,2,4,6,8. Do you think that 4096 is overkill? So far the numbers you got and we got over here are the same. Andi suggested that it has some problems with IO scheduling. You are right, "hopefully + ballpark" ~= 10%. As for intelligent decisions, the general load balancing that we already started might help out a bit here. Other stuff we could look into: remember we talked about counting active idle threads at some point.

if (active_idle_threads < smp_num_cpus) {
        /* now we know that we simply give it to the first idle_thread found,
         * instead of collecting the max_na_goodness value and somewhat sorting
         * through it similar to the current Vanilla algorithm */
} else {
        /* current MQ algorithm */
}

(A fleshed-out sketch of this fast path follows after this message.) Just shooting from the hip here, let's restart this discussion. Hubertus Franke Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability), OS-PIC (Chair) email: [EMAIL PROTECTED] (w) 914-945-2003 (fax) 914-945-4425 TL: 862-2003 Mike Kravetz <[EMAIL PROTECTED]> on 01/19/2001 12:11:04 PM To: Hubertus Franke/Watson/IBM@IBMUS cc: [EMAIL PROTECTED], [EMAIL PROTECTED] Subject: Re: [Lse-tech] Re: multi-queue scheduler update On Fri, Jan 19, 2001 at 10:47:06AM -0500, Hubertus Franke wrote: > What you can see from these numbers is that MQ does an awesome job up to > 1024 threads. When measuring in the future, we will take from now on the > general concern about low number of threads into account. Your points are > well taken. I'm pretty confident our MQ scheduler will be in a reasonable > ballpark of the current scheduler. Hubertus, 'Hopefully' the multi-queue scheduler will be in the ballpark for low numbers of threads. However, remember the extra overhead being incurred in the current implementation. To maintain existing scheduler behavior, we look at all CPU-specific runqueues to find the highest priority (goodness) task in the system. Therefore, when running with a single thread on an 8 processor system, we examine 8 runqueues instead of the single global runqueue. In a test where tasks are simply spinning doing sched_yield()s, I suspect this difference may be significant. I'll run the IIRC benchmark with low thread counts, and post the results. In addition, I have some ideas on how to make intelligent decisions to avoid examining all runqueues when the number of running tasks is less than the number of processors. -- Mike Kravetz [EMAIL PROTECTED] IBM Linux Technology Center - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
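A minimal user-space sketch of the idle-CPU shortcut discussed above. The names (active_idle_cpus, cpu_rq, max_na_goodness, pick_cpu_for_wakeup) and the goodness numbers are invented stand-ins rather than the LSE MQ patch code; the point is only the shape of the shortcut: if some CPU is idle, hand the woken task to the first idle CPU and skip the full max-goodness scan over every per-CPU runqueue.

/* Hypothetical sketch of the "idle CPU fast path" idea, not the LSE MQ code. */
#include <stdio.h>

#define NR_CPUS 8

struct cpu_rq {
    int idle;            /* 1 if this CPU is currently in cpu_idle() */
    int max_na_goodness; /* best non-affinity goodness currently queued */
};

static struct cpu_rq rq[NR_CPUS];
static int active_idle_cpus;     /* maintained on idle entry/exit */

/* Pick a CPU for a task that just became runnable. */
static int pick_cpu_for_wakeup(int task_goodness)
{
    int cpu, best = -1, best_delta = 0;

    if (active_idle_cpus > 0) {
        /* Fast path: any idle CPU will do, no need to sort by goodness. */
        for (cpu = 0; cpu < NR_CPUS; cpu++)
            if (rq[cpu].idle)
                return cpu;
    }

    /* Slow path (roughly what the current MQ algorithm has to do):
     * scan every runqueue for the CPU whose best queued goodness we
     * beat by the largest margin. */
    for (cpu = 0; cpu < NR_CPUS; cpu++) {
        int delta = task_goodness - rq[cpu].max_na_goodness;
        if (delta > best_delta) {
            best_delta = delta;
            best = cpu;
        }
    }
    return best;                 /* -1 means: don't preempt anybody */
}

int main(void)
{
    rq[3].idle = 1;
    active_idle_cpus = 1;
    printf("wakeup goes to cpu %d\n", pick_cpu_for_wakeup(10));
    return 0;
}

The fast path only needs a "some CPU is idle" counter, which is exactly the active-idle-threads count being discussed; the slow path is the part that has to look at every runqueue.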
Re: [Lse-tech] Re: multi-queue scheduler update
In the sched_yield benchmark case this is not a problem, because the threads don't have any memory footprint; they are all cloned. The chatroom, I agree with you. However, I assume that these big irons (8-ways) will be pretty much loaded with at least 1MB cache. Maybe at this point another site with an 8-way system and small cache could run this. I don't know whether those actually exist. Alternatively, we could set up a smaller 4-way system (we have a 4-way 300MHz P-II Xeon with 512KB cache) that would fit into your class, and we could also collect the numbers on those and post them. We are automating the reboot process right now, modifying lilo.conf so we can run many tests with different "maxcpus=.." unattended (an example lilo.conf fragment is sketched after this message). So little to do, so much time... ahhh make that so little time, so much to do. Hubertus Franke Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability), OS-PIC (Chair) email: [EMAIL PROTECTED] (w) 914-945-2003 (fax) 914-945-4425 TL: 862-2003 [EMAIL PROTECTED] on 01/19/2001 11:33:34 AM To: Hubertus Franke/Watson/IBM@IBMUS cc: David Lang <[EMAIL PROTECTED]>, Mike Kravetz <[EMAIL PROTECTED]>, Andrea Arcangeli <[EMAIL PROTECTED]>, [EMAIL PROTECTED], [EMAIL PROTECTED] Subject: Re: [Lse-tech] Re: multi-queue scheduler update You might want to rerun the tests with less cache heavy procs. The 2meg xeons you are using could distort things from what the average linux user would see (running with 256-512k cache). Nick On Fri, 19 Jan 2001, Hubertus Franke wrote: > > Sure, we are measuring that as well. > We are running all these benchmarks and configurations that I mentioned in > my previous message on > 1-2-4-6- and 8 way configurations. > We have posted some preliminary results on older kernels on the website: > > http://lse.sourceforge.net/scheduling/prelim.html > > MQ scheduler is meaningless for a UP kernel that is only build under the > SMP flag. > The priority==tablebased scheduler does make sense to run on a UP (i.e. not > SMP compiled) kernel. > Some more fine-tuning of the current code base might improve that case, > because affinity is not a concern > I can simply go to my top table hash, retrieve the first P entry with > !P->has_cpu and I am ready to go. > > Hubertus Franke > Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability) > , OS-PIC (Chair) > email: [EMAIL PROTECTED] > (w) 914-945-2003(fax) 914-945-4425 TL: 862-2003 > > > > David Lang <[EMAIL PROTECTED]>@lists.sourceforge.net on 01/19/2001 > 11:06:37 AM > > Sent by: [EMAIL PROTECTED] > > > To: Mike Kravetz <[EMAIL PROTECTED]> > cc: Andrea Arcangeli <[EMAIL PROTECTED]>, <[EMAIL PROTECTED]>, > <[EMAIL PROTECTED]> > Subject: Re: [Lse-tech] Re: multi-queue scheduler update > > > > another thing that would be interesting is what is the overhead on UP or > small (2-4 way) SMP machines > > David Lang > > On Thu, 18 Jan 2001, Mike Kravetz wrote: > > > Date: Thu, 18 Jan 2001 16:52:25 -0800 > > From: Mike Kravetz <[EMAIL PROTECTED]> > > To: Andrea Arcangeli <[EMAIL PROTECTED]> > > Cc: [EMAIL PROTECTED], [EMAIL PROTECTED] > > Subject: Re: [Lse-tech] Re: multi-queue scheduler update > > > > On Fri, Jan 19, 2001 at 01:26:16AM +0100, Andrea Arcangeli wrote: > > > On Thu, Jan 18, 2001 at 03:53:11PM -0800, Mike Kravetz wrote: > > > > Here are some very preliminary numbers from sched_test_yield > > > > (which was previously posted to this (lse-tech) list by Bill > > > > Hartner). Tests were run on a system with 8 700 MHz Pentium > > > > III processors.
> > > >
> > > >                       microseconds/yield
> > > > # threads    2.2.16-22      2.4    2.4-multi-queue
> > > > ---------    ---------    -----    ---------------
> > > >        16       18.740    4.603              1.455
> > >
> > > I remember the O(1) scheduler from Davide Libenzi was beating the mainline O(N) > > > scheduler with over 7 tasks in the runqueue (actually I'm not sure if the > > > number was 7 but certainly it was under 10). So if you also use a O(1) > > > scheduler too as I guess (since you have a chance to run fast on the lots of > > > tasks running case) the most interesting thing is how you score with 2/4/8 > > > tasks in the runqueue (I think the tests on the O(1) scheduler patch was done > > > at max on a 2-way SMP btw). (the argument for which Davide's patch wasn't > > > included is that most machines have less than 4/5 tasks in t
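The unattended "maxcpus=" cycling mentioned a couple of messages up basically comes down to keeping one lilo stanza per CPU count. The image path, root device and labels below are placeholders rather than the actual test setup:

# /etc/lilo.conf fragment (illustrative only; paths and labels are made up)
image=/boot/vmlinuz-2.4.1-pre8-mq
    label=mq-2cpu
    root=/dev/hda1
    read-only
    append="maxcpus=2"

image=/boot/vmlinuz-2.4.1-pre8-mq
    label=mq-8cpu
    root=/dev/hda1
    read-only
    append="maxcpus=8"

After rerunning /sbin/lilo, something like "lilo -R mq-2cpu" selects a stanza for the next boot only, which is what makes cycling through the configurations between reboots scriptable.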
Re: [Lse-tech] Re: multi-queue scheduler update
Sure, we are measuring that as well. We are running all these benchmarks and configurations that I mentioned in my previous message on 1-2-4-6- and 8 way configurations. We have posted some preliminary results on older kernels on the website: http://lse.sourceforge.net/scheduling/prelim.html The MQ scheduler is meaningless for a UP kernel that is only built under the SMP flag. The priority==tablebased scheduler does make sense to run on a UP (i.e. not SMP compiled) kernel. Some more fine-tuning of the current code base might improve that case, because affinity is not a concern: I can simply go to my top table hash, retrieve the first P entry with !P->has_cpu and I am ready to go. (A sketch of this lookup follows after this message.) Hubertus Franke Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability), OS-PIC (Chair) email: [EMAIL PROTECTED] (w) 914-945-2003 (fax) 914-945-4425 TL: 862-2003 David Lang <[EMAIL PROTECTED]>@lists.sourceforge.net on 01/19/2001 11:06:37 AM Sent by: [EMAIL PROTECTED] To: Mike Kravetz <[EMAIL PROTECTED]> cc: Andrea Arcangeli <[EMAIL PROTECTED]>, <[EMAIL PROTECTED]>, <[EMAIL PROTECTED]> Subject: Re: [Lse-tech] Re: multi-queue scheduler update another thing that would be interesting is what is the overhead on UP or small (2-4 way) SMP machines David Lang On Thu, 18 Jan 2001, Mike Kravetz wrote: > Date: Thu, 18 Jan 2001 16:52:25 -0800 > From: Mike Kravetz <[EMAIL PROTECTED]> > To: Andrea Arcangeli <[EMAIL PROTECTED]> > Cc: [EMAIL PROTECTED], [EMAIL PROTECTED] > Subject: Re: [Lse-tech] Re: multi-queue scheduler update > > On Fri, Jan 19, 2001 at 01:26:16AM +0100, Andrea Arcangeli wrote: > > On Thu, Jan 18, 2001 at 03:53:11PM -0800, Mike Kravetz wrote: > > > Here are some very preliminary numbers from sched_test_yield > > > (which was previously posted to this (lse-tech) list by Bill > > > Hartner). Tests were run on a system with 8 700 MHz Pentium > > > III processors.
> > >
> > >                      microseconds/yield
> > > # threads    2.2.16-22      2.4    2.4-multi-queue
> > > ---------    ---------    -----    ---------------
> > >        16       18.740    4.603              1.455
> >
> > I remember the O(1) scheduler from Davide Libenzi was beating the mainline O(N) > > scheduler with over 7 tasks in the runqueue (actually I'm not sure if the > > number was 7 but certainly it was under 10). So if you also use a O(1) > > scheduler too as I guess (since you have a chance to run fast on the lots of > > tasks running case) the most interesting thing is how you score with 2/4/8 > > tasks in the runqueue (I think the tests on the O(1) scheduler patch was done > > at max on a 2-way SMP btw). (the argument for which Davide's patch wasn't > > included is that most machines have less than 4/5 tasks in the runqueue at the > > same time) > > > > Andrea > > Thanks for the suggestion. The only reason I hesitated to test with > a small number of threads is because I was under the assumption that > this particular benchmark may have problems if the number of threads > was less than the number of processors. I'll give the tests a try > with a smaller number of threads. I'm also open to suggestions for > what benchmarks/test methods I could use for scheduler testing. If > you remember what people have used in the past, please let me know.
> > -- > Mike Kravetz [EMAIL PROTECTED] > IBM Linux Technology Center > - > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to [EMAIL PROTECTED] > Please read the FAQ at http://www.tux.org/lkml/ > ___ Lse-tech mailing list [EMAIL PROTECTED] http://lists.sourceforge.net/lists/listinfo/lse-tech - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
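A rough illustration of the table lookup Hubertus describes above ("go to my top table hash, retrieve the first P entry with !P->has_cpu"). The structure and field names are invented for illustration and this is plain user-space C, not the actual LSE table/priority scheduler patch: runnable tasks hang off buckets indexed by goodness, and the picker walks from the highest bucket down, taking the first entry that is not already running.

/* Sketch only: a goodness-indexed table from which the scheduler picks the
 * first runnable task that is not already on a CPU.  All names invented. */
#include <stdio.h>
#include <stddef.h>

#define MAX_GOODNESS 64

struct task {
    struct task *next;   /* next task in the same goodness bucket */
    int has_cpu;         /* nonzero if already running on some CPU */
    const char *name;
};

static struct task *prio_table[MAX_GOODNESS + 1];

static struct task *pick_next(void)
{
    int g;
    struct task *p;

    /* Walk buckets from best goodness down; in the UP case affinity is
     * not a concern, so the first entry without has_cpu set is good enough.
     * (An SMP variant could additionally prefer, within the top bucket, a
     * task whose last CPU is the invoking one before settling for this.) */
    for (g = MAX_GOODNESS; g >= 0; g--)
        for (p = prio_table[g]; p != NULL; p = p->next)
            if (!p->has_cpu)
                return p;
    return NULL;   /* nothing runnable: go idle */
}

int main(void)
{
    struct task a = { NULL, 1, "running" };
    struct task b = { NULL, 0, "waiting" };

    prio_table[40] = &a;   /* highest bucket, but already on a CPU */
    a.next = &b;           /* same bucket, runnable */

    printf("next: %s\n", pick_next()->name);
    return 0;
}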
Re: [Lse-tech] Re: multi-queue scheduler update
Indeed, Andi, we tried that priority==tablebased scheduler approach. If you check the call for participation again, what we are trying to do is to get to the bottom of what actually impacts scheduler performance and subsequently come up with a combined best breed (i.e. one that satisfies the high end and the low end). Since this is still work in progress, here are a few numbers that I got from running the 2.4.0-test12 kernels for vanilla and priority-based, complementing Mike's numbers. I add these as extra columns to Mike's table. Our machine is an 8-way 700 MHz Pentium III with 2MB caches, though I don't think it makes a difference for the sched_yield test. I ran with 50 seconds of runtime per test to get past the FRC problem.

                         microseconds/yield
 #threads   2.2.16-22       2.4    2.4-MQ   2.4.0-test12   2.4.0-test12-Prio
 --------   ---------   -------   -------   ------------   -----------------
       16      18.740     4.603     1.455           4.51                4.39
       32      17.702     5.134     1.456           5.01                4.06
       64      23.300     5.586     1.466           5.70                3.99
      128      47.273    18.812     1.480          12.06 %              3.99
      256     105.701    71.147     1.517          60.2                 4.05
      512         FRC   143.500     1.661         132.5                 4.19
     1024         FRC   196.425     6.166         295.4 #               4.57
     2048         FRC       FRC    23.291         460.4                 5.34
     4096         FRC       FRC    47.117         631.3                 5.91

 *FRC = failed to reach confidence level

Some comments on some of the numbers:
#) Mike measured 196, I measured 295 ?? Somebody has a typo here I assume.
%) This actually varied between 8 and 14 on multiple runs, averaging 12. Bill Hartner suggests that these might be cache issues (OT).

What you can see from these numbers is that MQ does an awesome job up to 1024 threads. When measuring in the future, we will from now on take the general concern about low numbers of threads into account. Your points are well taken. I'm pretty confident our MQ scheduler will be in a reasonable ballpark of the current scheduler. To go on, the priority==tablebased scheduler does better for very high numbers of processes. It actually beats the vanilla version throughout (>= 16). It stays stable because we stop immediately when we find a process that last ran on the invoking cpu. The only way we could do better is to continue searching for an affinity boost. Here the discussions might start. The next version of the tablebased scheduler will take into account whether the table index only covers one goodness range or multiple (e.g. RT); one possible bucket layout is sketched after this message. This could give some better performance for the general case. The roadmap ahead for Mike, me, and the rest of the crew is to combine these methods. In our first attempt we wanted to demonstrate that MQ does a great job while emulating current scheduler semantics. Now if we relax these semantics just a bit, e.g. by tolerating a bit more priority inversion (which any scheduler that deploys affinity boosts does), we can probably do even better. These are the things we are currently doing, and we should have some results soon:
(1) We are preparing for LWE with a full measurement of the latest kernel. For this purpose we have frozen to 2.4.1-pre8. Unless of course you are telling us this is not a good kernel to run on.
(2) We will measure 1-4096 threads for vanilla, priority and MQ for two tests (both provided by Bill Hartner in Austin).
    (a) sched_yield: although not a meaningful benchmark, it really exposes the raw overhead of scheduling; the problem is that it artificially generates lock contention at a rate we would not see in general applications.
    (b) chatroom: similar to VolanoBenchmark, but easier to use and measure. This gives a better idea of what the impact would be for real applications.
On the progress side:
Now that we already have a good idea of what the MQ and the table==priority based scheduler can do, we want to combine them and see how that impacts performance. Next, we still have the open issue of whether keeping queues in priority order makes sense or not. That exercise should be done for both the MQ and the table-based scheduler. Next, we have started looking into breaking up the CPU set. Right now we scan all CPUs to find an appropriate CPU to preempt. For a large number of CPUs that can be costly, particularly with a very small number (1-4) of threads. We are currently experimenting with breaking the CPUs up into smaller sets and scheduling only within each set, i.e. we don't look beyond the set to balance (e.g. priorities etc). Occasionally (1 HZ) we run a load balancing mechanism to redistribute work. We have a simple prototype running that demonstrates the idea; a sketch follows below. This could be useful for NUMA systems as well. We will post this patch (on top of MQ) on the site soon.
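A minimal user-space sketch of the CPU-set idea just described. The set size, names and load metric are assumptions, not the prototype patch: preemption targets are looked up only within the waking CPU's own set, and a coarse pass that runs about once per second moves load between sets.

/* Illustrative only: partition NR_CPUS into fixed sets; wakeup/preemption
 * decisions scan just the local set, and a periodic (about 1 Hz) pass
 * evens out the runnable counts between sets. */
#include <stdio.h>

#define NR_CPUS   8
#define SET_SIZE  4
#define NR_SETS   (NR_CPUS / SET_SIZE)

static int nr_running[NR_CPUS];          /* runnable tasks per CPU */

static int cpu_set(int cpu) { return cpu / SET_SIZE; }

/* Find the least loaded CPU, looking only inside 'cpu's own set. */
static int pick_cpu_in_set(int cpu)
{
    int first = cpu_set(cpu) * SET_SIZE;
    int best = first, i;

    for (i = first; i < first + SET_SIZE; i++)
        if (nr_running[i] < nr_running[best])
            best = i;
    return best;
}

/* Run roughly once per second: if one set is much busier than another,
 * shift one task's worth of load between them (stand-in for migration). */
static void rebalance_sets(void)
{
    int load[NR_SETS] = {0}, s, busiest = 0, idlest = 0;

    for (s = 0; s < NR_CPUS; s++)
        load[cpu_set(s)] += nr_running[s];
    for (s = 1; s < NR_SETS; s++) {
        if (load[s] > load[busiest]) busiest = s;
        if (load[s] < load[idlest])  idlest = s;
    }
    if (load[busiest] - load[idlest] > 1) {
        nr_running[busiest * SET_SIZE]--;
        nr_running[idlest * SET_SIZE]++;
    }
}

int main(void)
{
    nr_running[0] = 3; nr_running[1] = 2; nr_running[5] = 1;
    printf("wakeup on cpu 1 goes to cpu %d in its set\n", pick_cpu_in_set(1));
    rebalance_sets();
    printf("after rebalance, cpu 4 has %d runnable\n", nr_running[4]);
    return 0;
}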
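And a sketch of the bucket-layout question raised earlier in this message (one goodness range versus several). The constants and the mapping below are assumptions for illustration only; the grounded detail is that 2.4's goodness() returns roughly counter- and nice-based values for SCHED_OTHER tasks and 1000 + rt_priority for real-time tasks, so RT tasks can be given their own band of buckets above all the timeshare ones.

/* Assumed bucket layout for a goodness-indexed table: timeshare goodness in
 * buckets 0..OTHER_BUCKETS-1, real-time priorities in a band above them.
 * Constants and names are illustrative only. */
#include <stdio.h>

#define OTHER_BUCKETS 64   /* assumed: covers counter/nice-based goodness */
#define RT_BUCKETS    100  /* assumed: covers rt_priority 0..99 */
#define TABLE_SIZE    (OTHER_BUCKETS + RT_BUCKETS)

static int goodness_to_index(int goodness)
{
    if (goodness >= 1000) {                /* real-time band */
        int rtprio = goodness - 1000;
        if (rtprio >= RT_BUCKETS)
            rtprio = RT_BUCKETS - 1;
        return OTHER_BUCKETS + rtprio;     /* RT band sits on top */
    }
    if (goodness < 0)
        goodness = 0;                      /* not interesting to run */
    if (goodness >= OTHER_BUCKETS)
        goodness = OTHER_BUCKETS - 1;
    return goodness;
}

int main(void)
{
    printf("timeshare goodness 23 -> bucket %d (of %d)\n",
           goodness_to_index(23), TABLE_SIZE);
    printf("RT goodness 1050      -> bucket %d (of %d)\n",
           goodness_to_index(1050), TABLE_SIZE);
    return 0;
}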