Re: [RFC] scheduler: improve SMP fairness in CFS
Tong Li wrote:
> Without the global locking, the global synchronization here is simply ping-ponging a cache line once in a while. This doesn't look expensive to me, but if it does after benchmarking, adjusting sysctl_base_round_slice can reduce the ping-pong frequency. There might also be a smart implementation that can alleviate this problem.

Scaling it proportionally to migration cost and log2(cpus) should suffice.

> I don't understand why quantizing CPU time is a bad thing. Could you educate me on this?

It depends on how precisely you do it. We save a lot of power going tickless. If round expiration is re-introducing ticks on idle CPUs, we could waste a lot of power. Hardware is getting even more aggressive about power saving, to the point of allowing individual cores to be completely powered off when idle. We need to make sure the scheduler doesn't interfere with power management.

-- Chris
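(A rough sketch of the scaling Chris suggests above -- proportional to migration cost and log2(cpus). Only sysctl_base_round_slice comes from the posted patch; the helper name, the log2 factor and the migration-cost figure are illustrative assumptions, not anyone's actual code.)

/* assumes <linux/log2.h> and <linux/cpumask.h>; all values kept in ns */
static unsigned long dwrr_effective_round_slice(void)
{
	unsigned long slice = sysctl_base_round_slice;	/* e.g. 30 ms, assumed to be in ns */
	unsigned long migration_cost = 500000UL;	/* assumed figure, in ns */

	/* stretch the round as the machine grows ... */
	slice *= max(1, ilog2(num_online_cpus()));

	/* ... and never let it drop below a few task migrations' worth */
	return max(slice, 8 * migration_cost);
}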
Re: [RFC] scheduler: improve SMP fairness in CFS
Tong Li wrote: On Fri, 27 Jul 2007, Chris Snook wrote: Bill Huey (hui) wrote: You have to consider the target for this kind of code. There are applications where you need something that falls within a constant error bound. According to the numbers, the current CFS rebalancing logic doesn't achieve that to any degree of rigor. So CFS is ok for SCHED_OTHER, but not for anything more strict than that.

I've said from the beginning that I think that anyone who desperately needs perfect fairness should be explicitly enforcing it with the aid of realtime priorities. The problem is that configuring and tuning a realtime application is a pain, and people want to be able to approximate this behavior without doing a whole lot of dirty work themselves. I believe that CFS can and should be enhanced to ensure SMP-fairness over potentially short, user-configurable intervals, even for SCHED_OTHER. I do not, however, believe that we should take it to the extreme of wasting CPU cycles on migrations that will not improve performance for *any* task, just to avoid letting some tasks get ahead of others. We should be as fair as possible but no fairer. If we've already made it as fair as possible, we should account for the margin of error and correct for it the next time we rebalance. We should not burn the surplus just to get rid of it.

Proportional-share scheduling actually has one of its roots in real-time, and having a p-fair scheduler is essential for real-time apps (soft real-time).

Sounds like another scheduler class might be in order. I find CFS to be fair enough for most purposes. If the code that gives us near-perfect fairness at the expense of efficiency only runs when tasks have been given boosted priority by a privileged user, and only on the CPUs that have such tasks queued on them, the run time overhead and code complexity become much smaller concerns.

On a non-NUMA box with single-socket, non-SMT processors, a constant error bound is fine. Once we add SMT, go multi-core, go NUMA, and add inter-chassis interconnects on top of that, we need to multiply this error bound at each stage in the hierarchy, or else we'll end up wasting CPU cycles on migrations that actually hurt the processes they're supposed to be helping, and hurt everyone else even more. I believe we should enforce an error bound that is proportional to migration cost.

I think we are actually in agreement. When I say constant bound, it can certainly be a constant that's determined based on inputs from the memory hierarchy. The point is that it needs to be a constant independent of things like # of tasks.

Agreed.

But this patch is only relevant to SCHED_OTHER. The realtime scheduler doesn't have a concept of fairness, just priorities. That's why each realtime priority level has its own separate runqueue. Realtime schedulers are supposed to be dumb as a post, so they cannot heuristically decide to do anything other than precisely what you configured them to do, and so they don't get in the way when you're context switching a million times a second.

Are you referring to hard real-time? As I said, an infrastructure that enables p-fair scheduling, EDF, or the like is the foundation for real-time. I designed DWRR, however, with a target of non-RT apps, although I was hoping the research results might be applicable to RT.

I'm referring to the static priority SCHED_FIFO and SCHED_RR schedulers, which are (intentionally) dumb as a post, allowing userspace to manage CPU time explicitly.
Proportionally fair scheduling is a cool capability, but not a design goal of those schedulers.

-- Chris
Re: [RFC] scheduler: improve SMP fairness in CFS
On Fri, 27 Jul 2007, Chris Snook wrote: Bill Huey (hui) wrote: You have to consider the target for this kind of code. There are applications where you need something that falls within a constant error bound. According to the numbers, the current CFS rebalancing logic doesn't achieve that to any degree of rigor. So CFS is ok for SCHED_OTHER, but not for anything more strict than that.

I've said from the beginning that I think that anyone who desperately needs perfect fairness should be explicitly enforcing it with the aid of realtime priorities. The problem is that configuring and tuning a realtime application is a pain, and people want to be able to approximate this behavior without doing a whole lot of dirty work themselves. I believe that CFS can and should be enhanced to ensure SMP-fairness over potentially short, user-configurable intervals, even for SCHED_OTHER. I do not, however, believe that we should take it to the extreme of wasting CPU cycles on migrations that will not improve performance for *any* task, just to avoid letting some tasks get ahead of others. We should be as fair as possible but no fairer. If we've already made it as fair as possible, we should account for the margin of error and correct for it the next time we rebalance. We should not burn the surplus just to get rid of it.

Proportional-share scheduling actually has one of its roots in real-time, and having a p-fair scheduler is essential for real-time apps (soft real-time).

On a non-NUMA box with single-socket, non-SMT processors, a constant error bound is fine. Once we add SMT, go multi-core, go NUMA, and add inter-chassis interconnects on top of that, we need to multiply this error bound at each stage in the hierarchy, or else we'll end up wasting CPU cycles on migrations that actually hurt the processes they're supposed to be helping, and hurt everyone else even more. I believe we should enforce an error bound that is proportional to migration cost.

I think we are actually in agreement. When I say constant bound, it can certainly be a constant that's determined based on inputs from the memory hierarchy. The point is that it needs to be a constant independent of things like # of tasks.

But this patch is only relevant to SCHED_OTHER. The realtime scheduler doesn't have a concept of fairness, just priorities. That's why each realtime priority level has its own separate runqueue. Realtime schedulers are supposed to be dumb as a post, so they cannot heuristically decide to do anything other than precisely what you configured them to do, and so they don't get in the way when you're context switching a million times a second.

Are you referring to hard real-time? As I said, an infrastructure that enables p-fair scheduling, EDF, or the like is the foundation for real-time. I designed DWRR, however, with a target of non-RT apps, although I was hoping the research results might be applicable to RT.

tong
Re: [RFC] scheduler: improve SMP fairness in CFS
On Fri, 27 Jul 2007, Chris Snook wrote: I don't think that achieving a constant error bound is always a good thing. We all know that fairness has overhead. If I have 3 threads and 2 processors, and I have a choice between fairly giving each thread 1.0 billion cycles during the next second, or unfairly giving two of them 1.1 billion cycles and giving the other 0.9 billion cycles, then we can have a useful discussion about where we want to draw the line on the fairness/performance tradeoff. On the other hand, if we can give two of them 1.1 billion cycles and still give the other one 1.0 billion cycles, it's madness to waste those 0.2 billion cycles just to avoid user jealousy. The more complex the memory topology of a system, the more "free" cycles you'll get by tolerating short-term unfairness. As a crude heuristic, scaling some fairly low tolerance by log2(NCPUS) seems appropriate, but eventually we should take the boot-time computed migration costs into consideration.

I think we are in agreement. To avoid confusion, I think we should be more precise on what fairness means. Lag (i.e., ideal fair time - actual service time) is the commonly used metric for fairness. The definition is that a scheduler is proportionally fair if, for any task in any time interval, the task's lag is bounded by a constant (note it's in terms of absolute time). The knob here is this constant, and it can help trade off performance and fairness. The reason for a constant bound is that we want consistent fairness properties regardless of the number of tasks. For example, we don't want the system to be much less fair as the number of tasks increases. With DWRR, the lag bound is the max weight of currently running tasks, multiplied by sysctl_base_round_slice. So if all tasks are of nice 0, i.e., weight 1, and sysctl_base_round_slice equals 30 ms, then we are guaranteed each task is at most 30 ms off of the ideal case. This is a useful property. Just like what you mentioned about the migration cost, this property allows the scheduler or user to accurately reason about the tradeoffs. If we want to trade fairness for performance, we can increase sysctl_base_round_slice to, say, 100 ms; doing so, we also know accurately the worst impact it has on fairness.

Adding system calls, while great for research, is not something which is done lightly in the published kernel. If we're going to implement a user interface beyond simply interpreting existing priorities more precisely, it would be nice if this was part of a framework with a broader vision, such as a scheduler economy.

Agreed. I've seen papers on scheduler economy but am not familiar enough to comment on it.

Scheduling Algorithm: The scheduler keeps a set of data structures, called Trio groups, to maintain the weight or reservation of each thread group (including one or more threads) and the local weight of each member thread. When scheduling a thread, it consults these data structures and computes (in constant time) a system-wide weight for the thread that represents an equivalent CPU share. Consequently, the scheduling algorithm, DWRR, operates solely based on the system-wide weight (or weight for short, hereafter) of each thread. Having a flat space of system-wide weights for individual threads avoids performing separate scheduling at each level of the group hierarchy and thus greatly simplifies the implementation for group scheduling.

Implementing a flat weight space efficiently is nontrivial. I'm curious to see how you reworked the original patch without global locking.
I simply removed the locking and changed a little bit in idle_balance(). The lock was there to prevent a thread from reading or writing the global highest round value while another thread is writing to it. For writes, it's simple to ensure, without locking, that only one write takes effect when multiple writes are concurrent. For the case where there's one write going on and multiple threads read, without locking the only problem is that a reader may read a stale value and thus think the current highest round is X while it's actually X + 1. The end effect is that a thread can be at most two rounds behind the highest round. This changes DWRR's lag bound to 2 * (max weight of current tasks) * sysctl_base_round_slice, which is still constant.

I had a feeling this patch was originally designed for the O(1) scheduler, and this is why. The old scheduler had expired arrays, so adding a round-expired array wasn't a radical departure from the design. CFS does not have an expired rbtree, so adding one *is* a radical departure from the design. I think we can implement DWRR or something very similar without using this implementation method. Since we've already got a tree of queued tasks, it might be easiest to basically break off one subtree (usually just one task, but not necessarily) and migrate it to a less loaded tree whenever we can reduce the difference be
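(One way to read the lockless scheme described at the start of this message -- a sketch of my own, not the actual reworked patch: let concurrent writers race through an atomic compare-and-swap so exactly one of them advances the round, and accept that readers may briefly see a stale value.)

static atomic_t dwrr_highest_round = ATOMIC_INIT(0);

/* of several concurrent writers, exactly one moves the round forward */
static void dwrr_try_advance_round(int seen_round)
{
	atomic_cmpxchg(&dwrr_highest_round, seen_round, seen_round + 1);
}

/*
 * readers may lag a concurrent writer by one round, which is where the
 * 2 * (max weight) * sysctl_base_round_slice bound described above comes from
 */
static int dwrr_read_highest_round(void)
{
	return atomic_read(&dwrr_highest_round);
}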
Re: [RFC] scheduler: improve SMP fairness in CFS
Bill Huey (hui) wrote: On Fri, Jul 27, 2007 at 07:36:17PM -0400, Chris Snook wrote: I don't think that achieving a constant error bound is always a good thing. We all know that fairness has overhead. If I have 3 threads and 2 processors, and I have a choice between fairly giving each thread 1.0 billion cycles during the next second, or unfairly giving two of them 1.1 billion cycles and giving the other 0.9 billion cycles, then we can have a useful discussion about where we want to draw the line on the fairness/performance tradeoff. On the other hand, if we can give two of them 1.1 billion cycles and still give the other one 1.0 billion cycles, it's madness to waste those 0.2 billion cycles just to avoid user jealousy. The more complex the memory topology of a system, the more "free" cycles you'll get by tolerating short-term unfairness. As a crude heuristic, scaling some fairly low tolerance by log2(NCPUS) seems appropriate, but eventually we should take the boot-time computed migration costs into consideration. You have to consider the target for this kind of code. There are applications where you need something that falls within a constant error bound. According to the numbers, the current CFS rebalancing logic doesn't achieve that to any degree of rigor. So CFS is ok for SCHED_OTHER, but not for anything more strict than that. I've said from the beginning that I think that anyone who desperately needs perfect fairness should be explicitly enforcing it with the aid of realtime priorities. The problem is that configuring and tuning a realtime application is a pain, and people want to be able to approximate this behavior without doing a whole lot of dirty work themselves. I believe that CFS can and should be enhanced to ensure SMP-fairness over potentially short, user-configurable intervals, even for SCHED_OTHER. I do not, however, believe that we should take it to the extreme of wasting CPU cycles on migrations that will not improve performance for *any* task, just to avoid letting some tasks get ahead of others. We should be as fair as possible but no fairer. If we've already made it as fair as possible, we should account for the margin of error and correct for it the next time we rebalance. We should not burn the surplus just to get rid of it. On a non-NUMA box with single-socket, non-SMT processors, a constant error bound is fine. Once we add SMT, go multi-core, go NUMA, and add inter-chassis interconnects on top of that, we need to multiply this error bound at each stage in the hierarchy, or else we'll end up wasting CPU cycles on migrations that actually hurt the processes they're supposed to be helping, and hurt everyone else even more. I believe we should enforce an error bound that is proportional to migration cost. Even the rt overload code (from my memory) is subject to these limitations as well until it's moved to use a single global queue while using CPU binding to turn off that logic. It's the price you pay for accuracy. If we allow a little short-term fairness (and I think we should) we can still account for this unfairness and compensate for it (again, with the same tolerance) at the next rebalancing. Again, it's a function of *when* and depends on that application. Adding system calls, while great for research, is not something which is done lightly in the published kernel. 
If we're going to implement a user interface beyond simply interpreting existing priorities more precisely, it would be nice if this was part of a framework with a broader vision, such as a scheduler economy.

I'm not sure what you mean by scheduler economy, but CFS can and should be extended to handle proportional scheduling, which is outside of the traditional Unix priority semantics. Having a new API to get at this is unavoidable if you want it to eventually support -rt oriented applications that have bandwidth semantics.

A scheduler economy is basically a credit scheduler, augmented to allow processes to exchange credits with each other. If you want to get more sophisticated with fairness, you could price CPU time proportionally to the load on that CPU. I've been house-hunting lately, so I like to think of it in real estate terms. If you're comfortable with your standard of living and you have enough money, you can rent the apartment in the chic part of town, right next to the subway station. If you want to be more frugal because you're saving for retirement, you can get a place out in the suburbs, but the commute will be more of a pain. If you can't make up your mind and keep moving back and forth, you spend a lot on moving and all your stuff gets dented and scratched.

All deadline based schedulers have API mechanisms like this to support extended semantics. This is no different.

I had a feeling this patch was originally designed for the O(1) scheduler, and this is why. The old scheduler had expired arrays, so adding a round-expired a
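(For what it's worth, a toy sketch of the credit/"real estate" pricing idea earlier in this message -- nothing like this exists in the kernel, and every name here is made up: charge each task credits for CPU time at a price proportional to the load of the CPU it ran on, so the busy "downtown" CPUs cost more than the idle "suburbs".)

struct task_credit {
	s64 credits;				/* remaining budget */
};

/* charge @ns of CPU time consumed on a CPU whose current load is @cpu_load */
static void charge_cpu_time(struct task_credit *tc, u64 ns, unsigned long cpu_load)
{
	u64 price = 1 + cpu_load / SCHED_LOAD_SCALE;	/* busier CPU => higher price */

	tc->credits -= ns * price;
}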
Re: [RFC] scheduler: improve SMP fairness in CFS
On Fri, Jul 27, 2007 at 07:36:17PM -0400, Chris Snook wrote: > I don't think that achieving a constant error bound is always a good thing. > We all know that fairness has overhead. If I have 3 threads and 2 > processors, and I have a choice between fairly giving each thread 1.0 > billion cycles during the next second, or unfairly giving two of them 1.1 > billion cycles and giving the other 0.9 billion cycles, then we can have a > useful discussion about where we want to draw the line on the > fairness/performance tradeoff. On the other hand, if we can give two of > them 1.1 billion cycles and still give the other one 1.0 billion cycles, > it's madness to waste those 0.2 billion cycles just to avoid user jealousy. > The more complex the memory topology of a system, the more "free" cycles > you'll get by tolerating short-term unfairness. As a crude heuristic, > scaling some fairly low tolerance by log2(NCPUS) seems appropriate, but > eventually we should take the boot-time computed migration costs into > consideration. You have to consider the target for this kind of code. There are applications where you need something that falls within a constant error bound. According to the numbers, the current CFS rebalancing logic doesn't achieve that to any degree of rigor. So CFS is ok for SCHED_OTHER, but not for anything more strict than that. Even the rt overload code (from my memory) is subject to these limitations as well until it's moved to use a single global queue while using CPU binding to turn off that logic. It's the price you pay for accuracy. > If we allow a little short-term fairness (and I think we should) we can > still account for this unfairness and compensate for it (again, with the > same tolerance) at the next rebalancing. Again, it's a function of *when* and depends on that application. > Adding system calls, while great for research, is not something which is > done lightly in the published kernel. If we're going to implement a user > interface beyond simply interpreting existing priorities more precisely, it > would be nice if this was part of a framework with a broader vision, such > as a scheduler economy. I'm not sure what you mean by scheduler economy, but CFS can and should be extended to handle proportional scheduling which is outside of the traditional Unix priority semantics. Having a new API to get at this is unavoidable if you want it to eventually support -rt oriented appications that have bandwidth semantics. All deadline based schedulers have API mechanisms like this to support extended semantics. This is no different. > I had a feeling this patch was originally designed for the O(1) scheduler, > and this is why. The old scheduler had expired arrays, so adding a > round-expired array wasn't a radical departure from the design. CFS does > not have an expired rbtree, so adding one *is* a radical departure from the > design. I think we can implement DWRR or something very similar without > using this implementation method. Since we've already got a tree of queued > tasks, it might be easiest to basically break off one subtree (usually just > one task, but not necessarily) and migrate it to a less loaded tree > whenever we can reduce the difference between the load on the two trees by > at least half. This would prevent both overcorrection and undercorrection. > The idea of rounds was another implementation detail that bothered me. In > the old scheduler, quantizing CPU time was a necessary evil. 
> Now that we can account for CPU time with nanosecond resolution, doing
> things on an as-needed basis seems more appropriate, and should reduce
> the need for global synchronization.

Well, there's nanosecond resolution with no mechanism that exploits it for rebalancing. Rebalancing in general is a pain and the code for it is generally orthogonal to the in-core scheduler data structures that are in use, so I don't understand the objection to this argument and the choice of methods. If it gets the job done, then these kinds of choices don't have that much meaning.

> In summary, I think the accounting is sound, but the enforcement is
> sub-optimal for the new scheduler. A revision of the algorithm more
> cognizant of the capabilities and design of the current scheduler would
> seem to be in order.

That would be nice. But the amount of error in Tong's solution is much less than that of the current CFS logic, as was previously tested, even without consideration of high-resolution clocks. So you have to give some kind of credit for that approach and recognize that current methods in CFS are technically a dead end if there's a need for strict fairness in a more rigorous run category than SCHED_OTHER.

> I've referenced many times my desire to account for CPU/memory hierarchy in
> these patches. At present, I'm not sure we have sufficient infrastructure
> in the kernel to automatically optimize for system topology, but I think
> whatever de
Re: [RFC] scheduler: improve SMP fairness in CFS
Tong Li wrote: On Fri, 27 Jul 2007, Chris Snook wrote: Tong Li wrote: I'd like to clarify that I'm not trying to push this particular code to the kernel. I'm a researcher. My intent was to point out that we have a problem in the scheduler and my dwrr algorithm can potentially help fix it. The patch itself was merely a proof-of-concept. I'd be thrilled if the algorithm can be proven useful in the real world. I appreciate the people who have given me comments. Since then, I've revised my algorithm/code. Now it doesn't require global locking but retains strong fairness properties (which I was able to prove mathematically). Thanks for doing this work. Please don't take the implementation criticism as a lack of appreciation for the work. I'd like to see dwrr in the scheduler, but I'm skeptical that re-introducing expired runqueues is the most efficient way to do it. Given the inherently controversial nature of scheduler code, particularly that which attempts to enforce fairness, perhaps a concise design document would help us come to an agreement about what we think the scheduler should do and what tradeoffs we're willing to make to do those things. Do you have a design document we could discuss? -- Chris Thanks for the interest. Attached is a design doc I wrote several months ago (with small modifications). It talks about the two pieces of my design: group scheduling and dwrr. The description was based on the original O(1) scheduler, but as my CFS patch showed, the algorithm is applicable to other underlying schedulers as well. It's interesting that I started working on this in January for the purpose of eventually writing a paper about it. So I knew reasonably well the related research work but was totally unaware that people in the Linux community were also working on similar things. This is good. If you are interested, I'd like to help with the algorithms and theory side of the things. tong --- Overview: Trio extends the existing Linux scheduler with support for proportional-share scheduling. It uses a scheduling algorithm, called Distributed Weighted Round-Robin (DWRR), which retains the existing scheduler design as much as possible, and extends it to achieve proportional fairness with O(1) time complexity and a constant error bound, compared to the ideal fair scheduling algorithm. The goal of Trio is not to improve interactive performance; rather, it relies on the existing scheduler for interactivity and extends it to support MP proportional fairness. Trio has two unique features: (1) it enables users to control shares of CPU time for any thread or group of threads (e.g., a process, an application, etc.), and (2) it enables fair sharing of CPU time across multiple CPUs. For example, with ten tasks running on eight CPUs, Trio allows each task to take an equal fraction of the total CPU time. These features enable Trio to complement the existing Linux scheduler to enable greater user flexibility and stronger fairness. Background: Over the years, there has been a lot of criticism that conventional Unix priorities and the nice interface provide insufficient support for users to accurately control CPU shares of different threads or applications. Many have studied scheduling algorithms that achieve proportional fairness. Assuming that each thread has a weight that expresses its desired CPU share, informally, a scheduler is proportionally fair if (1) it is work-conserving, and (2) it allocates CPU time to threads in exact proportion to their weights in any time interval. 
Ideal proportional fairness is impractical since it requires that all runnable threads be running simultaneously and scheduled with infinitesimally small quanta. In practice, every proportional-share scheduling algorithm approximates the ideal algorithm with the goal of achieving a constant error bound. For more theoretical background, please refer to the following papers: I don't think that achieving a constant error bound is always a good thing. We all know that fairness has overhead. If I have 3 threads and 2 processors, and I have a choice between fairly giving each thread 1.0 billion cycles during the next second, or unfairly giving two of them 1.1 billion cycles and giving the other 0.9 billion cycles, then we can have a useful discussion about where we want to draw the line on the fairness/performance tradeoff. On the other hand, if we can give two of them 1.1 billion cycles and still give the other one 1.0 billion cycles, it's madness to waste those 0.2 billion cycles just to avoid user jealousy. The more complex the memory topology of a system, the more "free" cycles you'll get by tolerating short-term unfairness. As a crude heuristic, scaling some fairly low tolerance by log2(NCPUS) seems appropriate, but eventually we should take the boot-time computed migration costs into consideration. [1] A.
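(For readers following the error-bound discussion above, the quantity being bounded is usually formalized as the lag of task i over an interval [t1, t2]: its weighted share of the interval minus the service it actually received. Illustrative code only, not from any posted patch.)

/*
 * lag_i(t1, t2) = (w_i / total_weight) * (t2 - t1) - service_i(t1, t2)
 * Positive lag: the task received less than its share; negative: more.
 */
static s64 task_lag(u64 interval_ns, unsigned int weight,
		    unsigned int total_weight, u64 service_ns)
{
	u64 ideal_ns = interval_ns * weight;

	do_div(ideal_ns, total_weight);		/* this task's ideal share */
	return (s64)ideal_ns - (s64)service_ns;
}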
Re: [RFC] scheduler: improve SMP fairness in CFS
On Fri, Jul 27, 2007 at 12:03:28PM -0700, Tong Li wrote:
> Thanks for the interest. Attached is a design doc I wrote several months
> ago (with small modifications). It talks about the two pieces of my design:
> group scheduling and dwrr. The description was based on the original O(1)
> scheduler, but as my CFS patch showed, the algorithm is applicable to other
> underlying schedulers as well. It's interesting that I started working on
> this in January for the purpose of eventually writing a paper about it. So
> I knew reasonably well the related research work but was totally unaware
> that people in the Linux community were also working on similar things.
> This is good. If you are interested, I'd like to help with the algorithms
> and theory side of the things.

Tong,

This is sufficient as an overview of the algorithm but not detailed enough for it to be a discussable design doc, I believe. You should ask Chris to see what he means by this. Some examples of your rebalancing scheme and how your invariant applies across processor rounds would be helpful for me and possibly others as well.

bill
Re: [RFC] scheduler: improve SMP fairness in CFS
On Fri, 27 Jul 2007, Chris Snook wrote: Tong Li wrote: I'd like to clarify that I'm not trying to push this particular code to the kernel. I'm a researcher. My intent was to point out that we have a problem in the scheduler and my dwrr algorithm can potentially help fix it. The patch itself was merely a proof-of-concept. I'd be thrilled if the algorithm can be proven useful in the real world. I appreciate the people who have given me comments. Since then, I've revised my algorithm/code. Now it doesn't require global locking but retains strong fairness properties (which I was able to prove mathematically). Thanks for doing this work. Please don't take the implementation criticism as a lack of appreciation for the work. I'd like to see dwrr in the scheduler, but I'm skeptical that re-introducing expired runqueues is the most efficient way to do it. Given the inherently controversial nature of scheduler code, particularly that which attempts to enforce fairness, perhaps a concise design document would help us come to an agreement about what we think the scheduler should do and what tradeoffs we're willing to make to do those things. Do you have a design document we could discuss? -- Chris Thanks for the interest. Attached is a design doc I wrote several months ago (with small modifications). It talks about the two pieces of my design: group scheduling and dwrr. The description was based on the original O(1) scheduler, but as my CFS patch showed, the algorithm is applicable to other underlying schedulers as well. It's interesting that I started working on this in January for the purpose of eventually writing a paper about it. So I knew reasonably well the related research work but was totally unaware that people in the Linux community were also working on similar things. This is good. If you are interested, I'd like to help with the algorithms and theory side of the things. tong --- Overview: Trio extends the existing Linux scheduler with support for proportional-share scheduling. It uses a scheduling algorithm, called Distributed Weighted Round-Robin (DWRR), which retains the existing scheduler design as much as possible, and extends it to achieve proportional fairness with O(1) time complexity and a constant error bound, compared to the ideal fair scheduling algorithm. The goal of Trio is not to improve interactive performance; rather, it relies on the existing scheduler for interactivity and extends it to support MP proportional fairness. Trio has two unique features: (1) it enables users to control shares of CPU time for any thread or group of threads (e.g., a process, an application, etc.), and (2) it enables fair sharing of CPU time across multiple CPUs. For example, with ten tasks running on eight CPUs, Trio allows each task to take an equal fraction of the total CPU time. These features enable Trio to complement the existing Linux scheduler to enable greater user flexibility and stronger fairness. Background: Over the years, there has been a lot of criticism that conventional Unix priorities and the nice interface provide insufficient support for users to accurately control CPU shares of different threads or applications. Many have studied scheduling algorithms that achieve proportional fairness. Assuming that each thread has a weight that expresses its desired CPU share, informally, a scheduler is proportionally fair if (1) it is work-conserving, and (2) it allocates CPU time to threads in exact proportion to their weights in any time interval. 
Ideal proportional fairness is impractical since it requires that all runnable threads be running simultaneously and scheduled with infinitesimally small quanta. In practice, every proportional-share scheduling algorithm approximates the ideal algorithm with the goal of achieving a constant error bound. For more theoretical background, please refer to the following papers: [1] A. K. Parekh and R. G. Gallager. A generalized processor sharing approach to flow control in integrated services networks: The single-node case. IEEE/ACM Transactions on Networking, 1(3):344-357, June 1993. [2] C. R. Bennett and H. Zhang. WF2Q: Worst-case fair weighted fair queueing. In Proceedings of IEEE INFOCOM '94, pages 120-128, Mar. 1996. Previous proportional-share scheduling algorithms, however, suffer one or more of the following problems: (1) Inaccurate fairness with non-constant error bounds; (2) High run-time overhead (e.g., logarithmic); (3) Poor scalability due to the use of a global thread queue; (4) Inefficient support for latency-sensitive applications. Since the Linux scheduler has been successful at avoiding problems 2 to 4, this design attempts to extend it with support for accurate proportional fairness while retaining all of its existing benefits. User Interface: By default, each thread is assigned a weight proportional to its stat
Re: [RFC] scheduler: improve SMP fairness in CFS
Tong Li wrote: I'd like to clarify that I'm not trying to push this particular code to the kernel. I'm a researcher. My intent was to point out that we have a problem in the scheduler and my dwrr algorithm can potentially help fix it. The patch itself was merely a proof-of-concept. I'd be thrilled if the algorithm can be proven useful in the real world. I appreciate the people who have given me comments. Since then, I've revised my algorithm/code. Now it doesn't require global locking but retains strong fairness properties (which I was able to prove mathematically).

Thanks for doing this work. Please don't take the implementation criticism as a lack of appreciation for the work. I'd like to see dwrr in the scheduler, but I'm skeptical that re-introducing expired runqueues is the most efficient way to do it. Given the inherently controversial nature of scheduler code, particularly that which attempts to enforce fairness, perhaps a concise design document would help us come to an agreement about what we think the scheduler should do and what tradeoffs we're willing to make to do those things. Do you have a design document we could discuss?

-- Chris
Re: [RFC] scheduler: improve SMP fairness in CFS
I'd like to clarify that I'm not trying to push this particular code to the kernel. I'm a researcher. My intent was to point out that we have a problem in the scheduler and my dwrr algorithm can potentially help fix it. The patch itself was merely a proof-of-concept. I'd be thrilled if the algorithm can be proven useful in the real world. I appreciate the people who have given me comments. Since then, I've revised my algorithm/code. Now it doesn't require global locking but retains strong fairness properties (which I was able to prove mathematically).

Thank you,

tong

On Thu, 26 Jul 2007, Li, Tong N wrote: On Thu, 2007-07-26 at 23:31 +0200, Ingo Molnar wrote: * Tong Li <[EMAIL PROTECTED]> wrote:

you need to measure it over longer periods of time. It's not worth balancing for such a thing in any high-frequency manner. (we'd trash the cache constantly migrating tasks back and forth.)

I have some data below, but before that, I'd like to say, at the same load balancing rate, my proposed approach would allow us to have fairness on the order of seconds. I'm less concerned about trashing the cache. The important thing is to have a knob that allows users to trade off fairness and performance based on their needs. [...]

such a knob already exists to a certain degree, but i haven't tested its full effects on SMP fairness yet. If you pull my scheduler tree:

git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched.git

and if you enable CONFIG_SCHED_DEBUG, then all the sched-domain parameters become runtime tunable under /proc/sys/cpu*.

Could you try to increase the cross-CPU rebalancing frequency and see how it impacts the precision of your measurement? Tune 'min_interval' and 'max_interval' down to increase the frequency of rebalancing.

Ingo

Yes, I'll do it when I find time. If anyone is willing to do the testing, please let me know and I can post my benchmark. On the other hand, I don't think tuning the existing knobs will help solve the problem. The root of the problem is that the current load balancing doesn't take into account how much time each task has been running and how much it's entitled to. This is why I'm proposing a new approach for solving it. The new approach, as I said, will be much more fair/accurate than the current one even without tuning those balancing intervals (i.e., it's able to provide good fairness even if balancing is less frequent).

tong
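(Illustrative only, not from any posted patch: one way to read "taking entitlement into account" above is that, when the balancer pulls a task, it should prefer the one furthest behind its weighted share rather than whichever task happens to be movable. A minimal sketch of the bookkeeping that would require:)

struct fair_account {
	u64 received_ns;	/* CPU time the task actually used */
	u64 entitled_ns;	/* weight-proportional share accumulated so far */
};

/*
 * positive: the task is owed CPU time; candidates with the largest
 * deficit would be pulled first during rebalancing
 */
static s64 entitlement_deficit(const struct fair_account *fa)
{
	return (s64)fa->entitled_ns - (s64)fa->received_ns;
}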
Re: [RFC] scheduler: improve SMP fairness in CFS
On Thu, 2007-07-26 at 23:31 +0200, Ingo Molnar wrote:
> * Tong Li <[EMAIL PROTECTED]> wrote:
> > > you need to measure it over longer periods of time. It's not worth
> > > balancing for such a thing in any high-frequency manner. (we'd trash
> > > the cache constantly migrating tasks back and forth.)
> >
> > I have some data below, but before that, I'd like to say, at the same
> > load balancing rate, my proposed approach would allow us to have
> > fairness on the order of seconds. I'm less concerned about trashing
> > the cache. The important thing is to have a knob that allows users to
> > trade off fairness and performance based on their needs. [...]
>
> such a knob already exists to a certain degree, but i haven't tested its
> full effects on SMP fairness yet. If you pull my scheduler tree:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched.git
>
> and if you enable CONFIG_SCHED_DEBUG, then all the sched-domain
> parameters become runtime tunable under /proc/sys/cpu*.
>
> Could you try to increase the cross-CPU rebalancing frequency and see
> how it impacts the precision of your measurement? Tune 'min_interval'
> and 'max_interval' down to increase the frequency of rebalancing.
>
> Ingo

Yes, I'll do it when I find time. If anyone is willing to do the testing, please let me know and I can post my benchmark. On the other hand, I don't think tuning the existing knobs will help solve the problem. The root of the problem is that the current load balancing doesn't take into account how much time each task has been running and how much it's entitled to. This is why I'm proposing a new approach for solving it. The new approach, as I said, will be much more fair/accurate than the current one even without tuning those balancing intervals (i.e., it's able to provide good fairness even if balancing is less frequent).

tong
Re: [RFC] scheduler: improve SMP fairness in CFS
* Tong Li <[EMAIL PROTECTED]> wrote:
> > you need to measure it over longer periods of time. It's not worth
> > balancing for such a thing in any high-frequency manner. (we'd trash
> > the cache constantly migrating tasks back and forth.)
>
> I have some data below, but before that, I'd like to say, at the same
> load balancing rate, my proposed approach would allow us to have
> fairness on the order of seconds. I'm less concerned about trashing
> the cache. The important thing is to have a knob that allows users to
> trade off fairness and performance based on their needs. [...]

such a knob already exists to a certain degree, but i haven't tested its full effects on SMP fairness yet. If you pull my scheduler tree:

git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched.git

and if you enable CONFIG_SCHED_DEBUG, then all the sched-domain parameters become runtime tunable under /proc/sys/cpu*.

Could you try to increase the cross-CPU rebalancing frequency and see how it impacts the precision of your measurement? Tune 'min_interval' and 'max_interval' down to increase the frequency of rebalancing.

Ingo
Re: [RFC] scheduler: improve SMP fairness in CFS
On Wed, 25 Jul 2007, Ingo Molnar wrote: * Tong Li <[EMAIL PROTECTED]> wrote:
> Thanks for the patch. It doesn't work well on my 8-way box. Here's the
> output at two different times. It's also changing all the time.

you need to measure it over longer periods of time. It's not worth balancing for such a thing in any high-frequency manner. (we'd trash the cache constantly migrating tasks back and forth.)

I have some data below, but before that, I'd like to say, at the same load balancing rate, my proposed approach would allow us to have fairness on the order of seconds. I'm less concerned about trashing the cache. The important thing is to have a knob that allows users to trade off fairness and performance based on their needs. This is the same spirit as in CFS, which by default can have much more frequent context switches than the old O(1) scheduler. A valid concern is that this is going to hurt cache, and this is something that has not yet been benchmarked thoroughly. On the other hand, I fully support CFS on this because I like the idea of having a powerful infrastructure in place and exporting a set of knobs that give users full flexibility. This is also what I'm trying to achieve with my patch for SMP fairness.

the runtime ranges from 1:38.66 to 1:46.38 - that's a +-3% spread which is pretty acceptable for ~90 seconds of runtime. Note that the percentages you see above were likely done over a shorter period - that's why you see them range from 61% to 91%.

> I'm running 2.6.23-rc1 with your patch on a two-socket quad-core
> system. Top refresh rate is the default 3s.

the 3s is the problem: change that to 60s! We no way want to over-migrate for SMP fairness, the change i did gives us reasonable long-term SMP fairness without the need for high-rate rebalancing.

I ran 10 nice 0 tasks on 8 CPUs. Each task does a trivial while (1) loop. I ran the whole thing 300 seconds and, at every 60 seconds, I measured for each task the following:

1. Actual CPU time the task received during the past 60 seconds.
2. Ideal CPU time it would receive under a perfect fair scheduler.
3. Lag = ideal time - actual time. Positive lag means the task received less CPU time than its fair share and negative means it received more.
4. Error = lag / ideal time

Results of CFS (2.6.23-rc1 with Ingo's SCHED_LOAD_SCALE_FUZZ change): sampling interval 60, sampling length 300, num tasks 10.

- Task 0 (unit: second) nice 0 weight 1024
Wall Clock      CPU Time   Ideal Time   Lag      Error
04:17:09.314    51.14      48.00        -3.14    -6.54%
04:18:09.316    46.05      48.00         1.95     4.06%
04:19:09.317    44.13      48.00         3.87     8.06%
04:20:09.318    44.78      48.00         3.22     6.71%
04:21:09.320    45.35      48.00         2.65     5.52%

- Task 1 (unit: second) nice 0 weight 1024
Wall Clock      CPU Time   Ideal Time   Lag      Error
04:17:09.314    47.06      48.00         0.94     1.96%
04:18:09.316    45.80      48.00         2.20     4.58%
04:19:09.317    49.95      48.00        -1.95    -4.06%
04:20:09.318    47.59      48.00         0.41     0.85%
04:21:09.320    46.50      48.00         1.50     3.12%

- Task 2 (unit: second) nice 0 weight 1024
Wall Clock      CPU Time   Ideal Time   Lag      Error
04:17:09.314    48.14      48.00        -0.14    -0.29%
04:18:09.316    48.66      48.00        -0.66    -1.37%
04:19:09.317    47.52      48.00         0.48     1.00%
04:20:09.318    45.90      48.00         2.10     4.37%
04:21:09.320    45.89      48.00         2.11     4.40%

- Task 3 (unit: second) nice 0 weight 1024
Wall Clock      CPU Time   Ideal Time   Lag      Error
04:17:09.314    48.71      48.00        -0.71    -1.48%
04:18:09.316    44.70      48.00         3.30     6.87%
04:19:09.317    45.86      48.00         2.14     4.46%
04:20:09.318    45.85      48.00         2.15     4.48%
04:21:09.320    44.91      48.00         3.09     6.44%

- Task 4 (unit: second) nice 0 weight 1024
Wall Clock      CPU Time   Ideal Time   Lag      Error
04:17:09.314    49.24      48.00        -1.24    -2.58%
04:18:09.316    45.73      48.00         2.27     4.73%
04:19:09.317    49.13      48.00        -1.13    -2.35%
04:20:09.318    45.37      48.00         2.63     5.48%
04:21:09.320    45.75      48.00         2.25     4.69%

---
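(Worked check of the first row above, just to make the columns explicit: ten tasks sharing eight CPUs gives each task an ideal share of 8/10 of a CPU, i.e. 0.8 * 60 s = 48.00 s per 60 s interval.)

/*
 * Task 0, first interval: ideal = 0.8 * 60 s = 48.00 s, actual = 51.14 s
 *   lag   = 48.00 - 51.14 = -3.14 s
 *   error = -3.14 / 48.00 = -6.54%
 * which matches the reported Lag and Error columns.
 */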
Re: [RFC] scheduler: improve SMP fairness in CFS
Li, Tong N wrote: On Wed, 2007-07-25 at 16:55 -0400, Chris Snook wrote: Chris Friesen wrote: Ingo Molnar wrote: the 3s is the problem: change that to 60s! We no way want to over-migrate for SMP fairness, the change i did gives us reasonable long-term SMP fairness without the need for high-rate rebalancing. Actually, I do have requirements from our engineering guys for short-term fairness. They'd actually like decent fairness over even shorter intervals...1 second would be nice, 2 is acceptable. They are willing to trade off random peak performance for predictability. Chris The sysctls for CFS have nanosecond resolution. They default to millisecond-order values, but you can set them much lower. See sched_fair.c for the knobs and their explanations. -- Chris This is incorrect. Those knobs control local-CPU fairness granularity but have no control over fairness across CPUs. I'll do some benchmarking as Ingo suggested. tong

CFS naturally enforces cross-CPU fairness anyway, as Ingo demonstrated. Lowering the local CPU parameters should cause system-wide fairness to converge faster. It might be worthwhile to create a more explicit knob for this, but I'm inclined to believe we could do it in much less than 700 lines. Ingo's one-liner to improve the 10/8 balancing case, and the resulting improvement, were exactly what I was saying should be possible and desirable. TCP Nagle aside, it generally shouldn't take 700 lines of code to speed up the rate of convergence of something that already converges. Until now I've been watching the scheduler rewrite from the sidelines, but I'm digging into it now. I'll try to give some more constructive criticism soon.

-- Chris
Re: [RFC] scheduler: improve SMP fairness in CFS
On Wed, 2007-07-25 at 16:55 -0400, Chris Snook wrote:
> Chris Friesen wrote:
> > Ingo Molnar wrote:
> >
> >> the 3s is the problem: change that to 60s! We no way want to
> >> over-migrate for SMP fairness, the change i did gives us reasonable
> >> long-term SMP fairness without the need for high-rate rebalancing.
> >
> > Actually, I do have requirements from our engineering guys for
> > short-term fairness. They'd actually like decent fairness over even
> > shorter intervals...1 second would be nice, 2 is acceptable.
> >
> > They are willing to trade off random peak performance for predictability.
> >
> > Chris
>
> The sysctls for CFS have nanosecond resolution. They default to
> millisecond-order values, but you can set them much lower. See sched_fair.c
> for the knobs and their explanations.
>
> -- Chris

This is incorrect. Those knobs control local-CPU fairness granularity but have no control over fairness across CPUs. I'll do some benchmarking as Ingo suggested.

tong
Re: [RFC] scheduler: improve SMP fairness in CFS
Chris Friesen wrote: Ingo Molnar wrote: the 3s is the problem: change that to 60s! We no way want to over-migrate for SMP fairness, the change i did gives us reasonable long-term SMP fairness without the need for high-rate rebalancing. Actually, I do have requirements from our engineering guys for short-term fairness. They'd actually like decent fairness over even shorter intervals...1 second would be nice, 2 is acceptable. They are willing to trade off random peak performance for predictability. Chris

The sysctls for CFS have nanosecond resolution. They default to millisecond-order values, but you can set them much lower. See sched_fair.c for the knobs and their explanations.

-- Chris
Re: [RFC] scheduler: improve SMP fairness in CFS
Ingo Molnar wrote: the 3s is the problem: change that to 60s! We no way want to over-migrate for SMP fairness, the change i did gives us reasonable long-term SMP fairness without the need for high-rate rebalancing.

Actually, I do have requirements from our engineering guys for short-term fairness. They'd actually like decent fairness over even shorter intervals...1 second would be nice, 2 is acceptable. They are willing to trade off random peak performance for predictability.

Chris
Re: [RFC] scheduler: improve SMP fairness in CFS
* Tong Li <[EMAIL PROTECTED]> wrote:
> Thanks for the patch. It doesn't work well on my 8-way box. Here's the
> output at two different times. It's also changing all the time.

you need to measure it over longer periods of time. It's not worth balancing for such a thing in any high-frequency manner. (we'd trash the cache constantly migrating tasks back and forth.)

> [time 2]
> 3721 tongli  20   0  7864  108   16 R   91  0.0   1:44.67 loop
> 3720 tongli  20   0  7864  108   16 R   89  0.0   1:40.55 loop
> 3722 tongli  20   0  7864  108   16 R   86  0.0   1:41.40 loop
> 3723 tongli  20   0  7864  108   16 R   82  0.0   1:38.94 loop
> 3715 tongli  20   0  7864  108   16 R   82  0.0   1:43.09 loop
> 3717 tongli  20   0  7864  108   16 R   79  0.0   1:46.38 loop
> 3718 tongli  20   0  7864  108   16 R   79  0.0   1:40.72 loop
> 3714 tongli  20   0  7864  108   16 R   77  0.0   1:40.17 loop
> 3719 tongli  20   0  7864  108   16 R   73  0.0   1:35.41 loop
> 3716 tongli  20   0  7864  108   16 R   61  0.0   1:38.66 loop

the runtime ranges from 1:38.66 to 1:46.38 - that's a +-3% spread which is pretty acceptable for ~90 seconds of runtime. Note that the percentages you see above were likely done over a shorter period - that's why you see them range from 61% to 91%.

> I'm running 2.6.23-rc1 with your patch on a two-socket quad-core
> system. Top refresh rate is the default 3s.

the 3s is the problem: change that to 60s! We no way want to over-migrate for SMP fairness, the change i did gives us reasonable long-term SMP fairness without the need for high-rate rebalancing.

Ingo
Re: [RFC] scheduler: improve SMP fairness in CFS
* Li, Tong N <[EMAIL PROTECTED]> wrote:
> The problem I see is that the current load balancing code is quite
> heuristic and can be quite inaccurate sometimes. Its goal is to
> maintain roughly equal load on each core, where load is defined to be
> the sum of the weights of all tasks on a core. If all tasks have the
> same weight, this is a simple problem. If different weights exist,
> this is an NP-hard problem and our current heuristic can perform badly
> under various workloads. A simple example: if we have four tasks on
> two cores and they have weights 1, 5, 7, 7. The balanced partition
> would have 1 and 7 on core 1 and 5 and 7 on core 2. But you can see
> the load isn't evenly distributed; in fact, it's not possible to
> evenly distribute the load in this case. Thus, my opinion is that only
> balancing the load as we do now is not enough to achieve SMP fairness.
> My patch was intended to address this problem.

could you please try my patch nevertheless and see how much it differs in practice from your solution, and cite specific examples and measurements? (please measure over longer periods of time such as 120 seconds or 300 seconds - so that slow-rate balancing has its chance to work)

Ingo
Re: [RFC] scheduler: improve SMP fairness in CFS
On Wed, 2007-07-25 at 14:03 +0200, Ingo Molnar wrote:
> Signed-off-by: Ingo Molnar <[EMAIL PROTECTED]>
> ---
>  include/linux/sched.h |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> Index: linux/include/linux/sched.h
> ===================================================================
> --- linux.orig/include/linux/sched.h
> +++ linux/include/linux/sched.h
> @@ -681,7 +681,7 @@ enum cpu_idle_type {
>  #define SCHED_LOAD_SHIFT	10
>  #define SCHED_LOAD_SCALE	(1L << SCHED_LOAD_SHIFT)
>
> -#define SCHED_LOAD_SCALE_FUZZ	(SCHED_LOAD_SCALE >> 5)
> +#define SCHED_LOAD_SCALE_FUZZ	(SCHED_LOAD_SCALE >> 1)
>
>  #ifdef CONFIG_SMP
>  #define SD_LOAD_BALANCE		1	/* Do load balancing on this domain. */

Hi Ingo,

The problem I see is that the current load balancing code is quite heuristic and can be quite inaccurate sometimes. Its goal is to maintain roughly equal load on each core, where load is defined to be the sum of the weights of all tasks on a core. If all tasks have the same weight, this is a simple problem. If different weights exist, this is an NP-hard problem and our current heuristic can perform badly under various workloads. A simple example: if we have four tasks on two cores and they have weights 1, 5, 7, 7. The balanced partition would have 1 and 7 on core 1 and 5 and 7 on core 2. But you can see the load isn't evenly distributed; in fact, it's not possible to evenly distribute the load in this case. Thus, my opinion is that only balancing the load as we do now is not enough to achieve SMP fairness. My patch was intended to address this problem.

tong

PS. I now have a lockless implementation of my patch. Will post later today.
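(For reference, the arithmetic behind the quoted one-liner, written out. The values come straight from the hunk above; the description of the effect is only my reading of Ingo's changelog.)

#define SCHED_LOAD_SHIFT	10
#define SCHED_LOAD_SCALE	(1L << SCHED_LOAD_SHIFT)	/* 1024 */

#define OLD_FUZZ	(SCHED_LOAD_SCALE >> 5)	/* 1024 / 32 = 32,  1/32 of a nice-0 task's load */
#define NEW_FUZZ	(SCHED_LOAD_SCALE >> 1)	/* 1024 / 2  = 512, half of a nice-0 task's load */

/*
 * i.e. the over-balancing margin grows 16x, which appears to be what nudges
 * the balancer into rotating the 10-tasks-on-8-CPUs case instead of leaving
 * some tasks stuck at 50%.
 */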
Re: [RFC] scheduler: improve SMP fairness in CFS
On Wed, 25 Jul 2007, Ingo Molnar wrote: they now nicely converge to the expected 80% long-term CPU usage. so, could you please try the patch below, does it work for you too?

Thanks for the patch. It doesn't work well on my 8-way box. Here's the output at two different times. It's also changing all the time.

[time 1]
3717 tongli  20   0  7864  108   16 R   85  0.0   1:38.00 loop
3721 tongli  20   0  7864  108   16 R   85  0.0   1:37.51 loop
3723 tongli  20   0  7864  108   16 R   84  0.0   1:30.48 loop
3714 tongli  20   0  7864  108   16 R   83  0.0   1:33.25 loop
3720 tongli  20   0  7864  108   16 R   81  0.0   1:31.89 loop
3715 tongli  20   0  7864  108   16 R   78  0.0   1:36.38 loop
3719 tongli  20   0  7864  108   16 R   78  0.0   1:28.80 loop
3718 tongli  20   0  7864  108   16 R   78  0.0   1:33.84 loop
3722 tongli  20   0  7864  108   16 R   76  0.0   1:34.69 loop
3716 tongli  20   0  7864  108   16 R   70  0.0   1:33.15 loop

[time 2]
3721 tongli  20   0  7864  108   16 R   91  0.0   1:44.67 loop
3720 tongli  20   0  7864  108   16 R   89  0.0   1:40.55 loop
3722 tongli  20   0  7864  108   16 R   86  0.0   1:41.40 loop
3723 tongli  20   0  7864  108   16 R   82  0.0   1:38.94 loop
3715 tongli  20   0  7864  108   16 R   82  0.0   1:43.09 loop
3717 tongli  20   0  7864  108   16 R   79  0.0   1:46.38 loop
3718 tongli  20   0  7864  108   16 R   79  0.0   1:40.72 loop
3714 tongli  20   0  7864  108   16 R   77  0.0   1:40.17 loop
3719 tongli  20   0  7864  108   16 R   73  0.0   1:35.41 loop
3716 tongli  20   0  7864  108   16 R   61  0.0   1:38.66 loop

I'm running 2.6.23-rc1 with your patch on a two-socket quad-core system. Top refresh rate is the default 3s.

tong
Re: [RFC] scheduler: improve SMP fairness in CFS
* Ingo Molnar <[EMAIL PROTECTED]> wrote:
> > This patch extends CFS to achieve better fairness for SMPs. For
> > example, with 10 tasks (same priority) on 8 CPUs, it enables each task
> > to receive equal CPU time (80%). [...]
>
> hm, CFS should already offer reasonable long-term SMP fairness. It
> certainly works on a dual-core box, i just started 3 tasks of the same
> priority on 2 CPUs, and on vanilla 2.6.23-rc1 the distribution is this:
>
>  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 7084 mingo 20 0 1576 248 196 R 67 0.0 0:50.13 loop
> 7083 mingo 20 0 1576 244 196 R 66 0.0 0:48.86 loop
> 7085 mingo 20 0 1576 244 196 R 66 0.0 0:49.45 loop
>
> so each task gets a perfect 66% of CPU time.
>
> prior CFS, we indeed did a 50%/50%/100% split - so for example on v2.6.22:
>
>  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 2256 mingo 25 0 1580 248 196 R 100 0.0 1:03.19 loop
> 2255 mingo 25 0 1580 248 196 R 50 0.0 0:31.79 loop
> 2257 mingo 25 0 1580 248 196 R 50 0.0 0:31.69 loop
>
> but CFS has changed that behavior.
>
> I'll check your 10-tasks-on-8-cpus example on an 8-way box too, maybe
> we regressed somewhere ...

ok, i just tried it on an 8-cpu box and indeed, unlike the dual-core case, the scheduler does not distribute tasks well enough:

 PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2572 mingo 20 0 1576 244 196 R 100 0.0 1:03.61 loop
2578 mingo 20 0 1576 248 196 R 100 0.0 1:03.59 loop
2576 mingo 20 0 1576 248 196 R 100 0.0 1:03.52 loop
2571 mingo 20 0 1576 244 196 R 100 0.0 1:03.46 loop
2569 mingo 20 0 1576 244 196 R 99 0.0 1:03.36 loop
2570 mingo 20 0 1576 244 196 R 95 0.0 1:00.55 loop
2577 mingo 20 0 1576 248 196 R 50 0.0 0:31.88 loop
2574 mingo 20 0 1576 248 196 R 50 0.0 0:31.87 loop
2573 mingo 20 0 1576 248 196 R 50 0.0 0:31.86 loop
2575 mingo 20 0 1576 248 196 R 50 0.0 0:31.86 loop

but this is relatively easy to fix - with the patch below applied, it looks a lot better:

 PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2681 mingo 20 0 1576 244 196 R 85 0.0 3:51.68 loop
2688 mingo 20 0 1576 244 196 R 81 0.0 3:46.35 loop
2682 mingo 20 0 1576 244 196 R 80 0.0 3:43.68 loop
2685 mingo 20 0 1576 248 196 R 80 0.0 3:45.97 loop
2683 mingo 20 0 1576 248 196 R 80 0.0 3:40.25 loop
2679 mingo 20 0 1576 244 196 R 80 0.0 3:33.53 loop
2680 mingo 20 0 1576 244 196 R 79 0.0 3:43.53 loop
2686 mingo 20 0 1576 244 196 R 79 0.0 3:39.31 loop
2687 mingo 20 0 1576 244 196 R 78 0.0 3:33.31 loop
2684 mingo 20 0 1576 244 196 R 77 0.0 3:27.52 loop

they now nicely converge to the expected 80% long-term CPU usage.

so, could you please try the patch below, does it work for you too?

	Ingo

--->
Subject: sched: increase SCHED_LOAD_SCALE_FUZZ
From: Ingo Molnar <[EMAIL PROTECTED]>

increase SCHED_LOAD_SCALE_FUZZ, which adds a small amount of over-balancing, to help distribute CPU-bound tasks more fairly on SMP systems.

the problem of unfair balancing was noticed and reported by Tong N Li.
10 CPU-bound tasks running on 8 CPUs, v2.6.23-rc1:

 PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2572 mingo 20 0 1576 244 196 R 100 0.0 1:03.61 loop
2578 mingo 20 0 1576 248 196 R 100 0.0 1:03.59 loop
2576 mingo 20 0 1576 248 196 R 100 0.0 1:03.52 loop
2571 mingo 20 0 1576 244 196 R 100 0.0 1:03.46 loop
2569 mingo 20 0 1576 244 196 R 99 0.0 1:03.36 loop
2570 mingo 20 0 1576 244 196 R 95 0.0 1:00.55 loop
2577 mingo 20 0 1576 248 196 R 50 0.0 0:31.88 loop
2574 mingo 20 0 1576 248 196 R 50 0.0 0:31.87 loop
2573 mingo 20 0 1576 248 196 R 50 0.0 0:31.86 loop
2575 mingo 20 0 1576 248 196 R 50 0.0 0:31.86 loop

v2.6.23-rc1 + patch:

 PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2681 mingo 20 0 1576 244 196 R 85 0.0 3:51.68 loop
2688 mingo 20 0 1576 244 196 R 81 0.0 3:46.35 loop
2682 mingo 20 0 1576 244 196 R 80 0.0 3:43.68 loop
2685 mingo 20 0 1576 248 196 R 80 0.0 3:45.97 loop
2683 mingo 20 0 1576 248 196 R 80 0.0 3:40.25 loop
2679 mingo 20 0 1576 244 196 R 80 0.0 3:33.53 loop
2680 mingo 20 0 1576 244 196 R 79 0.0 3:43.53 loop
2686 mingo 20 0 1576 244 196 R 79 0.0 3:39.31 loop
2687 mingo 20 0 1576 244 196 R 78 0.0 3:33.31 loop
2684 mingo 20 0 1576 244 196 R 77 0.0 3:27.52 loop

so they now nicely converge to the expected 80% long-term CPU usage.
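For readers without the tree handy, the constants behind the one-line change quoted elsewhere in the thread are easy to reproduce. The snippet below only evaluates the two fuzz values from the quoted macros; how the fuzz feeds into the balancer's imbalance checks is not shown in this thread, so treat anything beyond the raw numbers as interpretation:

#include <stdio.h>

#define SCHED_LOAD_SHIFT	10
#define SCHED_LOAD_SCALE	(1L << SCHED_LOAD_SHIFT)	/* 1024 = load of one nice-0 task */

#define FUZZ_OLD		(SCHED_LOAD_SCALE >> 5)		/* before the patch */
#define FUZZ_NEW		(SCHED_LOAD_SCALE >> 1)		/* after the patch  */

int main(void)
{
	printf("old fuzz: %ld (%.1f%% of one nice-0 task's load)\n",
	       FUZZ_OLD, 100.0 * FUZZ_OLD / SCHED_LOAD_SCALE);
	printf("new fuzz: %ld (%.1f%% of one nice-0 task's load)\n",
	       FUZZ_NEW, 100.0 * FUZZ_NEW / SCHED_LOAD_SCALE);
	return 0;
}

So the tolerated fuzz goes from about 3% of one nice-0 task's load (32) to half a task (512), which is the "small amount of over-balancing" the changelog describes.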
Re: [RFC] scheduler: improve SMP fairness in CFS
* Tong Li <[EMAIL PROTECTED]> wrote:
> This patch extends CFS to achieve better fairness for SMPs. For
> example, with 10 tasks (same priority) on 8 CPUs, it enables each task
> to receive equal CPU time (80%). [...]

hm, CFS should already offer reasonable long-term SMP fairness. It certainly works on a dual-core box, i just started 3 tasks of the same priority on 2 CPUs, and on vanilla 2.6.23-rc1 the distribution is this:

 PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7084 mingo 20 0 1576 248 196 R 67 0.0 0:50.13 loop
7083 mingo 20 0 1576 244 196 R 66 0.0 0:48.86 loop
7085 mingo 20 0 1576 244 196 R 66 0.0 0:49.45 loop

so each task gets a perfect 66% of CPU time.

prior CFS, we indeed did a 50%/50%/100% split - so for example on v2.6.22:

 PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2256 mingo 25 0 1580 248 196 R 100 0.0 1:03.19 loop
2255 mingo 25 0 1580 248 196 R 50 0.0 0:31.79 loop
2257 mingo 25 0 1580 248 196 R 50 0.0 0:31.69 loop

but CFS has changed that behavior.

I'll check your 10-tasks-on-8-cpus example on an 8-way box too, maybe we regressed somewhere ...

	Ingo

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
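The 66% and 80% figures are simply the ideal equal split of the available CPUs among equal-weight, always-runnable tasks; a two-line helper makes the arithmetic explicit:

#include <stdio.h>

static void share(int ncpus, int ntasks)
{
	printf("%2d tasks on %d CPUs -> %.1f%% of a CPU each\n",
	       ntasks, ncpus, 100.0 * ncpus / ntasks);
}

int main(void)
{
	share(2, 3);	/* the dual-core test:  66.7% */
	share(8, 10);	/* the 8-way test:      80.0% */
	return 0;
}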
Re: [RFC] scheduler: improve SMP fairness in CFS
Chris Snook wrote: A fraction of *each* CPU, or a fraction of *total* CPU? Per-cpu granularity doesn't make anything more fair. Well, our current solution uses per-cpu weights, because our vendor couldn't get the load balancer working accurately enough. Having per-cpu weights and cpu affinity gives acceptable results for the case where we're currently using it. If the load balancer is good enough, per-system weights would be fine. It would have to play nicely with affinity though, in the case where it makes sense to lock tasks to particular cpus. If I have two threads with the same priority, and two CPUs, the scheduler will put one on each CPU, and they'll run happily without any migration or balancing. Sure. Now add a third thread. How often do you migrate? Put another way, over what time quantum do we ensure fairness? Chris - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
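The question has a concrete shape. With three equal threads on two CPUs, any schedule that is fair over an interval T must give each thread 2T/3 within that interval, so the "parked" role has to rotate through all three threads. The toy below only works out that arithmetic for a few candidate intervals; the interval values are arbitrary examples, not scheduler constants:

#include <stdio.h>

int main(void)
{
	const double intervals_ms[] = { 10, 100, 1000 };
	unsigned int i;

	for (i = 0; i < sizeof(intervals_ms) / sizeof(intervals_ms[0]); i++) {
		double T = intervals_ms[i];

		/*
		 * One of the 3 threads must be sitting out at every instant,
		 * and fairness forces that role to rotate within each interval.
		 */
		printf("T = %6.0f ms: each thread entitled to %6.1f ms; simplest schedule rotates the parked thread %.0f times/sec\n",
		       T, 2.0 * T / 3.0, 3000.0 / T);
	}
	return 0;
}

The shorter the fairness interval, the more often the rotation (and the cache-cooling migrations that come with it) has to happen, which is exactly the trade-off being argued about.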
Re: [RFC] scheduler: improve SMP fairness in CFS
On Tue, Jul 24, 2007 at 05:22:47PM -0400, Chris Snook wrote:
> Bill Huey (hui) wrote:
> Well, you need enough CPU time to meet your deadlines. You need pre-allocated memory, or to be able to guarantee that you can allocate memory fast enough to meet your deadlines. This principle extends to any other shared resource, such as disk or network. I'm being vague because it's open-ended. If a medical device fails to meet realtime guarantees because the battery fails, the patient's family isn't going to care how correct the software is. Realtime engineering is hard.
...
> Actually, it's worse than merely an open problem. A clairvoyant fair scheduler with perfect future knowledge can underperform a heuristic fair scheduler, because the heuristic scheduler can guess the future incorrectly resulting in unfair but higher-throughput behavior. This is a perfect example of why we only try to be as fair as is beneficial.

I'm glad we agree on the above points. :)

It might be that there needs to be another, stricter policy than what goes into SCHED_OTHER, in that we also need a SCHED_ISO or something that has more strict rebalancing semantics for -rt applications, sort of a super SCHED_RR. That's definitely needed, and I don't see how the current CFS implementation can deal with this properly, even with numerical running averages, etc., at this time.

SCHED_FIFO is another issue, but this is actually more complicated than just per-CPU run queues in that it requires a global priority analysis. I don't see how CFS can deal with SCHED_FIFO efficiently without moving to a single run queue. This is kind of a complicated problem with a significant set of trade-offs to take into account (cpu binding, etc.).

>> Tong's previous trio patch is an attempt at resolving this using a generic grouping mechanism and some constructive discussion should come of it.
>
> Sure, but it seems to me to be largely orthogonal to this patch.

It's based on the same kinds of ideas that he's been experimenting with in Trio. I can't name a single other engineer that's posted to lkml recently who has quite the depth of experience in this area that he has. It would be nice to facilitate/incorporate some of his ideas, or get him to work on something to this end that's suitable for inclusion in some tree somewhere.

bill

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] scheduler: improve SMP fairness in CFS
Chris Friesen wrote: Chris Snook wrote: I don't think Chris's scenario has much bearing on your patch. What he wants is to have a task that will always be running, but can't monopolize either CPU. This is useful for certain realtime workloads, but as I've said before, realtime requires explicit resource allocation. I don't think this is very relevant to SCHED_FAIR balancing. I'm not actually using the scenario I described, its just sort of a worst-case load-balancing thought experiment. What we want to be able to do is to specify a fraction of each cpu for each task group. We don't want to have to affine tasks to particular cpus. A fraction of *each* CPU, or a fraction of *total* CPU? Per-cpu granularity doesn't make anything more fair. You've got a big bucket of MIPS you want to divide between certain groups, but it shouldn't make a difference which CPUs those MIPS come from, other than the fact that we try to minimize overhead induced by migration. This means that the load balancer must be group-aware, and must trigger a re-balance (possibly just for a particular group) as soon as the cpu allocation for that group is used up on a particular cpu. If I have two threads with the same priority, and two CPUs, the scheduler will put one on each CPU, and they'll run happily without any migration or balancing. It sounds like you're saying that every X milliseconds, you want both to expire, be forbidden from running on the current CPU for the next X milliseconds, and then migrated to the other CPU. There's no gain in fairness here, and there's a big drop in performance. I suggested local fairness as a means to achieve global fairness because it could reduce overhead, and by adding the margin of error at each level in the locality hierarchy, you can get an algorithm which naturally tolerates the level of unfairness beyond which it is impossible to optimize. Strict local fairness for its own sake doesn't accomplish anything that's better than global fairness. -- Chris - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
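One way to picture "adding the margin of error at each level in the locality hierarchy" is a per-level imbalance allowance that grows with how expensive a migration across that level is. Everything in the sketch below is an assumption made up for illustration (the level names, the relative costs, the 3% base figure, and the linear scaling rule); it is not scheduler code, it only shows how the tolerated unfairness would accumulate bottom-up:

#include <stdio.h>

struct level {
	const char *name;
	double rel_migration_cost;	/* assumed, relative to an SMT sibling */
};

static const struct level levels[] = {
	{ "SMT sibling",		1.0 },
	{ "core (shared cache)",	2.0 },
	{ "socket",			4.0 },
	{ "NUMA node",			8.0 },
};

int main(void)
{
	const double base_fuzz = 0.03;	/* tolerate 3% imbalance between SMT siblings */
	double cumulative = 0.0;
	unsigned int i;

	for (i = 0; i < sizeof(levels) / sizeof(levels[0]); i++) {
		/* allow more slack where a migration costs more */
		double allowed = base_fuzz * levels[i].rel_migration_cost;

		cumulative += allowed;
		printf("%-22s allow %4.1f%% imbalance, cumulative tolerance %5.1f%%\n",
		       levels[i].name, 100.0 * allowed, 100.0 * cumulative);
	}
	return 0;
}

The cumulative figure is the unfairness the whole machine would accept before forcing a migration across the most expensive boundary, which is the sense in which local fairness with per-level margins can still bound global unfairness.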
Re: [RFC] scheduler: improve SMP fairness in CFS
On Tue, Jul 24, 2007 at 04:39:47PM -0400, Chris Snook wrote:
> Chris Friesen wrote:
>> We currently use CKRM on an SMP machine, but the only way we can get away with it is because our main app is affined to one cpu and just about everything else is affined to the other.
>
> If you're not explicitly allocating resources, you're just low-latency, not truly realtime. Realtime requires guaranteed resources, so messing with affinities is a necessary evil.

You've mentioned this twice in this thread. If you're going to talk about this, you should characterize it more specifically, because resource allocation is a rather incomplete area in Linux. Rebalancing was still an open research problem the last time I looked.

Tong's previous Trio patch is an attempt at resolving this using a generic grouping mechanism, and some constructive discussion should come of it.

bill

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] scheduler: improve SMP fairness in CFS
Chris Snook wrote:

I don't think Chris's scenario has much bearing on your patch. What he wants is to have a task that will always be running, but can't monopolize either CPU. This is useful for certain realtime workloads, but as I've said before, realtime requires explicit resource allocation. I don't think this is very relevant to SCHED_FAIR balancing.

I'm not actually using the scenario I described; it's just sort of a worst-case load-balancing thought experiment.

What we want to be able to do is to specify a fraction of each cpu for each task group. We don't want to have to affine tasks to particular cpus. This means that the load balancer must be group-aware, and must trigger a re-balance (possibly just for a particular group) as soon as the cpu allocation for that group is used up on a particular cpu.

I haven't tried the latest CFS group scheduler patches, so they may provide the sort of accuracy we're looking for. I'll have to try and find some time to test them out.

Chris

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] scheduler: improve SMP fairness in CFS
Bill Huey (hui) wrote: On Tue, Jul 24, 2007 at 04:39:47PM -0400, Chris Snook wrote: Chris Friesen wrote: We currently use CKRM on an SMP machine, but the only way we can get away with it is because our main app is affined to one cpu and just about everything else is affined to the other. If you're not explicitly allocating resources, you're just low-latency, not truly realtime. Realtime requires guaranteed resources, so messing with affinities is a necessary evil. You've mentioned this twice in this thread. If you're going to talk about this you should characterize this more specifically because resource allocation is a rather incomplete area in the Linux. Well, you need enough CPU time to meet your deadlines. You need pre-allocated memory, or to be able to guarantee that you can allocate memory fast enough to meet your deadlines. This principle extends to any other shared resource, such as disk or network. I'm being vague because it's open-ended. If a medical device fails to meet realtime guarantees because the battery fails, the patient's family isn't going to care how correct the software is. Realtime engineering is hard. Rebalancing is still an open research problem the last time I looked. Actually, it's worse than merely an open problem. A clairvoyant fair scheduler with perfect future knowledge can underperform a heuristic fair scheduler, because the heuristic scheduler can guess the future incorrectly resulting in unfair but higher-throughput behavior. This is a perfect example of why we only try to be as fair as is beneficial. Tong's previous trio patch is an attempt at resolving this using a generic grouping mechanism and some constructive discussion should come of it. Sure, but it seems to me to be largely orthogonal to this patch. -- Chris - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] scheduler: improve SMP fairness in CFS
Chris Snook wrote: We have another SMP box that would benefit from group scheduling, but we can't use it because the load balancer is not nearly good enough. Which scheduler? Have you tried the CFS group scheduler patches? CKRM as well. Haven't tried the CFS group scheduler, as we're stuck on a way old kernel and I've been too busy to play with the new stuff. Chris - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] scheduler: improve SMP fairness in CFS
Li, Tong N wrote: On Tue, 2007-07-24 at 16:39 -0400, Chris Snook wrote: Divining the intentions of the administrator is an AI-complete problem and we're not going to try to solve that in the kernel. An intelligent administrator could also allocate 50% of each CPU to a resource group containing all the *other* processes. Then, when the other processes are scheduled out, your single task will run on whichever CPU is idle. This will very quickly equilibrate to the scheduling ping-pong you seem to want. The scheduler deliberately avoids this kind of migration by default because it hurts cache and TLB performance, so if you want to override this very sane default behavior, you're going to have to explicitly configure it yourself. Well, the admin wouldn't specifically ask for 50% of each CPU. He would just allocate 50% of total CPU time---it's up to the scheduler to fulfill that. If a task is entitled to one CPU, then it'll stay there and have no migration. Migration occurs only if there's overload, in which case I think you agree in your last email that the cache and TLB impact is not an issue (at least in SMP). I don't think Chris's scenario has much bearing on your patch. What he wants is to have a task that will always be running, but can't monopolize either CPU. This is useful for certain realtime workloads, but as I've said before, realtime requires explicit resource allocation. I don't think this is very relevant to SCHED_FAIR balancing. -- Chris - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] scheduler: improve SMP fairness in CFS
On Tue, 2007-07-24 at 16:39 -0400, Chris Snook wrote: > Divining the intentions of the administrator is an AI-complete problem and > we're > not going to try to solve that in the kernel. An intelligent administrator > could also allocate 50% of each CPU to a resource group containing all the > *other* processes. Then, when the other processes are scheduled out, your > single task will run on whichever CPU is idle. This will very quickly > equilibrate to the scheduling ping-pong you seem to want. The scheduler > deliberately avoids this kind of migration by default because it hurts cache > and > TLB performance, so if you want to override this very sane default behavior, > you're going to have to explicitly configure it yourself. > Well, the admin wouldn't specifically ask for 50% of each CPU. He would just allocate 50% of total CPU time---it's up to the scheduler to fulfill that. If a task is entitled to one CPU, then it'll stay there and have no migration. Migration occurs only if there's overload, in which case I think you agree in your last email that the cache and TLB impact is not an issue (at least in SMP). tong - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] scheduler: improve SMP fairness in CFS
Chris Friesen wrote: Chris Snook wrote: Concerns aside, I agree that fairness is important, and I'd really like to see a test case that demonstrates the problem. One place that might be useful is the case of fairness between resource groups, where the load balancer needs to consider each group separately. You mean like the CFS group scheduler patches? I don't see how this patch is related to that, besides working on top of it. Now it may be the case that trying to keep the load of each class within X% of the other cpus is sufficient, but it's not trivial. I agree. My suggestion is that we try being fair from the bottom-up, rather than top-down. If most of the rebalancing is local, we can minimize expensive locking and cross-node migrations, and scale very nicely on large NUMA boxes. Consider the case where you have a resource group that is allocated 50% of each cpu in a dual cpu system, and only have a single task in that group. This means that in order to make use of the full group allocation, that task needs to be load-balanced to the other cpu as soon as it gets scheduled out. Most load-balancers can't handle that kind of granularity, but I have guys in our engineering team that would really like this level of performance. Divining the intentions of the administrator is an AI-complete problem and we're not going to try to solve that in the kernel. An intelligent administrator could also allocate 50% of each CPU to a resource group containing all the *other* processes. Then, when the other processes are scheduled out, your single task will run on whichever CPU is idle. This will very quickly equilibrate to the scheduling ping-pong you seem to want. The scheduler deliberately avoids this kind of migration by default because it hurts cache and TLB performance, so if you want to override this very sane default behavior, you're going to have to explicitly configure it yourself. We currently use CKRM on an SMP machine, but the only way we can get away with it is because our main app is affined to one cpu and just about everything else is affined to the other. If you're not explicitly allocating resources, you're just low-latency, not truly realtime. Realtime requires guaranteed resources, so messing with affinities is a necessary evil. We have another SMP box that would benefit from group scheduling, but we can't use it because the load balancer is not nearly good enough. Which scheduler? Have you tried the CFS group scheduler patches? -- Chris - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] scheduler: improve SMP fairness in CFS
Chris Snook wrote: Concerns aside, I agree that fairness is important, and I'd really like to see a test case that demonstrates the problem. One place that might be useful is the case of fairness between resource groups, where the load balancer needs to consider each group separately. Now it may be the case that trying to keep the load of each class within X% of the other cpus is sufficient, but it's not trivial. Consider the case where you have a resource group that is allocated 50% of each cpu in a dual cpu system, and only have a single task in that group. This means that in order to make use of the full group allocation, that task needs to be load-balanced to the other cpu as soon as it gets scheduled out. Most load-balancers can't handle that kind of granularity, but I have guys in our engineering team that would really like this level of performance. We currently use CKRM on an SMP machine, but the only way we can get away with it is because our main app is affined to one cpu and just about everything else is affined to the other. We have another SMP box that would benefit from group scheduling, but we can't use it because the load balancer is not nearly good enough. Chris - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] scheduler: improve SMP fairness in CFS
Tong Li wrote: On Mon, 23 Jul 2007, Chris Snook wrote: This patch is massive overkill. Maybe you're not seeing the overhead on your 8-way box, but I bet we'd see it on a 4096-way NUMA box with a partially-RT workload. Do you have any data justifying the need for this patch? Doing anything globally is expensive, and should be avoided at all costs. The scheduler already rebalances when a CPU is idle, so you're really just rebalancing the overload here. On a server workload, we don't necessarily want to do that, since the overload may be multiple threads spawned to service a single request, and could be sharing a lot of data. Instead of an explicit system-wide fairness invariant (which will get very hard to enforce when you throw SCHED_FIFO processes into the mix and the scheduler isn't running on some CPUs), try a simpler invariant. If we guarantee that the load on CPU X does not differ from the load on CPU (X+1)%N by more than some small constant, then we know that the system is fairly balanced. We can achieve global fairness with local balancing, and avoid all this overhead. This has the added advantage of keeping most of the migrations core/socket/node-local on SMT/multicore/NUMA systems. Chris, These are all good comments. Thanks. I see three concerns and I'll try to address each. 1. Unjustified effort/cost My view is that fairness (or proportional fairness) is a first-order metric and necessary in many cases even at the cost of performance. In the cases where it's critical, we have realtime. In the cases where it's important, this implementation won't keep latency low enough to make people happier. If you've got a test case to prove me wrong, I'd like to see it. A server running multiple client apps certainly doesn't want the clients to see that they are getting different amounts of service, assuming the clients are of equal importance (priority). A conventional server receives client requests, does a brief amount of work, and then gives a response. This patch doesn't help that workload. This patch helps the case where you've got batch jobs running on a slightly overloaded compute server, and unfairness means you end up waiting for a couple threads to finish at the end while CPUs sit idle. I don't think it's that big of a problem, and if it is, I think we can solve it in a more elegant way than reintroducing expired queues. When the clients have different priorities, the server also wants to give them service time proportional to their priority/weight. The same is true for desktops, where users want to nice tasks and see an effect that's consistent with what they expect, i.e., task CPU time should be proportional to their nice values. The point is that it's important to enforce fairness because it enables users to control the system in a deterministic way and it helps each task get good response time. CFS achieves this on local CPUs and this patch makes the support stronger for SMPs. It's overkill to enforce unnecessary degree of fairness, but it is necessary to enforce an error bound, even if large, such that the user can reliably know what kind of CPU time (even performance) he'd get after making a nice value change. Doesn't CFS already do this? This patch ensures an error bound of (max task weight currently in system) * sysctl_base_round_slice compared to an idealized fair system. The thing that bugs me about this is the diminishing returns. It looks like it will only give a substantial benefit when system load is somewhere between 1.0 and 2.0. 
On a heavily-loaded system, CFS will do the right thing within a good margin of error, and on an underloaded system, even a naive scheduler will do the right thing. If you want to optimize SMP fairness in this range, that's great, but there's probably a lighter-weight way to do it.

2. High performance overhead

Two sources of overhead: (1) the global rw_lock, and (2) task migrations. I agree they can be problems on NUMA, but I'd argue they are not on SMPs. Any global lock can cause two performance problems: (1) serialization, and (2) excessive remote cache accesses and traffic. IMO (1) is not a problem since this is a rw_lock and a write_lock occurs infrequently only when all tasks in the system finish the current round. (2) could be a problem as every read/write lock causes an invalidation. It could be improved by using Nick's ticket lock. On the other hand, this is a single cache line and it's invalidated only when a CPU finishes all tasks in its local active RB tree, where each nice 0 task takes sysctl_base_round_slice (e.g., 30ms) to finish, so it looks to me the invalidations would be infrequent enough and could be noise in the whole system.

Task migrations don't bother me all that much. Since we're migrating the *overload*, I expect those processes to be fairly cache-cold whenever we get around to them anyway. It'd be nice to be SMT/multicore/NUMA-aware about which tasks we pick, though.
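The bound being traded back and forth here is easy to put numbers on: the claimed worst-case deviation from an idealized fair system is (max task weight) * sysctl_base_round_slice. In the snippet below the weight is taken in nice-0 units and 30 ms is used only because it is the example value mentioned in the thread; both are illustrative assumptions:

#include <stdio.h>

int main(void)
{
	const double base_round_slice_ms = 30.0;		/* example value from the discussion */
	const double max_weight[] = { 1.0, 2.0, 10.0 };		/* heaviest task, in nice-0 units */
	unsigned int i;

	for (i = 0; i < sizeof(max_weight) / sizeof(max_weight[0]); i++)
		printf("heaviest task weight %5.1f -> worst-case lag vs. ideal: %6.1f ms\n",
		       max_weight[i], max_weight[i] * base_round_slice_ms);
	return 0;
}

The notable property is that the bound depends only on the heaviest weight and the knob, so it can be tightened or loosened by the administrator without re-deriving anything about the workload.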
Re: [RFC] scheduler: improve SMP fairness in CFS
On Tue, 2007-07-24 at 04:07 -0400, Chris Snook wrote: > To clarify, I'm not suggesting that the "balance with cpu (x+1)%n only" > algorithm is the only way to do this. Rather, I'm pointing out that > even an extremely simple algorithm can give you fair loading when you > already have CFS managing the runqueues. There are countless more > sophisticated ways we could do this without using global locking, or > possibly without any locking at all, other than the locking we already > use during migration. > > -- Chris Yes, as Andi and Chris also pointed out, I'll think about if global synchronization can be removed or relaxed. Thanks, tong - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] scheduler: improve SMP fairness in CFS
On Mon, 23 Jul 2007, Chris Snook wrote: This patch is massive overkill. Maybe you're not seeing the overhead on your 8-way box, but I bet we'd see it on a 4096-way NUMA box with a partially-RT workload. Do you have any data justifying the need for this patch? Doing anything globally is expensive, and should be avoided at all costs. The scheduler already rebalances when a CPU is idle, so you're really just rebalancing the overload here. On a server workload, we don't necessarily want to do that, since the overload may be multiple threads spawned to service a single request, and could be sharing a lot of data. Instead of an explicit system-wide fairness invariant (which will get very hard to enforce when you throw SCHED_FIFO processes into the mix and the scheduler isn't running on some CPUs), try a simpler invariant. If we guarantee that the load on CPU X does not differ from the load on CPU (X+1)%N by more than some small constant, then we know that the system is fairly balanced. We can achieve global fairness with local balancing, and avoid all this overhead. This has the added advantage of keeping most of the migrations core/socket/node-local on SMT/multicore/NUMA systems. Chris, These are all good comments. Thanks. I see three concerns and I'll try to address each. 1. Unjustified effort/cost My view is that fairness (or proportional fairness) is a first-order metric and necessary in many cases even at the cost of performance. A server running multiple client apps certainly doesn't want the clients to see that they are getting different amounts of service, assuming the clients are of equal importance (priority). When the clients have different priorities, the server also wants to give them service time proportional to their priority/weight. The same is true for desktops, where users want to nice tasks and see an effect that's consistent with what they expect, i.e., task CPU time should be proportional to their nice values. The point is that it's important to enforce fairness because it enables users to control the system in a deterministic way and it helps each task get good response time. CFS achieves this on local CPUs and this patch makes the support stronger for SMPs. It's overkill to enforce unnecessary degree of fairness, but it is necessary to enforce an error bound, even if large, such that the user can reliably know what kind of CPU time (even performance) he'd get after making a nice value change. This patch ensures an error bound of (max task weight currently in system) * sysctl_base_round_slice compared to an idealized fair system. 2. High performance overhead Two sources of overhead: (1) the global rw_lock, and (2) task migrations. I agree they can be problems on NUMA, but I'd argue they are not on SMPs. Any global lock can cause two performance problems: (1) serialization, and (2) excessive remote cache accesses and traffic. IMO (1) is not a problem since this is a rw_lock and a write_lock occurs infrequently only when all tasks in the system finish the current round. (2) could be a problem as every read/write lock causes an invalidation. It could be improved by using Nick's ticket lock. On the other hand, this is a single cache line and it's invalidated only when a CPU finishes all tasks in its local active RB tree, where each nice 0 task takes sysctl_base_round_slice (e.g., 30ms) to finish, so it looks to me the invalidations would be infrequent enough and could be noise in the whole system. The patch can introduce more task migrations. 
I don't think it's a problem in SMPs. For one, CFS is already doing context switches more frequently than before, and thus even if a task doesn't migrate, it may still miss in the local cache because the previous task kicked out its data. In this sense, migration doesn't add more cache misses. Now, in NUMA, the penalty of a cache miss can be much higher if we migrate off-node. Here, I agree that this can be a problem. For NUMA, I'd expect the user to tune sysctl_base_round_slice to a large enough value to avoid frequent migrations (or just affinitize tasks). Benchmarking could also help us determine a reasonable default sysctl_base_round_slice for NUMA.

3. Hard to enforce fairness when there are non-SCHED_FAIR tasks

All we want is to enforce fairness among the SCHED_FAIR tasks. Leveraging CFS, this patch makes sure the relative CPU time of two SCHED_FAIR tasks in an SMP equals their weight ratio. In other words, if the SCHED_FAIR tasks as a whole are given X amount of CPU time, then a task with weight w is guaranteed w/W of X (where W is the total weight of all SCHED_FAIR tasks) in any interval of time.

tong

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
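The guarantee as stated above is straightforward to check from the outside: over a measurement interval, compare each SCHED_FAIR task's CPU time against its proportional share w/W of the total X the class received. The program below is a verification sketch with made-up sample measurements, not part of the patch:

#include <stdio.h>

struct sample {
	const char *name;
	double weight;		/* load weight, in nice-0 units */
	double got_ms;		/* CPU time observed over the interval */
};

int main(void)
{
	/* made-up measurements over one interval on an SMP box */
	struct sample s[] = {
		{ "task A (weight 1)", 1.0, 1480.0 },
		{ "task B (weight 1)", 1.0, 1520.0 },
		{ "task C (weight 2)", 2.0, 3000.0 },
	};
	const int n = sizeof(s) / sizeof(s[0]);
	double W = 0.0, X = 0.0;
	int i;

	for (i = 0; i < n; i++) {
		W += s[i].weight;
		X += s[i].got_ms;	/* X = total CPU time the class received */
	}
	for (i = 0; i < n; i++) {
		double ideal = s[i].weight / W * X;

		printf("%s: got %6.1f ms, ideal %6.1f ms, error %+6.1f ms\n",
		       s[i].name, s[i].got_ms, ideal, s[i].got_ms - ideal);
	}
	return 0;
}

Under the stated bound, the per-task error column should never exceed (max weight) * sysctl_base_round_slice, no matter how the interval is chosen.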
Re: [RFC] scheduler: improve SMP fairness in CFS
On Mon, Jul 23, 2007 at 03:25:12PM -0600, Chris Friesen wrote: > Li, Tong N wrote: > >On the other hand, if locking does > >become a problem for certain systems/workloads, increasing > >sysctl_base_round_slice can reduce the locking frequency and alleviate > >the problem, at the cost of being relatively less fair across the CPUs. > > If locking does become a problem, it may make sense to switch instead to > a lockless algorithm of some sort... That usually has the same issue with cache lines. It would be better to analyze first if such global synchronization is really needed. I hope not. -Andi - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] scheduler: improve SMP fairness in CFS
Chris Snook wrote: Tong Li wrote: This patch extends CFS to achieve better fairness for SMPs. For example, with 10 tasks (same priority) on 8 CPUs, it enables each task to receive equal CPU time (80%). The code works on top of CFS and provides SMP fairness at a coarser time grainularity; local on each CPU, it relies on CFS to provide fine-grained fairness and good interactivity. The code is based on the distributed weighted round-robin (DWRR) algorithm. It keeps two RB trees on each CPU: one is the original cfs_rq, referred to as active, and one is a new cfs_rq, called round-expired. Each CPU keeps a round number, initially zero. The scheduler works exactly the same way as in CFS, but only runs tasks from the active tree. Each task is assigned a round slice, equal to its weight times a system constant (e.g., 100ms), controlled by sysctl_base_round_slice. When a task uses up its round slice, it moves to the round-expired tree on the same CPU and stops running. Thus, at any time on each CPU, the active tree contains all tasks that are running in the current round, while tasks in round-expired have all finished the current round and await to start the next round. When an active tree becomes empty, it calls idle_balance() to grab tasks of the same round from other CPUs. If none can be moved over, it switches its active and round-expired trees, thus unleashing round-expired tasks and advancing the local round number by one. An invariant it maintains is that the round numbers of any two CPUs in the system differ by at most one. This property ensures fairness across CPUs. The variable sysctl_base_round_slice controls fairness-performance tradeoffs: a smaller value leads to better cross-CPU fairness at the potential cost of performance; on the other hand, the larger the value is, the closer the system behavior is to the default CFS without the patch. Any comments and suggestions would be highly appreciated. This patch is massive overkill. Maybe you're not seeing the overhead on your 8-way box, but I bet we'd see it on a 4096-way NUMA box with a partially-RT workload. Do you have any data justifying the need for this patch? Doing anything globally is expensive, and should be avoided at all costs. The scheduler already rebalances when a CPU is idle, so you're really just rebalancing the overload here. On a server workload, we don't necessarily want to do that, since the overload may be multiple threads spawned to service a single request, and could be sharing a lot of data. Instead of an explicit system-wide fairness invariant (which will get very hard to enforce when you throw SCHED_FIFO processes into the mix and the scheduler isn't running on some CPUs), try a simpler invariant. If we guarantee that the load on CPU X does not differ from the load on CPU (X+1)%N by more than some small constant, then we know that the system is fairly balanced. We can achieve global fairness with local balancing, and avoid all this overhead. This has the added advantage of keeping most of the migrations core/socket/node-local on SMT/multicore/NUMA systems. -- Chris To clarify, I'm not suggesting that the "balance with cpu (x+1)%n only" algorithm is the only way to do this. Rather, I'm pointing out that even an extremely simple algorithm can give you fair loading when you already have CFS managing the runqueues. There are countless more sophisticated ways we could do this without using global locking, or possibly without any locking at all, other than the locking we already use during migration. 
-- Chris - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] scheduler: improve SMP fairness in CFS
Tong Li wrote: This patch extends CFS to achieve better fairness for SMPs. For example, with 10 tasks (same priority) on 8 CPUs, it enables each task to receive equal CPU time (80%). The code works on top of CFS and provides SMP fairness at a coarser time grainularity; local on each CPU, it relies on CFS to provide fine-grained fairness and good interactivity. The code is based on the distributed weighted round-robin (DWRR) algorithm. It keeps two RB trees on each CPU: one is the original cfs_rq, referred to as active, and one is a new cfs_rq, called round-expired. Each CPU keeps a round number, initially zero. The scheduler works exactly the same way as in CFS, but only runs tasks from the active tree. Each task is assigned a round slice, equal to its weight times a system constant (e.g., 100ms), controlled by sysctl_base_round_slice. When a task uses up its round slice, it moves to the round-expired tree on the same CPU and stops running. Thus, at any time on each CPU, the active tree contains all tasks that are running in the current round, while tasks in round-expired have all finished the current round and await to start the next round. When an active tree becomes empty, it calls idle_balance() to grab tasks of the same round from other CPUs. If none can be moved over, it switches its active and round-expired trees, thus unleashing round-expired tasks and advancing the local round number by one. An invariant it maintains is that the round numbers of any two CPUs in the system differ by at most one. This property ensures fairness across CPUs. The variable sysctl_base_round_slice controls fairness-performance tradeoffs: a smaller value leads to better cross-CPU fairness at the potential cost of performance; on the other hand, the larger the value is, the closer the system behavior is to the default CFS without the patch. Any comments and suggestions would be highly appreciated. This patch is massive overkill. Maybe you're not seeing the overhead on your 8-way box, but I bet we'd see it on a 4096-way NUMA box with a partially-RT workload. Do you have any data justifying the need for this patch? Doing anything globally is expensive, and should be avoided at all costs. The scheduler already rebalances when a CPU is idle, so you're really just rebalancing the overload here. On a server workload, we don't necessarily want to do that, since the overload may be multiple threads spawned to service a single request, and could be sharing a lot of data. Instead of an explicit system-wide fairness invariant (which will get very hard to enforce when you throw SCHED_FIFO processes into the mix and the scheduler isn't running on some CPUs), try a simpler invariant. If we guarantee that the load on CPU X does not differ from the load on CPU (X+1)%N by more than some small constant, then we know that the system is fairly balanced. We can achieve global fairness with local balancing, and avoid all this overhead. This has the added advantage of keeping most of the migrations core/socket/node-local on SMT/multicore/NUMA systems. -- Chris - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
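The "only compare CPU X with CPU (X+1)%N" idea is simple enough to sketch in a few lines. The loads, the FUZZ constant, and the move-half-the-difference rule below are all made up for illustration and this is not scheduler code; the point is only that purely neighbour-local decisions still give a global bound:

#include <stdio.h>
#include <stdlib.h>

#define NCPUS	8
#define FUZZ	64	/* "some small constant" of load units; made up */

static long load[NCPUS] = { 900, 1024, 1024, 1100, 512, 1024, 980, 2048 };

/* one pass: every CPU x compares itself only with its ring neighbour (x+1)%N */
static int balance_pass(void)
{
	int x, moved = 0;

	for (x = 0; x < NCPUS; x++) {
		int y = (x + 1) % NCPUS;
		long diff = load[x] - load[y];

		if (labs(diff) > FUZZ) {
			load[x] -= diff / 2;	/* move half the difference */
			load[y] += diff / 2;
			moved++;
		}
	}
	return moved;
}

int main(void)
{
	int x, passes = 0;

	while (balance_pass() && passes < 100)
		passes++;

	printf("neighbour balancing settled after %d passes:", passes);
	for (x = 0; x < NCPUS; x++)
		printf(" %ld", load[x]);
	printf("\n");
	/*
	 * Once every adjacent pair differs by at most FUZZ, any two CPUs on
	 * the ring differ by at most (NCPUS - 1) * FUZZ, so local balancing
	 * alone still bounds global unfairness.
	 */
	return 0;
}

The (NCPUS - 1) * FUZZ figure is the "account for the margin of error" part of the argument: it grows with machine size, but it is a fixed, workload-independent bound rather than something that degrades with the number of tasks.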
Re: [RFC] scheduler: improve SMP fairness in CFS
Li, Tong N wrote: On the other hand, if locking does become a problem for certain systems/workloads, increasing sysctl_base_round_slice can reduce the locking frequency and alleviate the problem, at the cost of being relatively less fair across the CPUs. If locking does become a problem, it may make sense to switch instead to a lockless algorithm of some sort... Chris - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] scheduler: improve SMP fairness in CFS
I benchmarked an early version of this code (against 2.6.21) with SPECjbb, SPEComp, kernbench, etc. on an 8-processor system, and didn't see any slowdown compared to the stock scheduler. I'll generate the data again with this version of the code. On the other hand, if locking does become a problem for certain systems/workloads, increasing sysctl_base_round_slice can reduce the locking frequency and alleviate the problem, at the cost of being relatively less fair across the CPUs. tong On Mon, 2007-07-23 at 22:00 +0200, Andi Kleen wrote: > Tong Li <[EMAIL PROTECTED]> writes: > > + > > +read_lock_irqsave(&dwrr_highest_lock, flags); > > That lock looks like it would bounce between CPUs like crazy. > Did you try any benchmarks on a larger system? > > -Andi - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
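The claim that a larger sysctl_base_round_slice reduces the locking frequency follows from the description elsewhere in the thread: the shared round state is touched roughly once per local round, and a nice-0 task's round slice equals sysctl_base_round_slice. The helper below just turns that into events per second; the task counts and slice values are illustrative only:

#include <stdio.h>

static void rate(int tasks_per_cpu, double base_round_slice_ms)
{
	/* a CPU's local round lasts roughly tasks_per_cpu * base_round_slice_ms
	 * for nice-0 tasks, and the shared line is touched about once per round */
	double round_ms = tasks_per_cpu * base_round_slice_ms;

	printf("%2d tasks/CPU, base slice %5.0f ms -> local round every %6.0f ms (~%4.1f shared-line touches/sec per CPU)\n",
	       tasks_per_cpu, base_round_slice_ms, round_ms, 1000.0 / round_ms);
}

int main(void)
{
	rate(2, 30.0);
	rate(2, 300.0);
	rate(16, 30.0);
	return 0;
}

So either raising the knob or simply having more runnable tasks per CPU stretches the round and pushes the cache-line ping-pong into the noise, at the cost of a coarser fairness interval.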
Re: [RFC] scheduler: improve SMP fairness in CFS
Tong Li <[EMAIL PROTECTED]> writes: > + > +read_lock_irqsave(&dwrr_highest_lock, flags); That lock looks like it would bounce between CPUs like crazy. Did you try any benchmarks on a larger system? -Andi - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC] scheduler: improve SMP fairness in CFS
This patch extends CFS to achieve better fairness for SMPs. For example, with 10 tasks (same priority) on 8 CPUs, it enables each task to receive equal CPU time (80%). The code works on top of CFS and provides SMP fairness at a coarser time granularity; locally, on each CPU, it relies on CFS to provide fine-grained fairness and good interactivity.

The code is based on the distributed weighted round-robin (DWRR) algorithm. It keeps two RB trees on each CPU: one is the original cfs_rq, referred to as active, and one is a new cfs_rq, called round-expired. Each CPU keeps a round number, initially zero. The scheduler works exactly the same way as in CFS, but only runs tasks from the active tree. Each task is assigned a round slice, equal to its weight times a system constant (e.g., 100ms), controlled by sysctl_base_round_slice. When a task uses up its round slice, it moves to the round-expired tree on the same CPU and stops running. Thus, at any time on each CPU, the active tree contains all tasks that are running in the current round, while tasks in round-expired have all finished the current round and await the start of the next round. When an active tree becomes empty, it calls idle_balance() to grab tasks of the same round from other CPUs. If none can be moved over, it switches its active and round-expired trees, thus unleashing round-expired tasks and advancing the local round number by one. An invariant it maintains is that the round numbers of any two CPUs in the system differ by at most one. This property ensures fairness across CPUs. The variable sysctl_base_round_slice controls fairness-performance tradeoffs: a smaller value leads to better cross-CPU fairness at the potential cost of performance; on the other hand, the larger the value is, the closer the system behavior is to the default CFS without the patch.

Any comments and suggestions would be highly appreciated.

Thanks,

tong

Signed-off-by: Tong Li <[EMAIL PROTECTED]>
---
 include/linux/sched.h |    5
 kernel/sched.c        |  577 ++
 kernel/sched_debug.c  |    8
 kernel/sched_fair.c   |  124 --
 kernel/sysctl.c       |   13 +
 5 files changed, 604 insertions(+), 123 deletions(-)

diff -uprN linux-2.6.23-rc1/include/linux/sched.h linux-2.6.23-rc1-dwrr/include/linux/sched.h
--- linux-2.6.23-rc1/include/linux/sched.h	2007-07-22 18:42:28.0 -0700
+++ linux-2.6.23-rc1-dwrr/include/linux/sched.h	2007-07-22 18:52:34.0 -0700
@@ -887,7 +887,7 @@ struct sched_entity {
 	s64			fair_key;
 	struct load_weight	load;		/* for load-balancing */
 	struct rb_node		run_node;
-	unsigned int		on_rq;
+	struct cfs_rq		*on_rq;	/* NULL, active, or round-expired */
 
 	u64			wait_start_fair;
 	u64			wait_start;
@@ -906,6 +906,8 @@ struct sched_entity {
 	s64			sum_sleep_runtime;
 	unsigned long		wait_runtime_overruns;
 	unsigned long		wait_runtime_underruns;
+	/* How long it has run in the current round. */
+	u64			round_slice_used;
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	struct sched_entity	*parent;
 	/* rq on which this entity is (to be) queued: */
@@ -1382,6 +1384,7 @@ extern unsigned int sysctl_sched_stat_gr
 extern unsigned int sysctl_sched_runtime_limit;
 extern unsigned int sysctl_sched_child_runs_first;
 extern unsigned int sysctl_sched_features;
+extern u64 sysctl_sched_base_round_slice;
 
 #ifdef CONFIG_RT_MUTEXES
 extern int rt_mutex_getprio(struct task_struct *p);
diff -uprN linux-2.6.23-rc1/kernel/sched.c linux-2.6.23-rc1-dwrr/kernel/sched.c
--- linux-2.6.23-rc1/kernel/sched.c	2007-07-22 18:42:28.0 -0700
+++ linux-2.6.23-rc1-dwrr/kernel/sched.c	2007-07-22 23:05:54.0 -0700
@@ -184,17 +184,20 @@ struct cfs_rq {
 	u64 exec_clock;
 	s64 wait_runtime;
 	u64 sleeper_bonus;
+	u64 round;	/* round number of this cfs_rq */
 	unsigned long wait_runtime_overruns, wait_runtime_underruns;
+	u64 wait_start_fair, wait_start; /* Used to track wait times of
+					    round-expired tasks. */
 
 	struct rb_root tasks_timeline;
 	struct rb_node *rb_leftmost;
 	struct rb_node *rb_load_balance_curr;
+	struct rq *rq;	/* cpu runqueue to which this cfs_rq is attached */
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	/* 'curr' points to currently running entity on this cfs_rq.
 	 * It is set to NULL otherwise (i.e when none are currently running).
 	 */
 	struct sched_entity *curr;
-	struct rq *rq;	/* cpu runqueue to which this cfs_rq is attached */
 
 	/* leaf cfs_rqs are those that hold tasks (lowest schedulable entity in
 	 * a hierarchy). Non-leaf lrqs hold other higher schedulable entities
@@ -239,7 +24
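Since the patch itself is cut off above, here is a compact toy model of the round mechanism the changelog describes: per-CPU active and round-expired sets, per-task round slices, idle-time stealing within the same round, and round numbers kept within one of each other. It is plain userspace C with invented names, and the advance-only-from-the-lowest-round rule is a deliberately crude stand-in for the global round bookkeeping the real patch does under dwrr_highest_lock. Run for a simulated 3 seconds with 3 equal tasks on 2 CPUs, it prints roughly 2/3 of a CPU for each task:

#include <stdio.h>
#include <stdbool.h>
#include <limits.h>

#define NCPUS			2
#define NTASKS			3
#define BASE_ROUND_SLICE	100	/* ms per unit of weight, like sysctl_base_round_slice */
#define TICK			10	/* ms of service per scheduling decision */
#define RUNTIME			3000	/* ms of simulated wall-clock time */

struct task {
	int id, weight, cpu;	/* weight in nice-0 units */
	int used, ran;		/* round slice used this round; total CPU time */
	bool expired;		/* true = sitting in "round-expired" */
};

static struct task tasks[NTASKS] = {
	{ 0, 1, 0 }, { 1, 1, 0 }, { 2, 1, 1 },
};
static int round_nr[NCPUS];

/* stands in for CFS picking something from the active tree */
static struct task *pick_active(int cpu)
{
	int i;

	for (i = 0; i < NTASKS; i++)
		if (tasks[i].cpu == cpu && !tasks[i].expired)
			return &tasks[i];
	return NULL;
}

/* idle_balance(): grab a task of the same round from another CPU */
static struct task *steal_same_round(int cpu)
{
	int i;

	for (i = 0; i < NTASKS; i++)
		if (tasks[i].cpu != cpu && !tasks[i].expired &&
		    round_nr[tasks[i].cpu] == round_nr[cpu]) {
			tasks[i].cpu = cpu;
			return &tasks[i];
		}
	return NULL;
}

static int min_round_of_busy_cpus(void)
{
	int c, i, m = INT_MAX;

	for (c = 0; c < NCPUS; c++)
		for (i = 0; i < NTASKS; i++)
			if (tasks[i].cpu == c && round_nr[c] < m)
				m = round_nr[c];
	return m;
}

int main(void)
{
	int t, cpu, i;

	for (t = 0; t < RUNTIME; t += TICK) {
		for (cpu = 0; cpu < NCPUS; cpu++) {
			struct task *p = pick_active(cpu);

			if (!p)
				p = steal_same_round(cpu);
			if (!p) {
				/* Nothing runnable in this round anywhere.  Only a CPU at
				 * the lowest round may switch its trees and advance, which
				 * keeps all round numbers within one of each other. */
				if (round_nr[cpu] > min_round_of_busy_cpus())
					continue;
				round_nr[cpu]++;
				for (i = 0; i < NTASKS; i++)
					if (tasks[i].cpu == cpu) {
						tasks[i].expired = false;
						tasks[i].used = 0;
					}
				p = pick_active(cpu);
				if (!p)
					continue;	/* this CPU really has no tasks */
			}
			p->ran += TICK;
			p->used += TICK;
			if (p->used >= p->weight * BASE_ROUND_SLICE)
				p->expired = true;	/* move to round-expired */
		}
	}
	for (i = 0; i < NTASKS; i++)
		printf("task %d: %d ms of CPU (%.0f%% of one CPU)\n",
		       tasks[i].id, tasks[i].ran, 100.0 * tasks[i].ran / RUNTIME);
	return 0;
}

Because every task receives exactly its weight's worth of round slice per round, and no CPU can get more than one round ahead, long-run shares converge regardless of how the tasks happen to be spread across CPUs; the cost, visible even in this toy, is that a CPU whose round has finished may have to steal or sit idle rather than run ahead.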