Re: lmbench ctxsw regression with CFS
On Tue, Aug 14, 2007 at 05:23:00AM +0200, Nick Piggin wrote:
> On Mon, Aug 13, 2007 at 08:00:38PM -0700, Andrew Morton wrote:
> > Put it this way: if a 50% slowdown in context switch times yields a 5%
> > improvement in, say, balancing decisions then it's probably a net win.
> >
> > Guys, repeat after me: "context switch is not a fast path". Take that
> > benchmark and set fire to it.
>
> It definitely can be. For workloads that are inherently asynchronous,
> such as high speed networking or disk IO (i.e. with event generation
> significantly outside the control of the kernel or app), it can be a
> fast path. Sure, you may just be switching between the main working
> thread and the idle thread, but in that case a slowdown in the scheduler
> will be _more_ pronounced, because you don't have to do as much work to
> actually switch contexts.
>
> If there were a performance tradeoff involved, then we could think about
> it, and you might be right. But this is just a case of "write code to do
> direct calls or do indirect calls".
>
> Ken Chen's last ia64 database benchmark I could find says schedule()
> takes 6.5% of the clock cycles, the second highest consumer. Considering
> the lengths he went to in order to shave cycles off other paths, I'd
> call schedule() a fastpath. It would be really interesting to rerun that
> benchmark with CFS. Is anyone at Intel still doing those tests?

Yes. schedule() is still in the top 2-3 consumers of kernel time for
that workload. We ran some tests when CFS was in its early days (v2 or
so, I think) and it didn't show any regression. We have plans to run
that workload with 2.6.23-rc kernels, but other things have been taking
priority so far...

thanks,
suresh
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: lmbench ctxsw regression with CFS
From: Andrew Morton <[EMAIL PROTECTED]>
Date: Mon, 13 Aug 2007 20:00:38 -0700

> Guys, repeat after me: "context switch is not a fast path". Take
> that benchmark and set fire to it.

Nothing in this world is so absolute :-)

Regardless of the value of lat_ctx, we should thank it for showing that
something is more expensive now. It definitely warrants figuring out
which paths are the big cycle eaters now, and why.

Not "on IA64 indirect calls are expensive, so that must be it", but
rather a cycle analysis of the relevant scheduling functions to figure
out what might be wrong.

I'm willing to bet it's something completely trivial and easy to amend,
rather than a fundamental issue. But somebody has to _look_ instead of
supposing this and that.
Re: lmbench ctxsw regression with CFS
On Mon, Aug 13, 2007 at 08:00:38PM -0700, Andrew Morton wrote:
> On Mon, 13 Aug 2007 14:30:31 +0200 Jens Axboe <[EMAIL PROTECTED]> wrote:
> > On Mon, Aug 06 2007, Nick Piggin wrote:
> > > > > What CPU did you get these numbers on? Do the indirect calls
> > > > > hurt much on those without an indirect predictor? (I'll try
> > > > > running some tests.)
> > > >
> > > > it was on an older Athlon64 X2. I never saw indirect calls really
> > > > hurting on modern x86 CPUs - don't both CPU makers optimize them
> > > > pretty efficiently? (as long as the target function is always the
> > > > same - which it is here.)
> > >
> > > I think a lot of CPUs do. I think ia64 does not. It predicts based
> > > on the contents of a branch target register which has to be loaded,
> > > I presume, before instruction fetch reaches the branch. I don't
> > > know if this would hurt or not.
> >
> > Testing on ia64 showed that the indirect calls in the io scheduler
> > hurt quite a bit, so I'd be surprised if the impact here wasn't an
> > issue there.
>
> With what workload? lmbench ctxsw? Who cares?
>
> Look, if you're doing 100,000 context switches per second, then *that*
> is your problem. You suck, and making context switches a bit faster
> doesn't stop you from sucking. And ten microseconds is a very long
> time indeed.
>
> Put it this way: if a 50% slowdown in context switch times yields a 5%
> improvement in, say, balancing decisions then it's probably a net win.
>
> Guys, repeat after me: "context switch is not a fast path". Take that
> benchmark and set fire to it.

It definitely can be. For workloads that are inherently asynchronous,
such as high speed networking or disk IO (i.e. with event generation
significantly outside the control of the kernel or app), it can be a
fast path. Sure, you may just be switching between the main working
thread and the idle thread, but in that case a slowdown in the scheduler
will be _more_ pronounced, because you don't have to do as much work to
actually switch contexts.

If there were a performance tradeoff involved, then we could think about
it, and you might be right. But this is just a case of "write code to do
direct calls or do indirect calls".

Ken Chen's last ia64 database benchmark I could find says schedule()
takes 6.5% of the clock cycles, the second highest consumer. Considering
the lengths he went to in order to shave cycles off other paths, I'd
call schedule() a fastpath. It would be really interesting to rerun that
benchmark with CFS. Is anyone at Intel still doing those tests?
Re: lmbench ctxsw regression with CFS
On Mon, 13 Aug 2007 14:30:31 +0200 Jens Axboe <[EMAIL PROTECTED]> wrote:
> On Mon, Aug 06 2007, Nick Piggin wrote:
> > > > What CPU did you get these numbers on? Do the indirect calls hurt
> > > > much on those without an indirect predictor? (I'll try running
> > > > some tests.)
> > >
> > > it was on an older Athlon64 X2. I never saw indirect calls really
> > > hurting on modern x86 CPUs - don't both CPU makers optimize them
> > > pretty efficiently? (as long as the target function is always the
> > > same - which it is here.)
> >
> > I think a lot of CPUs do. I think ia64 does not. It predicts based on
> > the contents of a branch target register which has to be loaded, I
> > presume, before instruction fetch reaches the branch. I don't know if
> > this would hurt or not.
>
> Testing on ia64 showed that the indirect calls in the io scheduler hurt
> quite a bit, so I'd be surprised if the impact here wasn't an issue
> there.

With what workload? lmbench ctxsw? Who cares?

Look, if you're doing 100,000 context switches per second, then *that*
is your problem. You suck, and making context switches a bit faster
doesn't stop you from sucking. And ten microseconds is a very long time
indeed.

Put it this way: if a 50% slowdown in context switch times yields a 5%
improvement in, say, balancing decisions then it's probably a net win.

Guys, repeat after me: "context switch is not a fast path". Take that
benchmark and set fire to it.
Re: lmbench ctxsw regression with CFS
On Mon, Aug 06 2007, Nick Piggin wrote:
> > > What CPU did you get these numbers on? Do the indirect calls hurt
> > > much on those without an indirect predictor? (I'll try running some
> > > tests.)
> >
> > it was on an older Athlon64 X2. I never saw indirect calls really
> > hurting on modern x86 CPUs - don't both CPU makers optimize them
> > pretty efficiently? (as long as the target function is always the
> > same - which it is here.)
>
> I think a lot of CPUs do. I think ia64 does not. It predicts based on
> the contents of a branch target register which has to be loaded, I
> presume, before instruction fetch reaches the branch. I don't know if
> this would hurt or not.

Testing on ia64 showed that the indirect calls in the io scheduler hurt
quite a bit, so I'd be surprised if the impact here wasn't an issue
there.

--
Jens Axboe
Re: lmbench ctxsw regression with CFS
On Sat, Aug 04, 2007 at 08:50:37AM +0200, Ingo Molnar wrote:
>
> * Nick Piggin <[EMAIL PROTECTED]> wrote:
>
> > Oh good. Thanks for getting to the bottom of it. We have normally
> > disliked too many runtime tunables in the scheduler, so I assume these
> > are mostly going away or under a CONFIG option for 2.6.23? Or...?
>
> yeah, they are all already under CONFIG_SCHED_DEBUG. (it's just that the
> add-on optimization is not upstream yet - the tunings are still being

Ah, OK. So long as that goes upstream I'm happy... and it is good to see
that with that patch, the base context switching performance _has_
actually gone up like I had hoped. Nice.

> tested) Btw., with SCHED_DEBUG we now also have your domain-tree sysctl
> patch upstream, which has been in -mm for a near eternity.
>
> > What CPU did you get these numbers on? Do the indirect calls hurt much
> > on those without an indirect predictor? (I'll try running some tests.)
>
> it was on an older Athlon64 X2. I never saw indirect calls really
> hurting on modern x86 CPUs - don't both CPU makers optimize them pretty
> efficiently? (as long as the target function is always the same - which
> it is here.)

I think a lot of CPUs do. I think ia64 does not. It predicts based on
the contents of a branch target register which has to be loaded, I
presume, before instruction fetch reaches the branch. I don't know if
this would hurt or not.

> > I must say that I don't really like the indirect calls a great deal,
> > and they could be eliminated just with a couple of branches and direct
> > calls.
>
> yeah - i'll try that too. We can make the indirect call the uncommon
> case and a NULL pointer be the common case, combined with a 'default',
> direct function call. But i doubt it makes a big (or even measurable)
> difference.

You might be right there.
Re: lmbench ctxsw regression with CFS
* Nick Piggin <[EMAIL PROTECTED]> wrote:

> Oh good. Thanks for getting to the bottom of it. We have normally
> disliked too many runtime tunables in the scheduler, so I assume these
> are mostly going away or under a CONFIG option for 2.6.23? Or...?

yeah, they are all already under CONFIG_SCHED_DEBUG. (it's just that the
add-on optimization is not upstream yet - the tunings are still being
tested) Btw., with SCHED_DEBUG we now also have your domain-tree sysctl
patch upstream, which has been in -mm for a near eternity.

> What CPU did you get these numbers on? Do the indirect calls hurt much
> on those without an indirect predictor? (I'll try running some tests.)

it was on an older Athlon64 X2. I never saw indirect calls really
hurting on modern x86 CPUs - don't both CPU makers optimize them pretty
efficiently? (as long as the target function is always the same - which
it is here.)

> I must say that I don't really like the indirect calls a great deal,
> and they could be eliminated just with a couple of branches and direct
> calls.

yeah - i'll try that too. We can make the indirect call the uncommon
case and a NULL pointer be the common case, combined with a 'default',
direct function call. But i doubt it makes a big (or even measurable)
difference.

	Ingo
Re: lmbench ctxsw regression with CFS
On Thu, Aug 02, 2007 at 05:44:47PM +0200, Ingo Molnar wrote:
>
> * Nick Piggin <[EMAIL PROTECTED]> wrote:
>
> > > > > One thing to check out is whether the lmbench numbers are
> > > > > "correct". Especially on SMP systems, the lmbench numbers are
> > > > > actually *best* when the two processes run on the same CPU, even
> > > > > though that's not really at all the best scheduling - it's just
> > > > > that it artificially improves lmbench numbers because of the
> > > > > close cache affinity for the pipe data structures.
> > > >
> > > > Yes, I bound them to a single core.
> > >
> > > could you send me the .config you used?
> >
> > Sure, attached...
> >
> > You don't see a regression? If not, then can you send me the .config
> > you used? [...]
>
> i used your config to get a few numbers and to see what happens. Here's
> the numbers of 10 consecutive "lat_ctx -s 0 2" runs:
>
>   [ time in micro-seconds, smaller is better ]
>
>     v2.6.22    v2.6.23-git    v2.6.23-git+const-param
>     -------    -----------    -----------------------
>      1.30         1.60            1.19
>      1.30         1.36            1.18
>      1.14         1.50            1.01
>      1.26         1.27            1.23
>      1.22         1.40            1.04
>      1.13         1.34            1.09
>      1.27         1.39            1.05
>      1.20         1.30            1.16
>      1.20         1.17            1.16
>      1.25         1.33            1.01
>     ------------------------------------------------
>     avg: 1.22     1.36 (+11.3%)   1.11 (-10.3%)
>     min: 1.13     1.17 ( +3.5%)   1.01 (-11.8%)
>     max: 1.27     1.60 (+26.0%)   1.23 ( -3.2%)
>
> one reason for the extra overhead is the current tunability of CFS, but
> that is not fundamental: it's caused by the many knobs that CFS has at
> the moment. The const-tuning patch (attached below, results in the
> rightmost column) changes those knobs to constants, allowing the
> compiler to optimize the math better and reduce code size. (the code
> movement in the patch makes up for most of its size; the change it
> makes is simple otherwise.) [...]

Oh good. Thanks for getting to the bottom of it. We have normally
disliked too many runtime tunables in the scheduler, so I assume these
are mostly going away or under a CONFIG option for 2.6.23? Or...?

What CPU did you get these numbers on? Do the indirect calls hurt much
on those without an indirect predictor? (I'll try running some tests.)

I must say that I don't really like the indirect calls a great deal,
and they could be eliminated just with a couple of branches and direct
calls.
Re: lmbench ctxsw regression with CFS
* Nick Piggin <[EMAIL PROTECTED]> wrote:

> > > > One thing to check out is whether the lmbench numbers are
> > > > "correct". Especially on SMP systems, the lmbench numbers are
> > > > actually *best* when the two processes run on the same CPU, even
> > > > though that's not really at all the best scheduling - it's just
> > > > that it artificially improves lmbench numbers because of the
> > > > close cache affinity for the pipe data structures.
> > >
> > > Yes, I bound them to a single core.
> >
> > could you send me the .config you used?
>
> Sure, attached...
>
> You don't see a regression? If not, then can you send me the .config
> you used? [...]

i used your config to get a few numbers and to see what happens. Here's
the numbers of 10 consecutive "lat_ctx -s 0 2" runs:

  [ time in micro-seconds, smaller is better ]

    v2.6.22    v2.6.23-git    v2.6.23-git+const-param
    -------    -----------    -----------------------
     1.30         1.60            1.19
     1.30         1.36            1.18
     1.14         1.50            1.01
     1.26         1.27            1.23
     1.22         1.40            1.04
     1.13         1.34            1.09
     1.27         1.39            1.05
     1.20         1.30            1.16
     1.20         1.17            1.16
     1.25         1.33            1.01
    ------------------------------------------------
    avg: 1.22     1.36 (+11.3%)   1.11 (-10.3%)
    min: 1.13     1.17 ( +3.5%)   1.01 (-11.8%)
    max: 1.27     1.60 (+26.0%)   1.23 ( -3.2%)

one reason for the extra overhead is the current tunability of CFS, but
that is not fundamental: it's caused by the many knobs that CFS has at
the moment. The const-tuning patch (attached below, results in the
rightmost column) changes those knobs to constants, allowing the
compiler to optimize the math better and reduce code size. (the code
movement in the patch makes up for most of its size; the change it makes
is simple otherwise.)

so CFS can be faster at micro-context-switching than 2.6.22. But at this
point i'd also like to warn against putting _too_ much emphasis on
lat_ctx numbers in general. lat_ctx prints a 'derived' micro-benchmark
number: it uses a pair of pipes to context-switch between tasks, but
only prints the delta overhead that context-switching causes.

The 'full' latency of the pipe operations can be seen via the following
pipe-test.c code:

  http://redhat.com/~mingo/cfs-scheduler/tools/pipe-test.c

run it to see the full cost:

  neptune:~> ./pipe-test
  4.67 usecs/loop.
  4.41 usecs/loop.
  4.46 usecs/loop.
  4.46 usecs/loop.
  4.44 usecs/loop.
  4.41 usecs/loop.

so the _full_ cost of even this micro-benchmark is 4-5 microseconds, not
1 microsecond. So even this artificial micro-benchmark sees an actual
slowdown of only 2.8%.

if you check a macro-benchmark like "hackbench 50":

  [ time in seconds, smaller is better ]

    v2.6.22    v2.6.23-cfs
    -------    -----------
     3.019       2.842
     2.994       2.878
     2.977       2.882
     3.012       2.864
     2.996       2.882

then the difference is even starker, because with CFS the _quality_ of
scheduling decisions has increased. So even if we had increased
micro-costs (which we won't have once the current tuning period is over
and we cast the CFS parameters into constants), the quality of
macro-scheduling can offset that, and not only on the desktop!

so that's why our main focus in CFS was on the macro-properties of
scheduling _first_, with the micro-properties then adjusted to the
macro-constraints as a second layer.

	Ingo

->
---
 include/linux/sched.h |    2 
 kernel/sched.c        |  143 +-
 kernel/sched_fair.c   |   27 +
 kernel/sched_rt.c     |   10 ---
 4 files changed, 92 insertions(+), 90 deletions(-)

Index: linux/include/linux/sched.h
===
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -1396,6 +1396,7 @@ static inline void idle_task_exit(void)
 extern void sched_idle_next(void);
 
+#ifdef CONFIG_SCHED_DEBUG
 extern unsigned int sysctl_sched_granularity;
 extern unsigned int sysctl_sched_wakeup_granularity;
 extern unsigned int sysctl_sched_batch_wakeup_granularity;
@@ -1403,6 +1404,7 @@ extern unsigned int sysctl_sched_stat_gr
 extern unsigned int sysctl_sched_runtime_limit;
 extern unsigned int sysctl_sched_child_runs_first;
 extern unsigned
Re: lmbench ctxsw regression with CFS
On Thu, Aug 02, 2007 at 09:19:56AM +0200, Ingo Molnar wrote:
>
> * Nick Piggin <[EMAIL PROTECTED]> wrote:
>
> > > One thing to check out is whether the lmbench numbers are "correct".
> > > Especially on SMP systems, the lmbench numbers are actually *best*
> > > when the two processes run on the same CPU, even though that's not
> > > really at all the best scheduling - it's just that it artificially
> > > improves lmbench numbers because of the close cache affinity for the
> > > pipe data structures.
> >
> > Yes, I bound them to a single core.
>
> could you send me the .config you used?

Sure, attached... You don't see a regression? If not, then can you send
me the .config you used? Also what CPU architecture (when I tested an
older CFS on a P4 IIRC the regression was much bigger like 100% more
costly).

---
#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.23-rc1
# Tue Jul 31 21:43:43 2007
#
CONFIG_X86_64=y
CONFIG_64BIT=y
CONFIG_X86=y
CONFIG_GENERIC_TIME=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_ZONE_DMA32=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_SEMAPHORE_SLEEPERS=y
CONFIG_MMU=y
CONFIG_ZONE_DMA=y
CONFIG_QUICKLIST=y
CONFIG_NR_QUICK=2
CONFIG_RWSEM_GENERIC_SPINLOCK=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_X86_CMPXCHG=y
CONFIG_EARLY_PRINTK=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_ARCH_POPULATES_NODE_MAP=y
CONFIG_DMI=y
CONFIG_AUDIT_ARCH=y
CONFIG_GENERIC_BUG=y
# CONFIG_ARCH_HAS_ILOG2_U32 is not set
# CONFIG_ARCH_HAS_ILOG2_U64 is not set
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"
#
# Code maturity level options
#
CONFIG_EXPERIMENTAL=y
CONFIG_LOCK_KERNEL=y
CONFIG_INIT_ENV_ARG_LIMIT=32
#
# General setup
#
CONFIG_LOCALVERSION=""
CONFIG_LOCALVERSION_AUTO=y
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
# CONFIG_BSD_PROCESS_ACCT is not set
# CONFIG_TASKSTATS is not set
# CONFIG_USER_NS is not set
# CONFIG_AUDIT is not set
CONFIG_IKCONFIG=y
CONFIG_IKCONFIG_PROC=y
CONFIG_LOG_BUF_SHIFT=18
# CONFIG_CPUSETS is not set
# CONFIG_SYSFS_DEPRECATED is not set
# CONFIG_RELAY is not set
# CONFIG_BLK_DEV_INITRD is not set
CONFIG_CC_OPTIMIZE_FOR_SIZE=y
CONFIG_SYSCTL=y
CONFIG_EMBEDDED=y
# CONFIG_UID16 is not set
CONFIG_SYSCTL_SYSCALL=y
CONFIG_KALLSYMS=y
# CONFIG_KALLSYMS_EXTRA_PASS is not set
CONFIG_HOTPLUG=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_ANON_INODES=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
CONFIG_TIMERFD=y
CONFIG_EVENTFD=y
CONFIG_SHMEM=y
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_SLAB=y
# CONFIG_SLUB is not set
# CONFIG_SLOB is not set
CONFIG_RT_MUTEXES=y
# CONFIG_TINY_SHMEM is not set
CONFIG_BASE_SMALL=0
CONFIG_MODULES=y
CONFIG_MODULE_UNLOAD=y
CONFIG_MODULE_FORCE_UNLOAD=y
# CONFIG_MODVERSIONS is not set
# CONFIG_MODULE_SRCVERSION_ALL is not set
# CONFIG_KMOD is not set
CONFIG_STOP_MACHINE=y
CONFIG_BLOCK=y
# CONFIG_BLK_DEV_IO_TRACE is not set
CONFIG_BLK_DEV_BSG=y
#
# IO Schedulers
#
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_AS=y
CONFIG_IOSCHED_DEADLINE=y
CONFIG_IOSCHED_CFQ=y
CONFIG_DEFAULT_AS=y
# CONFIG_DEFAULT_DEADLINE is not set
# CONFIG_DEFAULT_CFQ is not set
# CONFIG_DEFAULT_NOOP is not set
CONFIG_DEFAULT_IOSCHED="anticipatory"
#
# Processor type and features
#
CONFIG_X86_PC=y
# CONFIG_X86_VSMP is not set
# CONFIG_MK8 is not set
# CONFIG_MPSC is not set
CONFIG_MCORE2=y
# CONFIG_GENERIC_CPU is not set
CONFIG_X86_L1_CACHE_BYTES=64
CONFIG_X86_L1_CACHE_SHIFT=6
CONFIG_X86_INTERNODE_CACHE_BYTES=64
CONFIG_X86_TSC=y
CONFIG_X86_GOOD_APIC=y
CONFIG_MICROCODE=y
CONFIG_MICROCODE_OLD_INTERFACE=y
CONFIG_X86_MSR=y
CONFIG_X86_CPUID=y
CONFIG_X86_HT=y
CONFIG_X86_IO_APIC=y
CONFIG_X86_LOCAL_APIC=y
CONFIG_MTRR=y
CONFIG_SMP=y
# CONFIG_SCHED_SMT is not set
CONFIG_SCHED_MC=y
CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set
# CONFIG_PREEMPT_BKL is not set
# CONFIG_NUMA is not set
CONFIG_ARCH_SPARSEMEM_ENABLE=y
CONFIG_ARCH_FLATMEM_ENABLE=y
CONFIG_SELECT_MEMORY_MODEL=y
CONFIG_FLATMEM_MANUAL=y
# CONFIG_DISCONTIGMEM_MANUAL is not set
# CONFIG_SPARSEMEM_MANUAL is not set
CONFIG_FLATMEM=y
CONFIG_FLAT_NODE_MEM_MAP=y
# CONFIG_SPARSEMEM_STATIC is not set
CONFIG_SPLIT_PTLOCK_CPUS=4
CONFIG_RESOURCES_64BIT=y
CONFIG_ZONE_DMA_FLAG=1
CONFIG_BOUNCE=y
CONFIG_VIRT_TO_BUS=y
CONFIG_NR_CPUS=2
CONFIG_PHYSICAL_ALIGN=0x20
CONFIG_HOTPLUG_CPU=y
CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y
CONFIG_HPET_TIMER=y
CONFIG_HPET_EMULATE_RTC=y
# CONFIG_IOMMU is not set
# CONFIG_CALGARY_IOMMU is not set
CONFIG_X86_MCE=y
CONFIG_X86_MCE_INTEL=y
# CONFIG_X86_MCE_AMD is not set
CONFIG_KEXEC=y
CONFIG_CRASH_DUMP=y
CONFIG_RELOCATABLE=y
CONFIG_PHYSICAL_START=0x20
# CONFIG_SECCOMP is not set
# CONFIG_CC_STACKPROTECTOR is not set
CONFIG_HZ_100=y
# CONFIG_HZ_250 is not set
# CONFIG_HZ_300 is not set
# CONFIG_HZ_1000 is not set
CONFIG_HZ=100
CONFIG_GENERIC_HARDIRQS=y
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_ISA_DMA_API=y
Re: lmbench ctxsw regression with CFS
* Nick Piggin <[EMAIL PROTECTED]> wrote:

> > One thing to check out is whether the lmbench numbers are "correct".
> > Especially on SMP systems, the lmbench numbers are actually *best*
> > when the two processes run on the same CPU, even though that's not
> > really at all the best scheduling - it's just that it artificially
> > improves lmbench numbers because of the close cache affinity for the
> > pipe data structures.
>
> Yes, I bound them to a single core.

could you send me the .config you used?

	Ingo
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: lmbench ctxsw regression with CFS
* Nick Piggin <[EMAIL PROTECTED]> wrote:

> > > > One thing to check out is whether the lmbench numbers are
> > > > "correct". Especially on SMP systems, the lmbench numbers are
> > > > actually *best* when the two processes run on the same CPU, even
> > > > though that's not really at all the best scheduling - it's just
> > > > that it artificially improves lmbench numbers because of the
> > > > close cache affinity for the pipe data structures.
> > >
> > > Yes, I bound them to a single core.
> >
> > could you send me the .config you used?
>
> Sure, attached... You don't see a regression? If not, then can you
> send me the .config you used? [...]

i used your config to get a few numbers and to see what happens. Here's
the numbers of 10 consecutive "lat_ctx -s 0 2" runs:

  [ time in micro-seconds, smaller is better ]

   v2.6.22    v2.6.23-git    v2.6.23-git+const-param
   -------    -----------    -----------------------
    1.30         1.60            1.19
    1.30         1.36            1.18
    1.14         1.50            1.01
    1.26         1.27            1.23
    1.22         1.40            1.04
    1.13         1.34            1.09
    1.27         1.39            1.05
    1.20         1.30            1.16
    1.20         1.17            1.16
    1.25         1.33            1.01
   -------------------------------------------------
   avg: 1.22     1.36 (+11.3%)   1.11 (-10.3%)
   min: 1.13     1.17 ( +3.5%)   1.01 (-11.8%)
   max: 1.27     1.60 (+26.0%)   1.23 ( -3.2%)

one reason for the extra overhead is the current tunability of CFS, but
that is not fundamental, it's caused by the many knobs that CFS has at
the moment. The const-tuning patch (attached below, results in the
rightmost column) changes those knobs to constants, allowing the
compiler to optimize the math better and reduce code size. (the code
movement in the patch makes up for most of its size, the change that it
does is simple otherwise.)

so CFS can be faster at micro-context-switching than 2.6.22. But, at
this point i'd also like to warn against putting _too_ much emphasis on
lat_ctx numbers in general. lat_ctx prints a 'derived' micro-benchmark
number. It uses a pair of pipes to context-switch between tasks but only
prints the delta overhead that context-switching causes.

The 'full' latency of the pipe operations can be seen via the following
pipe-test.c code:

   http://redhat.com/~mingo/cfs-scheduler/tools/pipe-test.c

run it to see the full cost:

   neptune:~> ./pipe-test
   4.67 usecs/loop.
   4.41 usecs/loop.
   4.46 usecs/loop.
   4.46 usecs/loop.
   4.44 usecs/loop.
   4.41 usecs/loop.

so the _full_ cost, of even this micro-benchmark, is 4-5 microseconds,
not 1 microsecond. So even this artificial micro-benchmark sees an
actual slowdown of only 2.8%.

if you check a macro-benchmark like "hackbench 50":

  [ time in seconds, smaller is better ]

   v2.6.22    v2.6.23-cfs
   -------    -----------
    3.019       2.842
    2.994       2.878
    2.977       2.882
    3.012       2.864
    2.996       2.882

then the difference is even starker because with CFS the _quality_ of
scheduling decisions has increased. So even if we had increased
micro-costs (which we wont have once the current tuning period is over
and we cast the CFS parameters into constants), the quality of
macro-scheduling can offset that, and not only on the desktop!

so that's why our main focus in CFS was on the macro-properties of
scheduling _first_, and then the micro-properties are adjusted to the
macro-constraints as a second layer.

	Ingo

---
 include/linux/sched.h |    2
 kernel/sched.c        |  143 +-
 kernel/sched_fair.c   |   27 +
 kernel/sched_rt.c     |   10 ---
 4 files changed, 92 insertions(+), 90 deletions(-)

Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -1396,6 +1396,7 @@ static inline void idle_task_exit(void)
 extern void sched_idle_next(void);
+#ifdef CONFIG_SCHED_DEBUG
 extern unsigned int sysctl_sched_granularity;
 extern unsigned int sysctl_sched_wakeup_granularity;
 extern unsigned int sysctl_sched_batch_wakeup_granularity;
@@ -1403,6 +1404,7 @@ extern unsigned int sysctl_sched_stat_gr
 extern unsigned int sysctl_sched_runtime_limit;
 extern unsigned int sysctl_sched_child_runs_first;
 extern unsigned int sysctl_sched_features;
+#endif

 #ifdef
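For readers who cannot fetch pipe-test.c, here is a rough, self-contained sketch of the kind of pipe ping-pong that lat_ctx and pipe-test time: two processes bounce a one-byte token over a pair of pipes, forcing a context switch per hop. This is an illustration under stated assumptions, not the actual lmbench or pipe-test source; the function name and iteration count are made up.

```c
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Time a pipe ping-pong between parent and child; returns the average
 * full round-trip cost per loop in microseconds, or -1.0 on error. */
double pipe_pingpong_usecs(int iterations)
{
    int to_child[2], to_parent[2];
    char token = 'x';

    if (pipe(to_child) < 0 || pipe(to_parent) < 0)
        return -1.0;

    pid_t pid = fork();
    if (pid < 0)
        return -1.0;
    if (pid == 0) {               /* child: echo every token back */
        for (int i = 0; i < iterations; i++) {
            if (read(to_child[0], &token, 1) != 1)
                break;
            if (write(to_parent[1], &token, 1) != 1)
                break;
        }
        _exit(0);
    }

    struct timeval start, end;
    gettimeofday(&start, NULL);
    for (int i = 0; i < iterations; i++) {
        /* each write wakes the peer; each read blocks until it answers,
         * so every loop iteration costs two context switches */
        write(to_child[1], &token, 1);
        read(to_parent[0], &token, 1);
    }
    gettimeofday(&end, NULL);
    waitpid(pid, NULL, 0);

    double usecs = (end.tv_sec - start.tv_sec) * 1e6 +
                   (end.tv_usec - start.tv_usec);
    return usecs / iterations;    /* full cost, like pipe-test reports */
}
```

The point Ingo makes above maps directly onto this sketch: the pipe read/write system-call work is part of every loop, so the scheduler's share of the measured round trip is only a fraction of the printed number.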
Re: lmbench ctxsw regression with CFS
On Thu, Aug 02, 2007 at 05:44:47PM +0200, Ingo Molnar wrote:
> * Nick Piggin <[EMAIL PROTECTED]> wrote:
>
> > > > > One thing to check out is whether the lmbench numbers are
> > > > > "correct". Especially on SMP systems, the lmbench numbers are
> > > > > actually *best* when the two processes run on the same CPU,
> > > > > even though that's not really at all the best scheduling -
> > > > > it's just that it artificially improves lmbench numbers
> > > > > because of the close cache affinity for the pipe data
> > > > > structures.
> > > >
> > > > Yes, I bound them to a single core.
> > >
> > > could you send me the .config you used?
> >
> > Sure, attached... You don't see a regression? If not, then can you
> > send me the .config you used? [...]
>
> i used your config to get a few numbers and to see what happens.
> Here's the numbers of 10 consecutive "lat_ctx -s 0 2" runs:
>
>   [ time in micro-seconds, smaller is better ]
>
>    v2.6.22    v2.6.23-git    v2.6.23-git+const-param
>    -------    -----------    -----------------------
>     1.30         1.60            1.19
>     1.30         1.36            1.18
>     1.14         1.50            1.01
>     1.26         1.27            1.23
>     1.22         1.40            1.04
>     1.13         1.34            1.09
>     1.27         1.39            1.05
>     1.20         1.30            1.16
>     1.20         1.17            1.16
>     1.25         1.33            1.01
>    -------------------------------------------------
>    avg: 1.22     1.36 (+11.3%)   1.11 (-10.3%)
>    min: 1.13     1.17 ( +3.5%)   1.01 (-11.8%)
>    max: 1.27     1.60 (+26.0%)   1.23 ( -3.2%)
>
> one reason for the extra overhead is the current tunability of CFS,
> but that is not fundamental, it's caused by the many knobs that CFS
> has at the moment. The const-tuning patch (attached below, results in
> the rightmost column) changes those knobs to constants, allowing the
> compiler to optimize the math better and reduce code size. (the code
> movement in the patch makes up for most of its size, the change that
> it does is simple otherwise.) [...]

Oh good. Thanks for getting to the bottom of it. We have normally
disliked too much runtime tunables in the scheduler, so I assume these
are mostly going away or under a CONFIG option for 2.6.23? Or...?

What CPU did you get these numbers on? Do the indirect calls hurt much
on those without an indirect predictor? (I'll try running some tests).

I must say that I don't really like the indirect calls a great deal,
and they could be eliminated just with a couple of branches and direct
calls.
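The two dispatch styles Nick contrasts can be sketched in a few lines. This is a toy model, not the kernel's actual sched_class code; all names below (struct task, pick_next_fair, dispatch_indirect, etc.) are made up for illustration. The first style calls through a per-class ops table (an indirect call the CPU must predict); the second selects the callee with an ordinary branch and direct calls, which the compiler can also inline.

```c
/* Toy model of the two scheduler-dispatch styles under discussion.
 * All names are hypothetical; policy 0 = "fair", 1 = "rt". */
struct task { int policy; };

static int pick_next_fair(void) { return 1; }
static int pick_next_rt(void)   { return 2; }

/* Style 1: indirect call through a per-class function pointer,
 * in the spirit of CFS's sched_class ops tables. */
struct sched_class_ops { int (*pick_next)(void); };
static const struct sched_class_ops fair_ops = { pick_next_fair };
static const struct sched_class_ops rt_ops   = { pick_next_rt };

static int dispatch_indirect(const struct task *t)
{
    const struct sched_class_ops *ops = t->policy ? &rt_ops : &fair_ops;
    return ops->pick_next();    /* indirect branch: needs prediction */
}

/* Style 2: a couple of branches plus direct calls, as Nick suggests;
 * the targets are compile-time constants, so inlining is possible. */
static int dispatch_direct(const struct task *t)
{
    if (t->policy)
        return pick_next_rt();
    return pick_next_fair();
}
```

On a CPU without an indirect-branch predictor, style 1 pays a pipeline stall per call in a hot path like schedule(), which is the cost Nick is asking about.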
Re: lmbench ctxsw regression with CFS
On Wed, Aug 01, 2007 at 07:31:26PM -0700, Linus Torvalds wrote:
>
> On Thu, 2 Aug 2007, Nick Piggin wrote:
> >
> > lmbench 3 lat_ctx context switching time with 2 processes bound to a
> > single core increases by between 25%-35% on my Core2 system (didn't do
> > enough runs to get more significance, but it is around 30%). The problem
> > bisected to the main CFS commit.
>
> One thing to check out is whether the lmbench numbers are "correct".
> Especially on SMP systems, the lmbench numbers are actually *best* when
> the two processes run on the same CPU, even though that's not really at
> all the best scheduling - it's just that it artificially improves lmbench
> numbers because of the close cache affinity for the pipe data structures.

Yes, I bound them to a single core.

> So when running the lmbench scheduling benchmarks on SMP, it actually
> makes sense to run them *pinned* to one CPU, because then you see the true
> scheduler performance. Otherwise you easily get noise due to balancing
> issues, and a clearly better scheduler can in fact generate worse
> numbers for lmbench.
>
> Did you do that? It's at least worth testing. I'm not saying it's the case
> here, but it's one reason why lmbench3 has the option to either keep
> processes on the same CPU or force them to spread out (and both cases are
> very interesting for scheduler testing, and tell different things: the
> "pin them to the same CPU" shows the latency on one runqueue, while the
> "pin them to different CPU's" shows the latency of a remote wakeup).
>
> IOW, while we used the lmbench scheduling benchmark pretty extensively in
> early scheduler tuning, if you select the defaults ("let the system just
> schedule processes on any CPU") the end result really isn't necessarily a
> very meaningful value: getting the best lmbench numbers actually requires
> you to do things that tend to be actively *bad* in real life.
>
> Of course, a perfect scheduler would notice when two tasks are *so*
> closely related and only do synchronous wakeups, that it would keep them
> on the same core, and get the best possible scores for lmbench, while not
> doing that for other real-life situations. So with a *really* smart
> scheduler, lmbench numbers would always be optimal, but I'm not sure
> aiming for that kind of perfection is even worth it!

Agreed with all your comments on multiprocessor balancing, but that was
eliminated in these tests. Remote wakeup latency is another thing I want
to test, but it isn't so interesting until the serial regression is
fixed.
Re: lmbench ctxsw regression with CFS
On Thu, 2 Aug 2007, Nick Piggin wrote:
>
> lmbench 3 lat_ctx context switching time with 2 processes bound to a
> single core increases by between 25%-35% on my Core2 system (didn't do
> enough runs to get more significance, but it is around 30%). The problem
> bisected to the main CFS commit.

One thing to check out is whether the lmbench numbers are "correct".
Especially on SMP systems, the lmbench numbers are actually *best* when
the two processes run on the same CPU, even though that's not really at
all the best scheduling - it's just that it artificially improves lmbench
numbers because of the close cache affinity for the pipe data structures.

So when running the lmbench scheduling benchmarks on SMP, it actually
makes sense to run them *pinned* to one CPU, because then you see the true
scheduler performance. Otherwise you easily get noise due to balancing
issues, and a clearly better scheduler can in fact generate worse numbers
for lmbench.

Did you do that? It's at least worth testing. I'm not saying it's the case
here, but it's one reason why lmbench3 has the option to either keep
processes on the same CPU or force them to spread out (and both cases are
very interesting for scheduler testing, and tell different things: the
"pin them to the same CPU" shows the latency on one runqueue, while the
"pin them to different CPU's" shows the latency of a remote wakeup).

IOW, while we used the lmbench scheduling benchmark pretty extensively in
early scheduler tuning, if you select the defaults ("let the system just
schedule processes on any CPU") the end result really isn't necessarily a
very meaningful value: getting the best lmbench numbers actually requires
you to do things that tend to be actively *bad* in real life.

Of course, a perfect scheduler would notice when two tasks are *so*
closely related and only do synchronous wakeups, that it would keep them
on the same core, and get the best possible scores for lmbench, while not
doing that for other real-life situations. So with a *really* smart
scheduler, lmbench numbers would always be optimal, but I'm not sure
aiming for that kind of perfection is even worth it!

		Linus
lmbench ctxsw regression with CFS
Hi,

I didn't follow all of the scheduler debates and flamewars, so apologies
if this was already covered. Anyway.

lmbench 3 lat_ctx context switching time with 2 processes bound to a
single core increases by between 25%-35% on my Core2 system (didn't do
enough runs to get more significance, but it is around 30%). The problem
bisected to the main CFS commit.

I was really hoping that a smaller runqueue data structure could
actually increase performance with the common case of small numbers of
tasks :(

I assume this was a known issue before CFS was merged. Do you know what
is causing the slowdown? Any plans to fix it?

Thanks,
Nick
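The "bound to a single core" setup that both Nick and Linus rely on can be reproduced with a small affinity helper. This is a hedged sketch, not lmbench code; the helper name is made up, and `sched_setaffinity` is Linux-specific (lmbench's own `-s` handling differs).

```c
#define _GNU_SOURCE
#include <sched.h>

/* Hypothetical helper: pin the calling thread to one CPU before
 * benchmarking, so scheduler latency is measured on a single runqueue
 * without load-balancing noise. Returns 0 on success, -1 on error. */
static int pin_to_cpu(int cpu)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    /* pid 0 means the calling thread */
    return sched_setaffinity(0, sizeof(set), &set);
}
```

From the shell, the equivalent for an existing benchmark binary would be something like `taskset -c 0 lat_ctx -s 0 2`, which pins both processes to CPU 0 as in these tests.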