Re: lmbench ctxsw regression with CFS

2007-08-16 Thread Siddha, Suresh B
On Tue, Aug 14, 2007 at 05:23:00AM +0200, Nick Piggin wrote:
> On Mon, Aug 13, 2007 at 08:00:38PM -0700, Andrew Morton wrote:
> > Put it this way: if a 50% slowdown in context switch times yields a 5%
> > improvement in, say, balancing decisions then it's probably a net win.
> > 
> > Guys, repeat after me: "context switch is not a fast path".  Take that
> > benchmark and set fire to it.
> 
> It definitely can be. For workloads that are inherently asynchronous, high
> speed networking or disk IO (ie. with event generation significantly outside
> the control of the kernel or app), then it can be. Sure, you may just be
> switching between the main working thread and idle thread, but in that case a
> slowdown in the scheduler will be _more_ pronounced because you don't have to
> do as much work to actually switch contexts.
> 
> If there was a performance tradeoff involved, then we could think about it,
> and you might be right. But this is just a case of "write code to do direct
> calls or do indirect calls".
> 
> Ken Chen's last ia64 database benchmark I could find says schedule takes
> 6.5% of the clock cycles, the second highest consumer. Considering the
> lengths he was going to shave cycles off other paths, I'd call schedule()
> a fastpath. Would be really interesting to rerun that benchmark with CFS.
> Is anyone at Intel still doing those tests?

Yes. schedule() still is in the top 2-3 consumers of kernel time for that
workload. We did some tests when CFS was in initial days (I think V2 or so)
and it didn't show any regression.

We have plans to run that workload with 2.6.23-rc kernels, but other things
were taking priority so far...

thanks,
suresh
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: lmbench ctxsw regression with CFS

2007-08-13 Thread David Miller
From: Andrew Morton <[EMAIL PROTECTED]>
Date: Mon, 13 Aug 2007 20:00:38 -0700

> Guys, repeat after me: "context switch is not a fast path".  Take
> that benchmark and set fire to it.

Nothing in this world is so absolute :-)

Regardless of the value of lat_ctx, we should thank it for showing
that something is more expensive now.  And it definitely warrants
figuring out what paths are big cycle eaters now and why.

Not "on IA64 indirect calls are expensive, so that must be it", but
rather doing cycle analysis of the relevant scheduling functions to
figure out what might be wrong.

I'm willing to bet it's something completely trivial and easy
to amend, rather than a fundamental issue.  But somebody has
to _look_ instead of supposing this and that.



Re: lmbench ctxsw regression with CFS

2007-08-13 Thread Nick Piggin
On Mon, Aug 13, 2007 at 08:00:38PM -0700, Andrew Morton wrote:
> On Mon, 13 Aug 2007 14:30:31 +0200 Jens Axboe <[EMAIL PROTECTED]> wrote:
> 
> > On Mon, Aug 06 2007, Nick Piggin wrote:
> > > > > What CPU did you get these numbers on? Do the indirect calls hurt 
> > > > > much 
> > > > > on those without an indirect predictor? (I'll try running some tests).
> > > > 
> > > > it was on an older Athlon64 X2. I never saw indirect calls really 
> > > > hurting on modern x86 CPUs - dont both CPU makers optimize them pretty 
> > > > efficiently? (as long as the target function is always the same - which 
> > > > it is here.)
> > > 
> > > I think a lot of CPUs do. I think ia64 does not. It predicts
> > > based on the contents of a branch target register which has to
> > > be loaded I presume before instruction fetch reaches the branch.
> > > I don't know if this would hurt or not.
> > 
> > Testing on ia64 showed that the indirect calls in the io scheduler hurt
> > quite a bit, so I'd be surprised if the impact here wasn't an issue
> > there.
> 
> With what workload?  lmbench ctxsw?  Who cares?
> 
> Look, if you're doing 100,000 context switches per second then *that*
> is your problem.  You suck, and making context switches a bit faster
> doesn't stop you from sucking.  And ten microseconds is a very long time
> indeed.
> 
> Put it this way: if a 50% slowdown in context switch times yields a 5%
> improvement in, say, balancing decisions then it's probably a net win.
> 
> Guys, repeat after me: "context switch is not a fast path".  Take that
> benchmark and set fire to it.

It definitely can be. For workloads that are inherently asynchronous, high
speed networking or disk IO (ie. with event generation significantly outside
the control of the kernel or app), then it can be. Sure, you may just be
switching between the main working thread and idle thread, but in that case a
slowdown in the scheduler will be _more_ pronounced because you don't have to
do as much work to actually switch contexts.

If there was a performance tradeoff involved, then we could think about it,
and you might be right. But this is just a case of "write code to do direct
calls or do indirect calls".

Ken Chen's last ia64 database benchmark I could find says schedule takes
6.5% of the clock cycles, the second highest consumer. Considering the
lengths he was going to shave cycles off other paths, I'd call schedule()
a fastpath. Would be really interesting to rerun that benchmark with CFS.
Is anyone at Intel still doing those tests?



Re: lmbench ctxsw regression with CFS

2007-08-13 Thread Andrew Morton
On Mon, 13 Aug 2007 14:30:31 +0200 Jens Axboe <[EMAIL PROTECTED]> wrote:

> On Mon, Aug 06 2007, Nick Piggin wrote:
> > > > What CPU did you get these numbers on? Do the indirect calls hurt much 
> > > > on those without an indirect predictor? (I'll try running some tests).
> > > 
> > > it was on an older Athlon64 X2. I never saw indirect calls really 
> > > hurting on modern x86 CPUs - dont both CPU makers optimize them pretty 
> > > efficiently? (as long as the target function is always the same - which 
> > > it is here.)
> > 
> > I think a lot of CPUs do. I think ia64 does not. It predicts
> > based on the contents of a branch target register which has to
> > be loaded I presume before instruction fetch reaches the branch.
> > I don't know if this would hurt or not.
> 
> Testing on ia64 showed that the indirect calls in the io scheduler hurt
> quite a bit, so I'd be surprised if the impact here wasn't an issue
> there.

With what workload?  lmbench ctxsw?  Who cares?

Look, if you're doing 100,000 context switches per second then *that*
is your problem.  You suck, and making context switches a bit faster
doesn't stop you from sucking.  And ten microseconds is a very long time
indeed.

Put it this way: if a 50% slowdown in context switch times yields a 5%
improvement in, say, balancing decisions then it's probably a net win.

Guys, repeat after me: "context switch is not a fast path".  Take that
benchmark and set fire to it.



Re: lmbench ctxsw regression with CFS

2007-08-13 Thread Jens Axboe
On Mon, Aug 06 2007, Nick Piggin wrote:
> > > What CPU did you get these numbers on? Do the indirect calls hurt much 
> > > on those without an indirect predictor? (I'll try running some tests).
> > 
> > it was on an older Athlon64 X2. I never saw indirect calls really 
> > hurting on modern x86 CPUs - dont both CPU makers optimize them pretty 
> > efficiently? (as long as the target function is always the same - which 
> > it is here.)
> 
> I think a lot of CPUs do. I think ia64 does not. It predicts
> based on the contents of a branch target register which has to
> be loaded I presume before instruction fetch reaches the branch.
> I don't know if this would hurt or not.

Testing on ia64 showed that the indirect calls in the io scheduler hurt
quite a bit, so I'd be surprised if the impact here wasn't an issue
there.

-- 
Jens Axboe



Re: lmbench ctxsw regression with CFS

2007-08-05 Thread Nick Piggin
On Sat, Aug 04, 2007 at 08:50:37AM +0200, Ingo Molnar wrote:
> 
> * Nick Piggin <[EMAIL PROTECTED]> wrote:
> 
> > Oh good. Thanks for getting to the bottom of it. We have normally 
> > disliked too much runtime tunables in the scheduler, so I assume these 
> > are mostly going away or under a CONFIG option for 2.6.23? Or...?
> 
> yeah, they are all already under CONFIG_SCHED_DEBUG. (it's just that the 
> add-on optimization is not upstream yet - the tunings are still being 

Ah, OK. So long as that goes upstream I'm happy... and it is good
to see that with that patch, the base context switching performance
_has_ actually gone up like I had hoped. Nice.


> tested) Btw., with SCHED_DEBUG we now also have your domain-tree sysctl 
> patch upstream, which has been in -mm for a near eternity.
> 
> > What CPU did you get these numbers on? Do the indirect calls hurt much 
> > on those without an indirect predictor? (I'll try running some tests).
> 
> it was on an older Athlon64 X2. I never saw indirect calls really 
> hurting on modern x86 CPUs - dont both CPU makers optimize them pretty 
> efficiently? (as long as the target function is always the same - which 
> it is here.)

I think a lot of CPUs do. I think ia64 does not. It predicts
based on the contents of a branch target register which has to
be loaded I presume before instruction fetch reaches the branch.
I don't know if this would hurt or not.


> > I must say that I don't really like the indirect calls a great deal, 
> > and they could be eliminated just with a couple of branches and direct 
> > calls.
> 
> yeah - i'll try that too. We can make the indirect call the uncommon 
> case and a NULL pointer be the common case, combined with a 'default', 
> direct function call. But i doubt it makes a big (or even measurable) 
> difference.

You might be right there.


Re: lmbench ctxsw regression with CFS

2007-08-04 Thread Ingo Molnar

* Nick Piggin <[EMAIL PROTECTED]> wrote:

> Oh good. Thanks for getting to the bottom of it. We have normally 
> disliked too much runtime tunables in the scheduler, so I assume these 
> are mostly going away or under a CONFIG option for 2.6.23? Or...?

yeah, they are all already under CONFIG_SCHED_DEBUG. (it's just that the 
add-on optimization is not upstream yet - the tunings are still being 
tested) Btw., with SCHED_DEBUG we now also have your domain-tree sysctl 
patch upstream, which has been in -mm for a near eternity.

> What CPU did you get these numbers on? Do the indirect calls hurt much 
> on those without an indirect predictor? (I'll try running some tests).

it was on an older Athlon64 X2. I never saw indirect calls really 
hurting on modern x86 CPUs - dont both CPU makers optimize them pretty 
efficiently? (as long as the target function is always the same - which 
it is here.)

> I must say that I don't really like the indirect calls a great deal, 
> and they could be eliminated just with a couple of branches and direct 
> calls.

yeah - i'll try that too. We can make the indirect call the uncommon 
case and a NULL pointer be the common case, combined with a 'default', 
direct function call. But i doubt it makes a big (or even measurable) 
difference.

Ingo


Re: lmbench ctxsw regression with CFS

2007-08-02 Thread Nick Piggin
On Thu, Aug 02, 2007 at 05:44:47PM +0200, Ingo Molnar wrote:
> 
> * Nick Piggin <[EMAIL PROTECTED]> wrote:
> 
> > > > > One thing to check out is whether the lmbench numbers are 
> > > > > "correct". Especially on SMP systems, the lmbench numbers are 
> > > > > actually *best* when the two processes run on the same CPU, even 
> > > > > though that's not really at all the best scheduling - it's just 
> > > > > that it artificially improves lmbench numbers because of the 
> > > > > close cache affinity for the pipe data structures.
> > > > 
> > > > Yes, I bound them to a single core.
> > > 
> > > could you send me the .config you used?
> > 
> > Sure, attached...
> > 
> > You don't see a regression? If not, then can you send me the .config 
> > you used? [...]
> 
> i used your config to get a few numbers and to see what happens. Here's 
> the numbers of 10 consecutive "lat_ctx -s 0 2" runs:
> 
> [ time in micro-seconds, smaller is better ]
> 
> v2.6.22 v2.6.23-git  v2.6.23-git+const-param
> --- ---  ---
>  1.30  1.60   1.19
>  1.30  1.36   1.18
>  1.14  1.50   1.01
>  1.26  1.27   1.23
>  1.22  1.40   1.04
>  1.13  1.34   1.09
>  1.27  1.39   1.05
>  1.20  1.30   1.16
>  1.20  1.17   1.16
>  1.25  1.33   1.01
> ---------------------------------------------------------
>   avg:   1.22  1.36 (+11.3%)  1.11 (-10.3%)
>   min:   1.13  1.17 ( +3.5%)  1.01 (-11.8%)
>   max:   1.27  1.60 (+26.0%)  1.23 ( -3.2%)
> 
> one reason for the extra overhead is the current tunability of CFS, but 
> that is not fundamental, it's caused by the many knobs that CFS has at 
> the moment. The const-tuning patch (attached below, results in the 
> rightmost column) changes those knobs to constants, allowing the 
> compiler to optimize the math better and reduce code size. (the code 
> movement in the patch makes up for most of its size, the change that it 
> does is simple otherwise.)

[...]

Oh good. Thanks for getting to the bottom of it. We have normally
disliked too much runtime tunables in the scheduler, so I assume
these are mostly going away or under a CONFIG option for 2.6.23?
Or...?

What CPU did you get these numbers on? Do the indirect calls hurt
much on those without an indirect predictor? (I'll try running some
tests).

I must say that I don't really like the indirect calls a great deal,
and they could be eliminated just with a couple of branches and
direct calls.


Re: lmbench ctxsw regression with CFS

2007-08-02 Thread Ingo Molnar

* Nick Piggin <[EMAIL PROTECTED]> wrote:

> > > > One thing to check out is whether the lmbench numbers are 
> > > > "correct". Especially on SMP systems, the lmbench numbers are 
> > > > actually *best* when the two processes run on the same CPU, even 
> > > > though that's not really at all the best scheduling - it's just 
> > > > that it artificially improves lmbench numbers because of the 
> > > > close cache affinity for the pipe data structures.
> > > 
> > > Yes, I bound them to a single core.
> > 
> > could you send me the .config you used?
> 
> Sure, attached...
> 
> You don't see a regression? If not, then can you send me the .config 
> you used? [...]

i used your config to get a few numbers and to see what happens. Here's 
the numbers of 10 consecutive "lat_ctx -s 0 2" runs:

[ time in micro-seconds, smaller is better ]

v2.6.22 v2.6.23-git  v2.6.23-git+const-param
--- ---  ---
 1.30  1.60   1.19
 1.30  1.36   1.18
 1.14  1.50   1.01
 1.26  1.27   1.23
 1.22  1.40   1.04
 1.13  1.34   1.09
 1.27  1.39   1.05
 1.20  1.30   1.16
 1.20  1.17   1.16
 1.25  1.33   1.01
   ---------------------------------------------------------
  avg:   1.22  1.36 (+11.3%)  1.11 (-10.3%)
  min:   1.13  1.17 ( +3.5%)  1.01 (-11.8%)
  max:   1.27  1.60 (+26.0%)  1.23 ( -3.2%)

one reason for the extra overhead is the current tunability of CFS, but 
that is not fundamental, it's caused by the many knobs that CFS has at 
the moment. The const-tuning patch (attached below, results in the 
rightmost column) changes those knobs to constants, allowing the 
compiler to optimize the math better and reduce code size. (the code 
movement in the patch makes up for most of its size, the change that it 
does is simple otherwise.)

so CFS can be faster at micro-context-switching than 2.6.22. But, at 
this point i'd also like to warn against putting _too_ much emphasis on 
lat_ctx numbers in general. lat_ctx prints a 'derived' micro-benchmark 
number. It uses a pair of pipes to context-switch between tasks but only 
prints the delta overhead that context-switching causes. The 'full' 
latency of the pipe operations can be seen via the following pipe-test.c 
code:

   http://redhat.com/~mingo/cfs-scheduler/tools/pipe-test.c

run it to see the full cost:

   neptune:~> ./pipe-test
   4.67 usecs/loop.
   4.41 usecs/loop.
   4.46 usecs/loop.
   4.46 usecs/loop.
   4.44 usecs/loop.
   4.41 usecs/loop.

so the _full_ cost, of even this micro-benchmark, is 4-5 microseconds, 
not 1 microsecond. So even this artificial micro-benchmark sees an 
actual slowdown of only 2.8%.

if you check a macro-benchmark like "hackbench 50":

 [ time in seconds, smaller is better ]

 v2.6.22  v2.6.23-cfs
 ---  ---
  3.019  2.842
  2.994  2.878
  2.977  2.882
  3.012  2.864
  2.996  2.882

then the difference is even starker because with CFS the _quality_ of 
scheduling decisions has increased. So even if we had increased 
micro-costs (which we wont have once the current tuning period is over 
and we cast the CFS parameters into constants), the quality of 
macro-scheduling can offset that, and not only on the desktop!

so that's why our main focus in CFS was on the macro-properties of 
scheduling _first_, and then the micro-properties are adjusted to the 
macro-constraints as a second layer.

Ingo

->
---
 include/linux/sched.h |2 
 kernel/sched.c|  143 +-
 kernel/sched_fair.c   |   27 +
 kernel/sched_rt.c |   10 ---
 4 files changed, 92 insertions(+), 90 deletions(-)

Index: linux/include/linux/sched.h
===
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -1396,6 +1396,7 @@ static inline void idle_task_exit(void) 
 
 extern void sched_idle_next(void);
 
+#ifdef CONFIG_SCHED_DEBUG
 extern unsigned int sysctl_sched_granularity;
 extern unsigned int sysctl_sched_wakeup_granularity;
 extern unsigned int sysctl_sched_batch_wakeup_granularity;
@@ -1403,6 +1404,7 @@ extern unsigned int sysctl_sched_stat_gr
 extern unsigned int sysctl_sched_runtime_limit;
 extern unsigned int sysctl_sched_child_runs_first;
 extern unsigned 

Re: lmbench ctxsw regression with CFS

2007-08-02 Thread Nick Piggin
On Thu, Aug 02, 2007 at 09:19:56AM +0200, Ingo Molnar wrote:
> 
> * Nick Piggin <[EMAIL PROTECTED]> wrote:
> 
> > > One thing to check out is whether the lmbench numbers are "correct". 
> > > Especially on SMP systems, the lmbench numbers are actually *best* 
> > > when the two processes run on the same CPU, even though that's not 
> > > really at all the best scheduling - it's just that it artificially 
> > > improves lmbench numbers because of the close cache affinity for the 
> > > pipe data structures.
> > 
> > Yes, I bound them to a single core.
> 
> could you send me the .config you used?

Sure, attached...

You don't see a regression? If not, then can you send me the .config you
used? Also what CPU architecture (when I tested an older CFS on a P4 IIRC
the regression was much bigger like 100% more costly).

---

#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.23-rc1
# Tue Jul 31 21:43:43 2007
#
CONFIG_X86_64=y
CONFIG_64BIT=y
CONFIG_X86=y
CONFIG_GENERIC_TIME=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_ZONE_DMA32=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_SEMAPHORE_SLEEPERS=y
CONFIG_MMU=y
CONFIG_ZONE_DMA=y
CONFIG_QUICKLIST=y
CONFIG_NR_QUICK=2
CONFIG_RWSEM_GENERIC_SPINLOCK=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_X86_CMPXCHG=y
CONFIG_EARLY_PRINTK=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_ARCH_POPULATES_NODE_MAP=y
CONFIG_DMI=y
CONFIG_AUDIT_ARCH=y
CONFIG_GENERIC_BUG=y
# CONFIG_ARCH_HAS_ILOG2_U32 is not set
# CONFIG_ARCH_HAS_ILOG2_U64 is not set
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"

#
# Code maturity level options
#
CONFIG_EXPERIMENTAL=y
CONFIG_LOCK_KERNEL=y
CONFIG_INIT_ENV_ARG_LIMIT=32

#
# General setup
#
CONFIG_LOCALVERSION=""
CONFIG_LOCALVERSION_AUTO=y
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
# CONFIG_BSD_PROCESS_ACCT is not set
# CONFIG_TASKSTATS is not set
# CONFIG_USER_NS is not set
# CONFIG_AUDIT is not set
CONFIG_IKCONFIG=y
CONFIG_IKCONFIG_PROC=y
CONFIG_LOG_BUF_SHIFT=18
# CONFIG_CPUSETS is not set
# CONFIG_SYSFS_DEPRECATED is not set
# CONFIG_RELAY is not set
# CONFIG_BLK_DEV_INITRD is not set
CONFIG_CC_OPTIMIZE_FOR_SIZE=y
CONFIG_SYSCTL=y
CONFIG_EMBEDDED=y
# CONFIG_UID16 is not set
CONFIG_SYSCTL_SYSCALL=y
CONFIG_KALLSYMS=y
# CONFIG_KALLSYMS_EXTRA_PASS is not set
CONFIG_HOTPLUG=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_ANON_INODES=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
CONFIG_TIMERFD=y
CONFIG_EVENTFD=y
CONFIG_SHMEM=y
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_SLAB=y
# CONFIG_SLUB is not set
# CONFIG_SLOB is not set
CONFIG_RT_MUTEXES=y
# CONFIG_TINY_SHMEM is not set
CONFIG_BASE_SMALL=0
CONFIG_MODULES=y
CONFIG_MODULE_UNLOAD=y
CONFIG_MODULE_FORCE_UNLOAD=y
# CONFIG_MODVERSIONS is not set
# CONFIG_MODULE_SRCVERSION_ALL is not set
# CONFIG_KMOD is not set
CONFIG_STOP_MACHINE=y
CONFIG_BLOCK=y
# CONFIG_BLK_DEV_IO_TRACE is not set
CONFIG_BLK_DEV_BSG=y

#
# IO Schedulers
#
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_AS=y
CONFIG_IOSCHED_DEADLINE=y
CONFIG_IOSCHED_CFQ=y
CONFIG_DEFAULT_AS=y
# CONFIG_DEFAULT_DEADLINE is not set
# CONFIG_DEFAULT_CFQ is not set
# CONFIG_DEFAULT_NOOP is not set
CONFIG_DEFAULT_IOSCHED="anticipatory"

#
# Processor type and features
#
CONFIG_X86_PC=y
# CONFIG_X86_VSMP is not set
# CONFIG_MK8 is not set
# CONFIG_MPSC is not set
CONFIG_MCORE2=y
# CONFIG_GENERIC_CPU is not set
CONFIG_X86_L1_CACHE_BYTES=64
CONFIG_X86_L1_CACHE_SHIFT=6
CONFIG_X86_INTERNODE_CACHE_BYTES=64
CONFIG_X86_TSC=y
CONFIG_X86_GOOD_APIC=y
CONFIG_MICROCODE=y
CONFIG_MICROCODE_OLD_INTERFACE=y
CONFIG_X86_MSR=y
CONFIG_X86_CPUID=y
CONFIG_X86_HT=y
CONFIG_X86_IO_APIC=y
CONFIG_X86_LOCAL_APIC=y
CONFIG_MTRR=y
CONFIG_SMP=y
# CONFIG_SCHED_SMT is not set
CONFIG_SCHED_MC=y
CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set
# CONFIG_PREEMPT_BKL is not set
# CONFIG_NUMA is not set
CONFIG_ARCH_SPARSEMEM_ENABLE=y
CONFIG_ARCH_FLATMEM_ENABLE=y
CONFIG_SELECT_MEMORY_MODEL=y
CONFIG_FLATMEM_MANUAL=y
# CONFIG_DISCONTIGMEM_MANUAL is not set
# CONFIG_SPARSEMEM_MANUAL is not set
CONFIG_FLATMEM=y
CONFIG_FLAT_NODE_MEM_MAP=y
# CONFIG_SPARSEMEM_STATIC is not set
CONFIG_SPLIT_PTLOCK_CPUS=4
CONFIG_RESOURCES_64BIT=y
CONFIG_ZONE_DMA_FLAG=1
CONFIG_BOUNCE=y
CONFIG_VIRT_TO_BUS=y
CONFIG_NR_CPUS=2
CONFIG_PHYSICAL_ALIGN=0x20
CONFIG_HOTPLUG_CPU=y
CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y
CONFIG_HPET_TIMER=y
CONFIG_HPET_EMULATE_RTC=y
# CONFIG_IOMMU is not set
# CONFIG_CALGARY_IOMMU is not set
CONFIG_X86_MCE=y
CONFIG_X86_MCE_INTEL=y
# CONFIG_X86_MCE_AMD is not set
CONFIG_KEXEC=y
CONFIG_CRASH_DUMP=y
CONFIG_RELOCATABLE=y
CONFIG_PHYSICAL_START=0x20
# CONFIG_SECCOMP is not set
# CONFIG_CC_STACKPROTECTOR is not set
CONFIG_HZ_100=y
# CONFIG_HZ_250 is not set
# CONFIG_HZ_300 is not set
# CONFIG_HZ_1000 is not set
CONFIG_HZ=100
CONFIG_GENERIC_HARDIRQS=y
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_ISA_DMA_API=y

Re: lmbench ctxsw regression with CFS

2007-08-02 Thread Ingo Molnar

* Nick Piggin <[EMAIL PROTECTED]> wrote:

> > One thing to check out is whether the lmbench numbers are "correct". 
> > Especially on SMP systems, the lmbench numbers are actually *best* 
> > when the two processes run on the same CPU, even though that's not 
> > really at all the best scheduling - it's just that it artificially 
> > improves lmbench numbers because of the close cache affinity for the 
> > pipe data structures.
> 
> Yes, I bound them to a single core.

could you send me the .config you used?

Ingo
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: lmbench ctxsw regression with CFS

2007-08-02 Thread Ingo Molnar

* Nick Piggin <[EMAIL PROTECTED]> wrote:

> > > > One thing to check out is whether the lmbench numbers are "correct".
> > > > Especially on SMP systems, the lmbench numbers are actually *best*
> > > > when the two processes run on the same CPU, even though that's not
> > > > really at all the best scheduling - it's just that it artificially
> > > > improves lmbench numbers because of the close cache affinity for the
> > > > pipe data structures.
> > >
> > > Yes, I bound them to a single core.
> >
> > could you send me the .config you used?
>
> Sure, attached...
>
> You don't see a regression? If not, then can you send me the .config
> you used? [...]

i used your config to get a few numbers and to see what happens. Here's 
the numbers of 10 consecutive lat_ctx -s 0 2 runs:

[ time in micro-seconds, smaller is better ]

v2.6.22 v2.6.23-git  v2.6.23-git+const-param
--- ---  ---
 1.30  1.60   1.19
 1.30  1.36   1.18
 1.14  1.50   1.01
 1.26  1.27   1.23
 1.22  1.40   1.04
 1.13  1.34   1.09
 1.27  1.39   1.05
 1.20  1.30   1.16
 1.20  1.17   1.16
 1.25  1.33   1.01
  ------------------------------------------------------------
   avg:   1.22  1.36 (+11.3%)  1.11 (-10.3%)
   min:   1.13  1.17 ( +3.5%)  1.01 (-11.8%)
   max:   1.27  1.60 (+26.0%)  1.23 ( -3.2%)

one reason for the extra overhead is the current tunability of CFS, but 
that is not fundamental, it's caused by the many knobs that CFS has at 
the moment. The const-tuning patch (attached below, results in the 
rightmost column) changes those knobs to constants, allowing the 
compiler to optimize the math better and reduce code size. (the code 
movement in the patch makes up for most of its size, the change that it 
does is simple otherwise.)

so CFS can be faster at micro-context-switching than 2.6.22. But, at 
this point i'd also like to warn against putting _too_ much emphasis on 
lat_ctx numbers in general. lat_ctx prints a 'derived' micro-benchmark 
number. It uses a pair of pipes to context-switch between tasks but only 
prints the delta overhead that context-switching causes. The 'full' 
latency of the pipe operations can be seen via the following pipe-test.c 
code:

    http://redhat.com/~mingo/cfs-scheduler/tools/pipe-test.c

run it to see the full cost:

    neptune:~> ./pipe-test
    4.67 usecs/loop.
    4.41 usecs/loop.
    4.46 usecs/loop.
    4.46 usecs/loop.
    4.44 usecs/loop.
    4.41 usecs/loop.

so the _full_ cost, of even this micro-benchmark, is 4-5 microseconds, 
not 1 microsecond. So even this artificial micro-benchmark sees an 
actual slowdown of only 2.8%.

if you check a macro-benchmark like "hackbench 50":

 [ time in seconds, smaller is better ]

 v2.6.22  v2.6.23-cfs
 ---  ---
  3.019  2.842
  2.994  2.878
  2.977  2.882
  3.012  2.864
  2.996  2.882

then the difference is even starker because with CFS the _quality_ of 
scheduling decisions has increased. So even if we had increased 
micro-costs (which we won't have once the current tuning period is over 
and we cast the CFS parameters into constants), the quality of 
macro-scheduling can offset that, and not only on the desktop!

so that's why our main focus in CFS was on the macro-properties of 
scheduling _first_, and then the micro-properties are adjusted to the 
macro-constraints as a second layer.

Ingo

-
---
 include/linux/sched.h |2 
 kernel/sched.c|  143 +-
 kernel/sched_fair.c   |   27 +
 kernel/sched_rt.c |   10 ---
 4 files changed, 92 insertions(+), 90 deletions(-)

Index: linux/include/linux/sched.h
===
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -1396,6 +1396,7 @@ static inline void idle_task_exit(void) 
 
 extern void sched_idle_next(void);
 
+#ifdef CONFIG_SCHED_DEBUG
 extern unsigned int sysctl_sched_granularity;
 extern unsigned int sysctl_sched_wakeup_granularity;
 extern unsigned int sysctl_sched_batch_wakeup_granularity;
@@ -1403,6 +1404,7 @@ extern unsigned int sysctl_sched_stat_gr
 extern unsigned int sysctl_sched_runtime_limit;
 extern unsigned int sysctl_sched_child_runs_first;
 extern unsigned int sysctl_sched_features;
+#endif
 
 #ifdef 

Re: lmbench ctxsw regression with CFS

2007-08-02 Thread Nick Piggin
On Thu, Aug 02, 2007 at 05:44:47PM +0200, Ingo Molnar wrote:
> 
> * Nick Piggin <[EMAIL PROTECTED]> wrote:
> 
> > > > > One thing to check out is whether the lmbench numbers are
> > > > > "correct". Especially on SMP systems, the lmbench numbers are
> > > > > actually *best* when the two processes run on the same CPU, even
> > > > > though that's not really at all the best scheduling - it's just
> > > > > that it artificially improves lmbench numbers because of the
> > > > > close cache affinity for the pipe data structures.
> > > >
> > > > Yes, I bound them to a single core.
> > >
> > > could you send me the .config you used?
> >
> > Sure, attached...
> >
> > You don't see a regression? If not, then can you send me the .config
> > you used? [...]
> 
> i used your config to get a few numbers and to see what happens. Here's 
> the numbers of 10 consecutive lat_ctx -s 0 2 runs:
> 
> [ time in micro-seconds, smaller is better ]
> 
>  v2.6.22 v2.6.23-git  v2.6.23-git+const-param
>  --- ---  ---
>   1.30  1.60   1.19
>   1.30  1.36   1.18
>   1.14  1.50   1.01
>   1.26  1.27   1.23
>   1.22  1.40   1.04
>   1.13  1.34   1.09
>   1.27  1.39   1.05
>   1.20  1.30   1.16
>   1.20  1.17   1.16
>   1.25  1.33   1.01
>   ------------------------------------------------------------
>    avg:   1.22  1.36 (+11.3%)  1.11 (-10.3%)
>    min:   1.13  1.17 ( +3.5%)  1.01 (-11.8%)
>    max:   1.27  1.60 (+26.0%)  1.23 ( -3.2%)
> 
> one reason for the extra overhead is the current tunability of CFS, but 
> that is not fundamental, it's caused by the many knobs that CFS has at 
> the moment. The const-tuning patch (attached below, results in the 
> rightmost column) changes those knobs to constants, allowing the 
> compiler to optimize the math better and reduce code size. (the code 
> movement in the patch makes up for most of its size, the change that it 
> does is simple otherwise.)

[...]

Oh good. Thanks for getting to the bottom of it. We have normally
disliked too many runtime tunables in the scheduler, so I assume
these are mostly going away or under a CONFIG option for 2.6.23?
Or...?

What CPU did you get these numbers on? Do the indirect calls hurt
much on those without an indirect predictor? (I'll try running some
tests).

I must say that I don't really like the indirect calls a great deal,
and they could be eliminated just with a couple of branches and
direct calls.


Re: lmbench ctxsw regression with CFS

2007-08-01 Thread Nick Piggin
On Wed, Aug 01, 2007 at 07:31:26PM -0700, Linus Torvalds wrote:
> 
> 
> On Thu, 2 Aug 2007, Nick Piggin wrote:
> > 
> > lmbench 3 lat_ctx context switching time with 2 processes bound to a
> > single core increases by between 25%-35% on my Core2 system (didn't do
> > enough runs to get more significance, but it is around 30%). The problem
> > bisected to the main CFS commit.
> 
> One thing to check out is whether the lmbench numbers are "correct". 
> Especially on SMP systems, the lmbench numbers are actually *best* when 
> the two processes run on the same CPU, even though that's not really at 
> all the best scheduling - it's just that it artificially improves lmbench 
> numbers because of the close cache affinity for the pipe data structures.

Yes, I bound them to a single core.


> So when running the lmbench scheduling benchmarks on SMP, it actually 
> makes sense to run them *pinned* to one CPU, because then you see the true 
> scheduler performance. Otherwise you easily get noise due to balancing 
> issues, and a clearly better scheduler can in fact generate worse 
> numbers for lmbench.
> 
> Did you do that? It's at least worth testing. I'm not saying it's the case 
> here, but it's one reason why lmbench3 has the option to either keep 
> processes on the same CPU or force them to spread out (and both cases are 
> very interesting for scheduler testing, and tell different things: the 
> "pin them to the same CPU" shows the latency on one runqueue, while the 
> "pin them to different CPU's" shows the latency of a remote wakeup).
> 
> IOW, while we used the lmbench scheduling benchmark pretty extensively in 
> early scheduler tuning, if you select the defaults ("let the system just 
> schedule processes on any CPU") the end result really isn't necessarily a 
> very meaningful value: getting the best lmbench numbers actually requires 
> you to do things that tend to be actively *bad* in real life.
> 
> Of course, a perfect scheduler would notice when two tasks are *so* 
> > closely related and only do synchronous wakeups, that it would keep them on 
> the same core, and get the best possible scores for lmbench, while not 
> doing that for other real-life situations. So with a *really* smart 
> scheduler, lmbench numbers would always be optimal, but I'm not sure 
> aiming for that kind of perfection is even worth it!

Agreed with all your comments on multiprocessor balancing, but that
was eliminated in these tests. Remote wakeup latency is another thing
I want to test, but it isn't so interesting until the serial regression
is fixed.


Re: lmbench ctxsw regression with CFS

2007-08-01 Thread Linus Torvalds


On Thu, 2 Aug 2007, Nick Piggin wrote:
> 
> lmbench 3 lat_ctx context switching time with 2 processes bound to a
> single core increases by between 25%-35% on my Core2 system (didn't do
> enough runs to get more significance, but it is around 30%). The problem
> bisected to the main CFS commit.

One thing to check out is whether the lmbench numbers are "correct". 
Especially on SMP systems, the lmbench numbers are actually *best* when 
the two processes run on the same CPU, even though that's not really at 
all the best scheduling - it's just that it artificially improves lmbench 
numbers because of the close cache affinity for the pipe data structures.

So when running the lmbench scheduling benchmarks on SMP, it actually 
makes sense to run them *pinned* to one CPU, because then you see the true 
scheduler performance. Otherwise you easily get noise due to balancing 
issues, and a clearly better scheduler can in fact generate worse 
numbers for lmbench.

Did you do that? It's at least worth testing. I'm not saying it's the case 
here, but it's one reason why lmbench3 has the option to either keep 
processes on the same CPU or force them to spread out (and both cases are 
very interesting for scheduler testing, and tell different things: the 
"pin them to the same CPU" shows the latency on one runqueue, while the 
"pin them to different CPU's" shows the latency of a remote wakeup).

IOW, while we used the lmbench scheduling benchmark pretty extensively in 
early scheduler tuning, if you select the defaults ("let the system just 
schedule processes on any CPU") the end result really isn't necessarily a 
very meaningful value: getting the best lmbench numbers actually requires 
you to do things that tend to be actively *bad* in real life.

Of course, a perfect scheduler would notice when two tasks are *so* 
closely related and only do synchronous wakeups, that it would keep them on 
the same core, and get the best possible scores for lmbench, while not 
doing that for other real-life situations. So with a *really* smart 
scheduler, lmbench numbers would always be optimal, but I'm not sure 
aiming for that kind of perfection is even worth it!

Linus


lmbench ctxsw regression with CFS

2007-08-01 Thread Nick Piggin
Hi,

I didn't follow all of the scheduler debates and flamewars, so apologies
if this was already covered. Anyway.

lmbench 3 lat_ctx context switching time with 2 processes bound to a
single core increases by between 25%-35% on my Core2 system (didn't do
enough runs to get more significance, but it is around 30%). The problem
bisected to the main CFS commit.

I was really hoping that a smaller runqueue data structure could actually
increase performance with the common case of small numbers of tasks :(

I assume this was a known issue before CFS was merged. Do you know what is
causing the slowdown? Any plans to fix it?

Thanks,
Nick


