Re: [git] CFS-devel, group scheduler, fixes

2007-09-24 Thread Tong Li

On Mon, 24 Sep 2007, Peter Zijlstra wrote:


On Mon, 24 Sep 2007 13:22:14 +0200 Mike Galbraith <[EMAIL PROTECTED]> wrote:


On Mon, 2007-09-24 at 12:42 +0200, Mike Galbraith wrote:

On Mon, 2007-09-24 at 12:24 +0200, Peter Zijlstra wrote:


how about something like:

 s64 delta = (s64)(vruntime - min_vruntime);
 if (delta > 0)
  min_vruntime += delta;

That would rid us of most of the funny conditionals there.


That still left me with negative min_vruntimes.  The pinned hogs didn't
lock my box up, but I quickly got the below, so hastily killed it.


Shouldn't the max() in place_entity() be the max_vruntime() that my
lysdexia told me it was when I looked at it earlier? ;-)



probably, my tree doesn't have that max anymore so I'm not sure.



Last time I looked, I thought the max() in place_entity() was fine, and the 
problem causing the negative vruntimes seemed to be in set_task_cpu():


	if (likely(new_rq->cfs.min_vruntime))
		p->se.vruntime -= old_rq->cfs.min_vruntime -
				new_rq->cfs.min_vruntime;

I think it's fine that we get rid of sync_vruntime(). I need to think more 
about how to make global fairness work based on this--it seems to be more 
complicated than if we had sync_vruntime().


  tong
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [git] CFS-devel, group scheduler, fixes

2007-09-24 Thread Peter Zijlstra
On Mon, 24 Sep 2007 13:22:14 +0200 Mike Galbraith <[EMAIL PROTECTED]> wrote:

> On Mon, 2007-09-24 at 12:42 +0200, Mike Galbraith wrote:
> > On Mon, 2007-09-24 at 12:24 +0200, Peter Zijlstra wrote:
> > 
> > > how about something like:
> > > 
> > >  s64 delta = (s64)(vruntime - min_vruntime);
> > >  if (delta > 0)
> > >   min_vruntime += delta;
> > > 
> > > That would rid us of most of the funny conditionals there.
> > 
> > That still left me with negative min_vruntimes.  The pinned hogs didn't
> > lock my box up, but I quickly got the below, so hastily killed it.
> 
> Shouldn't the max() in place_entity() be the max_vruntime() that my
> lysdexia told me it was when I looked at it earlier? ;-)
> 

probably, my tree doesn't have that max anymore so I'm not sure.


Re: [git] CFS-devel, group scheduler, fixes

2007-09-24 Thread Mike Galbraith
On Mon, 2007-09-24 at 13:08 +0200, Peter Zijlstra wrote:

> It's perfectly valid for min_vruntime to exist in 1ULL << 63.

But a backward timewarp at the wrap is what's killing my box.

-Mike



Re: [git] CFS-devel, group scheduler, fixes

2007-09-24 Thread Mike Galbraith
On Mon, 2007-09-24 at 12:42 +0200, Mike Galbraith wrote:
> On Mon, 2007-09-24 at 12:24 +0200, Peter Zijlstra wrote:
> 
> > how about something like:
> > 
> >  s64 delta = (s64)(vruntime - min_vruntime);
> >  if (delta > 0)
> >   min_vruntime += delta;
> > 
> > That would rid us of most of the funny conditionals there.
> 
> That still left me with negative min_vruntimes.  The pinned hogs didn't
> lock my box up, but I quickly got the below, so hastily killed it.

Shouldn't the max() in place_entity() be the max_vruntime() that my
lysdexia told me it was when I looked at it earlier? ;-)

-Mike



Re: [git] CFS-devel, group scheduler, fixes

2007-09-24 Thread Peter Zijlstra
On Mon, 24 Sep 2007 12:42:15 +0200 Mike Galbraith <[EMAIL PROTECTED]> wrote:

> On Mon, 2007-09-24 at 12:24 +0200, Peter Zijlstra wrote:
> 
> > how about something like:
> > 
> >  s64 delta = (s64)(vruntime - min_vruntime);
> >  if (delta > 0)
> >   min_vruntime += delta;
> > 
> > That would rid us of most of the funny conditionals there.
> 
> That still left me with negative min_vruntimes.  The pinned hogs didn't
> lock my box up, but I quickly got the below, so hastily killed it.
> 
> se.wait_max  : 7.846949
> se.wait_max  :   301.951601
> se.wait_max  : 7.071359
>

Odd, the idea (which I think is clear) is that min_vruntime can wrap
around the u64 spectrum. And by using min_vruntime as offset to base
the key around, we get a signed but limited range key-space. (because
we update min_vruntime to be the leftmost task (in a monotonic fashion))

So I'm having trouble with these patches, that is, both your wrap
around condition of:

  if (likely(new_rq->cfs.min_vruntime))

as well as the last patchlet:

  if (((s64)vruntime > (s64)min_vruntime) ||

in that neither of these changes makes sense in what it's trying to do.

It's perfectly valid for min_vruntime to exist in 1ULL << 63.



Re: [git] CFS-devel, group scheduler, fixes

2007-09-24 Thread Mike Galbraith
On Mon, 2007-09-24 at 12:24 +0200, Peter Zijlstra wrote:

> how about something like:
> 
>  s64 delta = (s64)(vruntime - min_vruntime);
>  if (delta > 0)
>   min_vruntime += delta;
> 
> That would rid us of most of the funny conditionals there.

That still left me with negative min_vruntimes.  The pinned hogs didn't
lock my box up, but I quickly got the below, so hastily killed it.

se.wait_max  : 7.846949
se.wait_max  :   301.951601
se.wait_max  : 7.071359

-Mike



Re: [git] CFS-devel, group scheduler, fixes

2007-09-24 Thread Peter Zijlstra
On Mon, 24 Sep 2007 12:10:09 +0200 Mike Galbraith <[EMAIL PROTECTED]> wrote:


> @@ -117,7 +117,7 @@ static inline struct task_struct *task_o
>  static inline u64
>  max_vruntime(u64 min_vruntime, u64 vruntime)
>  {
> - if ((vruntime > min_vruntime) ||
> + if (((s64)vruntime > (s64)min_vruntime) ||
>   (min_vruntime > (1ULL << 61) && vruntime < (1ULL << 50)))
>   min_vruntime = vruntime;
>  

how about something like:

 s64 delta = (s64)(vruntime - min_vruntime);
 if (delta > 0)
  min_vruntime += delta;

That would rid us of most of the funny conditionals there.


Re: [git] CFS-devel, group scheduler, fixes

2007-09-24 Thread Mike Galbraith
On Sun, 2007-09-23 at 23:21 -0700, Tong Li wrote: 
> On Sun, 23 Sep 2007, Mike Galbraith wrote:
> 
> > On Sat, 2007-09-22 at 12:01 +0200, Mike Galbraith wrote:
> >> On Fri, 2007-09-21 at 20:27 -0700, Tong Li wrote:
> >>> Mike,
> >>>
> >>> Could you try this patch to see if it solves the latency problem?
> >>
> >> No, but it helps some when running two un-pinned busy loops, one at nice
> >> 0, and the other at nice 19.  Yesterday I hit latencies of up to 1.2
> >> _seconds_ doing this, and logging sched_debug and /proc/`pidof
> >> Xorg`/sched from SCHED_RR shells.
> >
> > Looking at a log (snippet attached) from this morning with the last hunk
> > of your patch still intact, it looks like min_vruntime is being modified
> > after a task is queued.  If you look at the snippet, you'll see the nice
> > 19 bash busy loop on CPU1 with a vruntime of 3010385.345325, and one
> > second later on CPU1 with its vruntime at 2814952.425082, but
> > min_vruntime is 3061874.838356.
> 
> I think this could be what was happening: between the two seconds, CPU 0 
> becomes idle and it pulls the nice 19 task over via pull_task(), which 
> calls set_task_cpu(), which changes the task's vruntime to the current 
> min_vruntime of CPU 0 (in my patch). Then, after set_task_cpu(), CPU 0 
> calls activate_task(), which calls enqueue_task() and in turn 
> update_curr(). Now, nr_running on CPU 0 is 0, so sync_vruntime() gets 
> called and CPU 0's min_vruntime gets synced to the system max. Thus, the 
> nice 19 task now has a vruntime less than CPU 0's min_vruntime. I think 
> this can be fixed by adding the following code in set_task_cpu() before we 
> adjust p->vruntime:
> 
> if (!new_rq->cfs.nr_running)
>   sync_vruntime(new_rq);

Hmm.  I can imagine Mondo-Boxen-R-Us folks getting upset with that.
Better would be, like Ingo said, to see if we can toss sync_vruntime(), and
I've been playing with that...

I found something this morning, and as usual, the darn thing turned out
to be dirt simple.  With sync_vruntime() disabled, I found queues with
negative min_vruntime right from boot, and went hunting.  Adding some
instrumentation to set_task_cpu() (annoying consequences), I found the
below:

Legend:
vrun: task's vruntime
old: old queue's min_vruntime
new: new queue's min_vruntime
result: what's gonna happen

[   60.214508] kseriod vrun: 1427596999 old: 15070988657 new: 4065818654 res: -9577573004
[  218.274661] konqueror vrun: 342076210254 old: 658982966338 new: 219203403598 res: -97703352486
[  218.284657] init vrun: 411638725179 old: 659187735208 new: 219203492472 res: -28345517557
[...]

A task which hasn't run in long enough for queues to have progressed
further than its vruntime is going to end up with a negative vruntime.
Looking at place_entity(), it looks like it's supposed to fix that up,
but due to the order of arguments passed to max_vruntime(), and the
unsigned comparison therein, it won't.

Running with the patchlet below, my box _so far_ has not become
terminally unhappy despite spread0.  I'm writing this with the pinned
hogs test running right now, and all is well, so I _think_ it might be
ok to just remove sync_vruntime() after all.

diff -uprNX /root/dontdiff git/linux-2.6.sched-devel/kernel/sched_fair.c linux-2.6.23-rc7.d/kernel/sched_fair.c
--- git/linux-2.6.sched-devel/kernel/sched_fair.c	2007-09-23 14:48:18.000000000 +0200
+++ linux-2.6.23-rc7.d/kernel/sched_fair.c	2007-09-24 11:02:05.000000000 +0200
@@ -117,7 +117,7 @@ static inline struct task_struct *task_o
 static inline u64
 max_vruntime(u64 min_vruntime, u64 vruntime)
 {
-   if ((vruntime > min_vruntime) ||
+   if (((s64)vruntime > (s64)min_vruntime) ||
(min_vruntime > (1ULL << 61) && vruntime < (1ULL << 50)))
min_vruntime = vruntime;
 
@@ -310,7 +310,7 @@ static void update_curr(struct cfs_rq *c
unsigned long delta_exec;
 
if (unlikely(!cfs_rq->nr_running))
-   return sync_vruntime(cfs_rq);
+   return ;//sync_vruntime(cfs_rq);
if (unlikely(!curr))
return;
 




Re: [git] CFS-devel, group scheduler, fixes

2007-09-24 Thread Tong Li

On Sun, 23 Sep 2007, Mike Galbraith wrote:


On Sat, 2007-09-22 at 12:01 +0200, Mike Galbraith wrote:

On Fri, 2007-09-21 at 20:27 -0700, Tong Li wrote:

Mike,

Could you try this patch to see if it solves the latency problem?


No, but it helps some when running two un-pinned busy loops, one at nice
0, and the other at nice 19.  Yesterday I hit latencies of up to 1.2
_seconds_ doing this, and logging sched_debug and /proc/`pidof
Xorg`/sched from SCHED_RR shells.


Looking at a log (snippet attached) from this morning with the last hunk
of your patch still intact, it looks like min_vruntime is being modified
after a task is queued.  If you look at the snippet, you'll see the nice
19 bash busy loop on CPU1 with a vruntime of 3010385.345325, and one
second later on CPU1 with its vruntime at 2814952.425082, but
min_vruntime is 3061874.838356.


I think this could be what was happening: between the two seconds, CPU 0 
becomes idle and it pulls the nice 19 task over via pull_task(), which 
calls set_task_cpu(), which changes the task's vruntime to the current 
min_vruntime of CPU 0 (in my patch). Then, after set_task_cpu(), CPU 0 
calls activate_task(), which calls enqueue_task() and in turn 
update_curr(). Now, nr_running on CPU 0 is 0, so sync_vruntime() gets 
called and CPU 0's min_vruntime gets synced to the system max. Thus, the 
nice 19 task now has a vruntime less than CPU 0's min_vruntime. I think 
this can be fixed by adding the following code in set_task_cpu() before we 
adjust p->vruntime:


	if (!new_rq->cfs.nr_running)
		sync_vruntime(new_rq);


 static void sync_vruntime(struct cfs_rq *cfs_rq)
 {
 	struct rq *rq;
-	int cpu;
+	int cpu, wrote = 0;
 
 	for_each_online_cpu(cpu) {
 		rq = &per_cpu(runqueues, cpu);
+		if (spin_is_locked(&rq->lock))
+			continue;
+		smp_wmb();
 		cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime,
 						rq->cfs.min_vruntime);
+		wrote++;
 	}
-	schedstat_inc(cfs_rq, nr_sync_min_vruntime);
+	if (wrote)
+		schedstat_inc(cfs_rq, nr_sync_min_vruntime);
 }


I think this rq->lock check works because it prevents the above scenario 
(CPU 0 is in pull_task so it must hold the rq lock). But my concern is 
that it may be too conservative, since sync_vruntime is called by 
update_curr, which mostly gets called in enqueue_task() and 
dequeue_task(), both of which are often invoked with the rq lock being 
held. Thus, if we don't allow sync_vruntime when rq lock is held, the sync 
will occur much less frequently and thus weaken cross-CPU fairness.


  tong



Re: [git] CFS-devel, group scheduler, fixes

2007-09-23 Thread Mike Galbraith
Hi,

I haven't chased down the exact scenario, but using a min_vruntime which
is about to change definitely seems to be what's causing my latency
woes.  Does the below cure your fairness woes as well?

(first bit is just some debug format changes for your convenience if you
try it)

diff -uprNX /root/dontdiff git/linux-2.6.sched-devel/kernel/sched_debug.c linux-2.6.23-rc7.d/kernel/sched_debug.c
--- git/linux-2.6.sched-devel/kernel/sched_debug.c	2007-09-22 13:37:32.000000000 +0200
+++ linux-2.6.23-rc7.d/kernel/sched_debug.c	2007-09-21 12:12:27.000000000 +0200
@@ -67,7 +67,7 @@ print_task(struct seq_file *m, struct rq
(long long)(p->nvcsw + p->nivcsw),
p->prio);
 #ifdef CONFIG_SCHEDSTATS
-   SEQ_printf(m, "%15Ld.%06ld %15Ld.%06ld %15Ld.%06ld\n",
+   SEQ_printf(m, "%9Ld.%06ld %9Ld.%06ld %9Ld.%06ld\n",
SPLIT_NS(p->se.vruntime),
SPLIT_NS(p->se.sum_exec_runtime),
SPLIT_NS(p->se.sum_sleep_runtime));
@@ -83,10 +83,10 @@ static void print_rq(struct seq_file *m,
 
SEQ_printf(m,
"\nrunnable tasks:\n"
-   "task   PIDtree-key  switches  prio"
-   "exec-runtimesum-exec   sum-sleep\n"
+   "task   PID tree-key  switches  prio"
+   " exec-runtime sum-execsum-sleep\n"
"--"
-   "");
+   "\n");
 
	read_lock_irq(&tasklist_lock);
 
@@ -134,7 +134,7 @@ void print_cfs_rq(struct seq_file *m, in
spread0 = min_vruntime - rq0_min_vruntime;
SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "spread0",
SPLIT_NS(spread0));
-   SEQ_printf(m, "  .%-30s: %ld\n", "spread0",
+   SEQ_printf(m, "  .%-30s: %ld\n", "nr_sync_min_vruntime",
cfs_rq->nr_sync_min_vruntime);
 }
 
diff -uprNX /root/dontdiff git/linux-2.6.sched-devel/kernel/sched_fair.c linux-2.6.23-rc7.d/kernel/sched_fair.c
--- git/linux-2.6.sched-devel/kernel/sched_fair.c	2007-09-22 13:37:32.000000000 +0200
+++ linux-2.6.23-rc7.d/kernel/sched_fair.c	2007-09-23 12:53:11.000000000 +0200
@@ -290,14 +290,19 @@ __update_curr(struct cfs_rq *cfs_rq, str
 static void sync_vruntime(struct cfs_rq *cfs_rq)
 {
struct rq *rq;
-   int cpu;
+   int cpu, wrote = 0;
 
 	for_each_online_cpu(cpu) {
 		rq = &per_cpu(runqueues, cpu);
+		if (!spin_trylock(&rq->lock))
+			continue;
 		cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime,
 						rq->cfs.min_vruntime);
+		spin_unlock(&rq->lock);
+		wrote++;
 	}
-   schedstat_inc(cfs_rq, nr_sync_min_vruntime);
+   if (wrote)
+   schedstat_inc(cfs_rq, nr_sync_min_vruntime);
 }
 
 static void update_curr(struct cfs_rq *cfs_rq)
@@ -306,8 +311,10 @@ static void update_curr(struct cfs_rq *c
u64 now = rq_of(cfs_rq)->clock;
unsigned long delta_exec;
 
-   if (unlikely(!curr))
+   if (unlikely(!cfs_rq->nr_running))
return sync_vruntime(cfs_rq);
+   if (unlikely(!curr))
+   return;
 
/*
 * Get the amount of time the current task was running




Re: [git] CFS-devel, group scheduler, fixes

2007-09-23 Thread Mike Galbraith
On Sat, 2007-09-22 at 12:01 +0200, Mike Galbraith wrote:
> On Fri, 2007-09-21 at 20:27 -0700, Tong Li wrote:
> > Mike,
> > 
> > Could you try this patch to see if it solves the latency problem?
> 
> No, but it helps some when running two un-pinned busy loops, one at nice
> 0, and the other at nice 19.  Yesterday I hit latencies of up to 1.2
> _seconds_ doing this, and logging sched_debug and /proc/`pidof
> Xorg`/sched from SCHED_RR shells.

Looking at a log (snippet attached) from this morning with the last hunk
of your patch still intact, it looks like min_vruntime is being modified
after a task is queued.  If you look at the snippet, you'll see the nice
19 bash busy loop on CPU1 with a vruntime of 3010385.345325, and one
second later on CPU1 with its vruntime at 2814952.425082, but
min_vruntime is 3061874.838356.

I took a hammer to it, and my latencies running this test went away.

diff -uprNX /root/dontdiff git/linux-2.6.sched-devel/kernel/sched_fair.c linux-2.6.23-rc7.d/kernel/sched_fair.c
--- git/linux-2.6.sched-devel/kernel/sched_fair.c   2007-09-22 13:37:32.0 +0200
+++ linux-2.6.23-rc7.d/kernel/sched_fair.c  2007-09-23 08:29:38.0 +0200
@@ -290,14 +290,19 @@ __update_curr(struct cfs_rq *cfs_rq, str
 static void sync_vruntime(struct cfs_rq *cfs_rq)
 {
struct rq *rq;
-   int cpu;
+   int cpu, wrote = 0;
 
for_each_online_cpu(cpu) {
rq = &per_cpu(runqueues, cpu);
+   if (spin_is_locked(&rq->lock))
+   continue;
+   smp_wmb();
cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime,
rq->cfs.min_vruntime);
+   wrote++;
}
-   schedstat_inc(cfs_rq, nr_sync_min_vruntime);
+   if (wrote)
+   schedstat_inc(cfs_rq, nr_sync_min_vruntime);
 }
 
 static void update_curr(struct cfs_rq *cfs_rq)
@@ -306,8 +311,10 @@ static void update_curr(struct cfs_rq *c
u64 now = rq_of(cfs_rq)->clock;
unsigned long delta_exec;
 
-   if (unlikely(!curr))
+   if (unlikely(!cfs_rq->nr_running))
return sync_vruntime(cfs_rq);
+   if (unlikely(!curr))
+   return;
 
/*
 * Get the amount of time the current task was running

Sched Debug Version: v0.05-v20, 2.6.23-rc7-smp-d #57
now at 682203.468012 msecs

cpu#0, 2992.611 MHz
  .nr_running: 1
  .load  : 1024
  .nr_switches   : 196616
  .nr_load_updates   : 118674
  .nr_uninterruptible: 4294966769
  .jiffies   : 382203
  .next_balance  : 0.382295
  .curr->pid : 6322
  .clock : 118656.313628
  .idle_clock: 0.00
  .prev_clock_raw: 705696.529916
  .clock_warps   : 0
  .clock_overflows   : 171665
  .clock_deep_idle_events: 0
  .clock_max_delta   : 0.999848
  .cpu_load[0]   : 178561
  .cpu_load[1]   : 134173
  .cpu_load[2]   : 78694
  .cpu_load[3]   : 42634
  .cpu_load[4]   : 22523

cfs_rq
  .exec_clock: 71557.235360
  .MIN_vruntime  : 0.01
  .min_vruntime  : 2798832.982741
  .max_vruntime  : 0.01
  .spread: 0.00
  .spread0   : 0.00
  .nr_sync_min_vruntime  : 76120

runnable tasks:
task   PID tree-key  switches  prio exec-runtime sum-exec    sum-sleep
------------------------------------------------------------------------------
R   bash  6322   2798832.982741   603   120   2798832.982741   4423.956624   18419.152603

cpu#1, 2992.611 MHz
  .nr_running: 2
  .load  : 177537
  .nr_switches   : 154505
  .nr_load_updates   : 121679
  .nr_uninterruptible: 527
  .jiffies   : 382203
  .next_balance  : 0.382234
  .curr->pid : 6633
  .clock : 121660.572179
  .idle_clock: 0.00
  .prev_clock_raw: 705696.537576
  .clock_warps   : 0
  .clock_overflows   : 127876
  .clock_deep_idle_events: 0
  .clock_max_delta   : 0.999848
  .cpu_load[0]   : 177537
  .cpu_load[1]   : 155347
  .cpu_load[2]   : 102646
  .cpu_load[3]   : 58613
  .cpu_load[4]   : 31265

cfs_rq
  .exec_clock: 32989.995528
  .MIN_vruntime  : 3010385.345325
  .min_vruntime  : 3010385.345325
  .max_vruntime  : 3010385.345325
  .spread: 0.00


Re: [git] CFS-devel, group scheduler, fixes

2007-09-23 Thread Mike Galbraith
Hi,

I haven't chased down the exact scenario, but using a min_vruntime which
is about to change definitely seems to be what's causing my latency
woes.  Does the below cure your fairness woes as well?

(first bit is just some debug format changes for your convenience if you
try it)

diff -uprNX /root/dontdiff git/linux-2.6.sched-devel/kernel/sched_debug.c linux-2.6.23-rc7.d/kernel/sched_debug.c
--- git/linux-2.6.sched-devel/kernel/sched_debug.c  2007-09-22 13:37:32.0 +0200
+++ linux-2.6.23-rc7.d/kernel/sched_debug.c 2007-09-21 12:12:27.0 +0200
@@ -67,7 +67,7 @@ print_task(struct seq_file *m, struct rq
(long long)(p->nvcsw + p->nivcsw),
p->prio);
 #ifdef CONFIG_SCHEDSTATS
-   SEQ_printf(m, "%15Ld.%06ld %15Ld.%06ld %15Ld.%06ld\n",
+   SEQ_printf(m, "%9Ld.%06ld %9Ld.%06ld %9Ld.%06ld\n",
SPLIT_NS(p->se.vruntime),
SPLIT_NS(p->se.sum_exec_runtime),
SPLIT_NS(p->se.sum_sleep_runtime));
@@ -83,10 +83,10 @@ static void print_rq(struct seq_file *m,
 
SEQ_printf(m,
"\nrunnable tasks:\n"
-   "            task   PID        tree-key  switches  prio"
-   "     exec-runtime        sum-exec       sum-sleep\n"
+   "            task   PID tree-key  switches  prio"
+   " exec-runtime sum-exec    sum-sleep\n"
"------------------------------------------------------"
-   "--------------------------------");
+   "--------------------------------\n");
 
read_lock_irq(&tasklist_lock);
 
@@ -134,7 +134,7 @@ void print_cfs_rq(struct seq_file *m, in
spread0 = min_vruntime - rq0_min_vruntime;
SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "spread0",
SPLIT_NS(spread0));
-   SEQ_printf(m, "  .%-30s: %ld\n", "spread0",
+   SEQ_printf(m, "  .%-30s: %ld\n", "nr_sync_min_vruntime",
cfs_rq->nr_sync_min_vruntime);
 }
 
diff -uprNX /root/dontdiff git/linux-2.6.sched-devel/kernel/sched_fair.c linux-2.6.23-rc7.d/kernel/sched_fair.c
--- git/linux-2.6.sched-devel/kernel/sched_fair.c   2007-09-22 13:37:32.0 +0200
+++ linux-2.6.23-rc7.d/kernel/sched_fair.c  2007-09-23 12:53:11.0 +0200
@@ -290,14 +290,19 @@ __update_curr(struct cfs_rq *cfs_rq, str
 static void sync_vruntime(struct cfs_rq *cfs_rq)
 {
struct rq *rq;
-   int cpu;
+   int cpu, wrote = 0;
 
for_each_online_cpu(cpu) {
rq = &per_cpu(runqueues, cpu);
+   if (!spin_trylock(&rq->lock))
+   continue;
cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime,
rq->cfs.min_vruntime);
+   spin_unlock(&rq->lock);
+   wrote++;
}
-   schedstat_inc(cfs_rq, nr_sync_min_vruntime);
+   if (wrote)
+   schedstat_inc(cfs_rq, nr_sync_min_vruntime);
 }
 
 static void update_curr(struct cfs_rq *cfs_rq)
@@ -306,8 +311,10 @@ static void update_curr(struct cfs_rq *c
u64 now = rq_of(cfs_rq)->clock;
unsigned long delta_exec;
 
-   if (unlikely(!curr))
+   if (unlikely(!cfs_rq->nr_running))
return sync_vruntime(cfs_rq);
+   if (unlikely(!curr))
+   return;
 
/*
 * Get the amount of time the current task was running




Re: [git] CFS-devel, group scheduler, fixes

2007-09-22 Thread Mike Galbraith
On Fri, 2007-09-21 at 20:27 -0700, Tong Li wrote:
> Mike,
> 
> Could you try this patch to see if it solves the latency problem?

No, but it helps some when running two un-pinned busy loops, one at nice
0, and the other at nice 19.  Yesterday I hit latencies of up to 1.2
_seconds_ doing this, and logging sched_debug and /proc/`pidof
Xorg`/sched from SCHED_RR shells.

se.wait_max  :   164.242748
se.wait_max  :   121.996128
se.wait_max  :   194.464773
se.wait_max  :   517.425411
se.wait_max  :   131.453214
se.wait_max  :   122.984190
se.wait_max  :   111.729274
se.wait_max  :   119.567580
se.wait_max  :   126.980696
se.wait_max  :   177.241452
se.wait_max  :   129.604329
se.wait_max  :   119.631657

It doesn't help at all with this oddity while running same:

[EMAIL PROTECTED]: now is
the time
fr
 aall good men to come to the aid of their country

That was a nice 0 shell window.  I'm not a great typist, but I ain't
_that_ bad :)

-Mike





Re: [git] CFS-devel, group scheduler, fixes

2007-09-21 Thread Tong Li

Mike,

Could you try this patch to see if it solves the latency problem?

  tong

Changes:

1. Modified vruntime adjustment logic in set_task_cpu(). See comments in 
code. This fixed the negative vruntime problem.


2. This code in update_curr() seems to be wrong:

if (unlikely(!curr))
return sync_vruntime(cfs_rq);

cfs_rq->curr can be NULL even if cfs_rq->nr_running is non-zero (e.g., 
when an RT task is running). We only want to call sync_vruntime when 
cfs_rq->nr_running is 0. This fixed the large latency problem (at least in 
my tests).


Signed-off-by: Tong Li <[EMAIL PROTECTED]>
---
diff -uprN linux-2.6-sched-devel-orig/kernel/sched.c linux-2.6-sched-devel/kernel/sched.c
--- linux-2.6-sched-devel-orig/kernel/sched.c   2007-09-20 12:15:41.0 -0700
+++ linux-2.6-sched-devel/kernel/sched.c    2007-09-21 19:40:08.0 -0700
@@ -1033,9 +1033,20 @@ void set_task_cpu(struct task_struct *p,
if (p->se.block_start)
p->se.block_start -= clock_offset;
 #endif
-   if (likely(new_rq->cfs.min_vruntime))
-   p->se.vruntime -= old_rq->cfs.min_vruntime -
-   new_rq->cfs.min_vruntime;
+	/*
+	 * Reset p's vruntime if it moves to new_cpu whose min_vruntime is
+	 * 100,000,000 (equivalent to 100ms for nice-0 tasks) larger or
+	 * smaller than p's vruntime. This improves interactivity when
+	 * pinned and unpinned tasks co-exist. For example, pinning a few
+	 * tasks to a CPU can cause its min_vruntime much smaller than the
+	 * other CPUs. If a task moves to this CPU, its vruntime can be so
+	 * large it won't be scheduled until the locally pinned tasks'
+	 * vruntimes catch up, causing large delays.
+	 */
+	if (unlikely(old_cpu != new_cpu && p->se.vruntime &&
+		(p->se.vruntime > new_rq->cfs.min_vruntime + 1 ||
+		 p->se.vruntime + 1 < new_rq->cfs.min_vruntime)))
+		p->se.vruntime = new_rq->cfs.min_vruntime;
__set_task_cpu(p, new_cpu);
 }
@@ -1599,6 +1610,7 @@ static void __sched_fork(struct task_str
p->se.exec_start = 0;
p->se.sum_exec_runtime   = 0;
p->se.prev_sum_exec_runtime  = 0;
+   p->se.vruntime   = 0;

 #ifdef CONFIG_SCHEDSTATS
p->se.wait_start = 0;
diff -uprN linux-2.6-sched-devel-orig/kernel/sched_fair.c linux-2.6-sched-devel/kernel/sched_fair.c
--- linux-2.6-sched-devel-orig/kernel/sched_fair.c  2007-09-20 12:15:41.0 -0700
+++ linux-2.6-sched-devel/kernel/sched_fair.c   2007-09-21 17:23:09.0 -0700
@@ -306,8 +306,10 @@ static void update_curr(struct cfs_rq *c
u64 now = rq_of(cfs_rq)->clock;
unsigned long delta_exec;

-   if (unlikely(!curr))
+   if (unlikely(!cfs_rq->nr_running))
return sync_vruntime(cfs_rq);
+   if (unlikely(!curr))
+   return;

/*
 * Get the amount of time the current task was running




Re: [git] CFS-devel, group scheduler, fixes

2007-09-20 Thread Willy Tarreau
On Fri, Sep 21, 2007 at 04:40:55AM +0200, Mike Galbraith wrote:
> On Thu, 2007-09-20 at 21:48 +0200, Willy Tarreau wrote:
> 
> > I don't know if this is relevant, but 4294966399 in nr_uninterruptible
> > for cpu#0 equals -897, exactly the negation of cpu1.nr_uninterruptible.
> > I don't know if this rings a bell for someone or if it's a completely
> > useless comment, but just in case...
> 
> A task can block on one cpu, and wake up on another, which isn't
> tracked, hence the fishy looking numbers.  The true nr_uninterruptible
> is the sum of all.

OK, thanks Mike for this clarification.

Willy



Re: [git] CFS-devel, group scheduler, fixes

2007-09-20 Thread Mike Galbraith
On Thu, 2007-09-20 at 21:48 +0200, Willy Tarreau wrote:

> I don't know if this is relevant, but 4294966399 in nr_uninterruptible
> for cpu#0 equals -897, exactly the negation of cpu1.nr_uninterruptible.
> I don't know if this rings a bell for someone or if it's a completely
> useless comment, but just in case...

A task can block on one cpu, and wake up on another, which isn't
tracked, hence the fishy looking numbers.  The true nr_uninterruptible
is the sum of all.

-Mike



Re: [git] CFS-devel, group scheduler, fixes

2007-09-20 Thread Willy Tarreau
On Thu, Sep 20, 2007 at 09:15:07AM +0200, Mike Galbraith wrote:
> But, I did just manage to trigger some horrid behavior, and log it.  I
> modified the kernel to print task's actual tree key instead of their
> current vruntime, and was watching that while make -j2 was running (and
> not seeing anything very interesting), when on a lark, I restarted
> SuSE's system updater thingy.  That beast chews 100% CPU for so long at
> startup that I long ago got annoyed, and changed it to run at nice 19.
> Anyway, when it started, interactivity went to hell in the proverbial
> hand-basket, and the sched_debug log shows some interesting results..
> like spread0 hitting -13659412644, and cc1 being keyed at -3867063305.
> 
> cpu#0, 2992.608 MHz
>   .nr_running: 4
>   .load  : 4096
>   .nr_switches   : 1105882
>   .nr_load_updates   : 735146
>   .nr_uninterruptible: 4294966399

(...)

> cpu#1, 2992.608 MHz
>   .nr_running: 5
>   .load  : 6208
>   .nr_switches   : 1012995
>   .nr_load_updates   : 747540
>   .nr_uninterruptible: 897

I don't know if this is relevant, but 4294966399 in nr_uninterruptible
for cpu#0 equals -897, exactly the negation of cpu1.nr_uninterruptible.
I don't know if this rings a bell for someone or if it's a completely
useless comment, but just in case...

HTH,
Willy



Re: [git] CFS-devel, group scheduler, fixes

2007-09-20 Thread Mike Galbraith
On Thu, 2007-09-20 at 09:51 +0200, Ingo Molnar wrote:
> * Mike Galbraith <[EMAIL PROTECTED]> wrote:
> 
> > [...]  I modified the kernel to print task's actual tree key instead 
> > of their current vruntime, [...]
> 
> btw., that looks like a debug printout bug in sched-devel.git - could 
> you send me your fix? I've pushed out the latest sched-devel (ontop of 
> -rc7) to the usual place:

Well, I temporarily added a key field, which is only used to store the
key for debug...

>git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched.git
> 
> (note that there are some debug printout updates which might clash with 
> your fix)

I'll pull/build this and test my migration problem there.  All I have to
do is to add a nice 19 chew-max to my make -j2, and all hell breaks
loose.  Always suspecting myself first in the search for dirty rotten
SOB who broke local scheduler :) I nuked the migration fix.  Looks like
I'm not the SOB this time.. it got much worse, with max Xorg latency of
>500ms.

-Mike



Re: [git] CFS-devel, group scheduler, fixes

2007-09-20 Thread Ingo Molnar

* Mike Galbraith <[EMAIL PROTECTED]> wrote:

> [...]  I modified the kernel to print task's actual tree key instead 
> of their current vruntime, [...]

btw., that looks like a debug printout bug in sched-devel.git - could 
you send me your fix? I've pushed out the latest sched-devel (ontop of 
-rc7) to the usual place:

   git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched.git

(note that there are some debug printout updates which might clash with 
your fix)

Ingo


Re: [git] CFS-devel, group scheduler, fixes

2007-09-20 Thread Mike Galbraith
On Thu, 2007-09-20 at 06:55 +0200, Mike Galbraith wrote:
> On Wed, 2007-09-19 at 10:06 -0700, Tong Li wrote:
> 
> > Were the experiments run on a 2-CPU system?
> 
> Yes.
> 
> >  When Xorg experiences large 
> > wait time, is it on the same CPU that has the two pinned tasks? If this is 
> > the case, then the problem could be X somehow advanced faster and got a 
> > larger vruntime than the two pinned tasks, so it had to wait for the 
> > pinned ones to catch up before it got a chance to be scheduled.
> 
> Good question.  I've not yet been able to capture any event where
> vruntimes are that far apart in sched_debug.

But, I did just manage to trigger some horrid behavior, and log it.  I
modified the kernel to print task's actual tree key instead of their
current vruntime, and was watching that while make -j2 was running (and
not seeing anything very interesting), when on a lark, I restarted
SuSE's system updater thingy.  That beast chews 100% CPU for so long at
startup that I long ago got annoyed, and changed it to run at nice 19.
Anyway, when it started, interactivity went to hell in the proverbial
hand-basket, and the sched_debug log shows some interesting results..
like spread0 hitting -13659412644, and cc1 being keyed at -3867063305.

cpu#0, 2992.608 MHz
  .nr_running: 4
  .load  : 4096
  .nr_switches   : 1105882
  .nr_load_updates   : 735146
  .nr_uninterruptible: 4294966399
  .jiffies   : 1936201
  .next_balance  : 1936276
  .curr->pid : 27004
  .clock : 735034802877
  .idle_clock: 0
  .prev_clock_raw: 2259213083886
  .clock_warps   : 0
  .clock_overflows   : 677631
  .clock_deep_idle_events: 0
  .clock_max_delta   : 999848
  .cpu_load[0]   : 4096
  .cpu_load[1]   : 3960
  .cpu_load[2]   : 3814
  .cpu_load[3]   : 3539
  .cpu_load[4]   : 3153

cfs_rq
  .exec_clock: 623175870707
  .MIN_vruntime  : 3791173262297
  .min_vruntime  : 3791173259723
  .max_vruntime  : 3791173264579
  .spread: 2282
  .spread0   : 0
  .nr_sync_min_vruntime  : 296756

runnable tasks:
task   PIDtree-key  switches  prioexec-runtime
sum-exec   sum-sleep
--
bash  63383212   554   120   3791173262935  
 104271360645224098563
Rcc1 27004-7509103 6   120   3791165750620  
  1539602913300466
make 270064856 6   120   3791173264579  
   9540879 6458208
make 270102574 3   120   3791173262297  
   8897680   0

cpu#1, 2992.608 MHz
  .nr_running: 5
  .load  : 6208
  .nr_switches   : 1012995
  .nr_load_updates   : 747540
  .nr_uninterruptible: 897
  .jiffies   : 1936201
  .next_balance  : 1936303
  .curr->pid : 27012
  .clock : 747426624818
  .idle_clock: 0
  .prev_clock_raw: 2259213140042
  .clock_warps   : 0
  .clock_overflows   : 582091
  .clock_deep_idle_events: 0
  .clock_max_delta   : 999848
  .cpu_load[0]   : 6208
  .cpu_load[1]   : 4545
  .cpu_load[2]   : 3810
  .cpu_load[3]   : 3065
  .cpu_load[4]   : 2397

cfs_rq
  .exec_clock: 581940255731
  .MIN_vruntime  : 3791835890838
  .min_vruntime  : 3791854778322
  .max_vruntime  : 3791902652145
  .spread: 66761307
  .spread0   : 681518599
  .nr_sync_min_vruntime  : 256743

runnable tasks:
task   PIDtree-key  switches  prioexec-runtime
sum-exec   sum-sleep
--
Xorg  5740   -18887484157025   115   3791835890838  
  46597617638567698716
   update-status 2609348184396  1537   139   3791902652145  
5709334592  2622161813
  sh 27011-6808461 2   120   3791847969861  
   4162855 1457101
Rcat 27012   -19965525  

Re: [git] CFS-devel, group scheduler, fixes

2007-09-20 Thread Mike Galbraith
On Thu, 2007-09-20 at 06:55 +0200, Mike Galbraith wrote:
 On Wed, 2007-09-19 at 10:06 -0700, Tong Li wrote:
 
  Were the experiments run on a 2-CPU system?
 
 Yes.
 
   When Xorg experiences large 
  wait time, is it on the same CPU that has the two pinned tasks? If this is 
  the case, then the problem could be X somehow advanced faster and got a 
  larger vruntime then the two pinned tasks, so it had to wait for the 
  pinned ones to catch up before it got a chance to be scheduled.
 
 Good question.  I've not yet been able to capture any event where
 vruntimes are that far apart in sched_debug.

But, I did just manage to trigger some horrid behavior, and log it.  I
modified the kernel to print task's actual tree key instead of their
current vruntime, and was watching that while make -j2 was running (and
not seeing anything very interesting), when on a lark, I restarted
SuSE's system updater thingy.  That beast chews 100% CPU for so long at
startup that I long ago got annoyed, and changed it to run at nice 19.
Anyway, when it started, interactivity went to hell in the proverbial
hand-basket, and the sched_debug log shows some interesting results..
like spread0 hitting -13659412644, and cc1 being keyed at -3867063305.

cpu#0, 2992.608 MHz
  .nr_running: 4
  .load  : 4096
  .nr_switches   : 1105882
  .nr_load_updates   : 735146
  .nr_uninterruptible: 4294966399
  .jiffies   : 1936201
  .next_balance  : 1936276
  .curr-pid : 27004
  .clock : 735034802877
  .idle_clock: 0
  .prev_clock_raw: 2259213083886
  .clock_warps   : 0
  .clock_overflows   : 677631
  .clock_deep_idle_events: 0
  .clock_max_delta   : 999848
  .cpu_load[0]   : 4096
  .cpu_load[1]   : 3960
  .cpu_load[2]   : 3814
  .cpu_load[3]   : 3539
  .cpu_load[4]   : 3153

cfs_rq
  .exec_clock: 623175870707
  .MIN_vruntime  : 3791173262297
  .min_vruntime  : 3791173259723
  .max_vruntime  : 3791173264579
  .spread: 2282
  .spread0   : 0
  .nr_sync_min_vruntime  : 296756

runnable tasks:
task   PIDtree-key  switches  prioexec-runtime
sum-exec   sum-sleep
--
bash  63383212   554   120   3791173262935  
 104271360645224098563
Rcc1 27004-7509103 6   120   3791165750620  
  1539602913300466
make 270064856 6   120   3791173264579  
   9540879 6458208
make 270102574 3   120   3791173262297  
   8897680   0

cpu#1, 2992.608 MHz
  .nr_running: 5
  .load  : 6208
  .nr_switches   : 1012995
  .nr_load_updates   : 747540
  .nr_uninterruptible: 897
  .jiffies   : 1936201
  .next_balance  : 1936303
  .curr-pid : 27012
  .clock : 747426624818
  .idle_clock: 0
  .prev_clock_raw: 2259213140042
  .clock_warps   : 0
  .clock_overflows   : 582091
  .clock_deep_idle_events: 0
  .clock_max_delta   : 999848
  .cpu_load[0]   : 6208
  .cpu_load[1]   : 4545
  .cpu_load[2]   : 3810
  .cpu_load[3]   : 3065
  .cpu_load[4]   : 2397

cfs_rq
  .exec_clock: 581940255731
  .MIN_vruntime  : 3791835890838
  .min_vruntime  : 3791854778322
  .max_vruntime  : 3791902652145
  .spread: 66761307
  .spread0   : 681518599
  .nr_sync_min_vruntime  : 256743

runnable tasks:
task   PIDtree-key  switches  prioexec-runtime
sum-exec   sum-sleep
--
Xorg  5740   -18887484157025   115   3791835890838  
  46597617638567698716
   update-status 2609348184396  1537   139   3791902652145  
5709334592  2622161813
  sh 27011-6808461 2   120   3791847969861  
   4162855 1457101
Rcat 27012   -19965525 3   120   

Re: [git] CFS-devel, group scheduler, fixes

2007-09-20 Thread Ingo Molnar

* Mike Galbraith [EMAIL PROTECTED] wrote:

 [...]  I modified the kernel to print task's actual tree key instead 
 of their current vruntime, [...]

btw., that looks like a debug printout bug in sched-devel.git - could 
you send me your fix? I've pushed out the latest sched-devel (ontop of 
-rc7) to the usual place:

   git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched.git

(note that there are some debug printout updates which might clash with 
your fix)

Ingo
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [git] CFS-devel, group scheduler, fixes

2007-09-20 Thread Mike Galbraith
On Thu, 2007-09-20 at 09:51 +0200, Ingo Molnar wrote:
> * Mike Galbraith <[EMAIL PROTECTED]> wrote:
> 
> > [...]  I modified the kernel to print task's actual tree key instead 
> > of their current vruntime, [...]
> 
> btw., that looks like a debug printout bug in sched-devel.git - could 
> you send me your fix? I've pushed out the latest sched-devel (ontop of 
> -rc7) to the usual place:

Well, I temporarily added a key field, which is only used to store the
key for debug...

> git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched.git
> 
> (note that there are some debug printout updates which might clash with 
> your fix)

I'll pull/build this and test my migration problem there.  All I have to
do is add a nice 19 chew-max to my make -j2, and all hell breaks loose.
Always suspecting myself first in the search for the dirty rotten SOB who
broke the local scheduler :) I nuked the migration fix.  Looks like I'm
not the SOB this time.. it got much worse, with a max Xorg latency of
500ms.

-Mike



Re: [git] CFS-devel, group scheduler, fixes

2007-09-20 Thread Willy Tarreau
On Thu, Sep 20, 2007 at 09:15:07AM +0200, Mike Galbraith wrote:
> But, I did just manage to trigger some horrid behavior, and log it.  I
> modified the kernel to print task's actual tree key instead of their
> current vruntime, and was watching that while make -j2 was running (and
> not seeing anything very interesting), when on a lark, I restarted
> SuSE's system updater thingy.  That beast chews 100% CPU for so long at
> startup that I long ago got annoyed, and changed it to run at nice 19.
> Anyway, when it started, interactivity went to hell in the proverbial
> hand-basket, and the sched_debug log shows some interesting results..
> like spread0 hitting -13659412644, and cc1 being keyed at -3867063305.
> 
> cpu#0, 2992.608 MHz
>   .nr_running: 4
>   .load  : 4096
>   .nr_switches   : 1105882
>   .nr_load_updates   : 735146
>   .nr_uninterruptible: 4294966399

(...)

> cpu#1, 2992.608 MHz
>   .nr_running: 5
>   .load  : 6208
>   .nr_switches   : 1012995
>   .nr_load_updates   : 747540
>   .nr_uninterruptible: 897

I don't know if this is relevant, but 4294966399 in nr_uninterruptible
for cpu#0 equals -897, exactly the negation of cpu1.nr_uninterruptible.
I don't know if this rings a bell for someone or if it's a completely
useless comment, but just in case...

HTH,
Willy



Re: [git] CFS-devel, group scheduler, fixes

2007-09-20 Thread Mike Galbraith
On Thu, 2007-09-20 at 21:48 +0200, Willy Tarreau wrote:

> I don't know if this is relevant, but 4294966399 in nr_uninterruptible
> for cpu#0 equals -897, exactly the negation of cpu1.nr_uninterruptible.
> I don't know if this rings a bell for someone or if it's a completely
> useless comment, but just in case...

A task can block on one cpu, and wake up on another, which isn't
tracked, hence the fishy looking numbers.  The true nr_uninterruptible
is the sum of all.

-Mike



Re: [git] CFS-devel, group scheduler, fixes

2007-09-20 Thread Willy Tarreau
On Fri, Sep 21, 2007 at 04:40:55AM +0200, Mike Galbraith wrote:
> On Thu, 2007-09-20 at 21:48 +0200, Willy Tarreau wrote:
> 
> > I don't know if this is relevant, but 4294966399 in nr_uninterruptible
> > for cpu#0 equals -897, exactly the negation of cpu1.nr_uninterruptible.
> > I don't know if this rings a bell for someone or if it's a completely
> > useless comment, but just in case...
> 
> A task can block on one cpu, and wake up on another, which isn't
> tracked, hence the fishy looking numbers.  The true nr_uninterruptible
> is the sum of all.

OK, thanks Mike for this clarification.

Willy



Re: [git] CFS-devel, group scheduler, fixes

2007-09-19 Thread Mike Galbraith
On Wed, 2007-09-19 at 10:06 -0700, Tong Li wrote:

> Were the experiments run on a 2-CPU system?

Yes.

>  When Xorg experiences large 
> wait time, is it on the same CPU that has the two pinned tasks? If this is 
> the case, then the problem could be X somehow advanced faster and got a 
> larger vruntime than the two pinned tasks, so it had to wait for the 
> pinned ones to catch up before it got a chance to be scheduled.

Good question.  I've not yet been able to capture any event where
vruntimes are that far apart in sched_debug.

-Mike



Re: [git] CFS-devel, group scheduler, fixes

2007-09-19 Thread Tong Li

On Wed, 19 Sep 2007, Siddha, Suresh B wrote:

> On Tue, Sep 18, 2007 at 11:03:59PM -0700, Tong Li wrote:
> > This patch attempts to improve CFS's SMP global fairness based on the new
> > virtual time design.
> >
> > Removed vruntime adjustment in set_task_cpu() as it skews global fairness.
> >
> > Modified small_imbalance logic in find_busiest_group(). If there's small
> > imbalance, move tasks from busiest to local sched_group only if the local
> > group contains a CPU whose min_vruntime is the maximum among all CPUs in
> > the same sched_domain. This prevents any CPU from advancing too far ahead
> > in virtual time and avoids tasks thrashing between two CPUs without
> > utilizing other CPUs in the system. For example, for 10 tasks on 8 CPUs,
> > since the load is not evenly divisible by the number of CPUs, we want the
> > extra load to have a fair use of every CPU in the system.
> >
> > Tested with a microbenchmark running 10 nice-0 tasks on 8 CPUs. Each task 
>
> Just as an experiment, can you run 82 tasks on 8 CPUs? Current
> imbalance_pct logic will not detect and fix the global fairness issue
> even with this patch.

Right. For 82 tasks on 8 CPUs, the max error is 7.07% and min is -14.12%. 
It could be fixed by removing the imbalance_pct check (i.e., making 
imbalance_pct effectively 1). The cost would be more migrations, so I 
wasn't sure if that was the approach people would agree on.

> >   if (*imbalance < busiest_load_per_task) {
> > - unsigned long tmp, pwr_now, pwr_move;
> > - unsigned int imbn;
> > -
> >  small_imbalance:
> > - pwr_move = pwr_now = 0;
> > - imbn = 2;
> > - if (this_nr_running) {
> > - this_load_per_task /= this_nr_running;
> > - if (busiest_load_per_task > this_load_per_task)
> > - imbn = 1;
> > - } else
> > - this_load_per_task = SCHED_LOAD_SCALE;
> > -
> > - if (max_load - this_load + SCHED_LOAD_SCALE_FUZZ >=
> > - busiest_load_per_task * imbn) {
> > - *imbalance = busiest_load_per_task;
> > - return busiest;
> > - }
>
> This patch removes quite a few lines and all this logic is not for
> fairness :( Especially the above portion handles some of the HT/MC
> optimizations.

What are the optimizations? I was trying to simplify the code to use only 
vruntime to control balancing when there's small imbalance, but if it 
breaks non-fairness related optimizations, we can certainly add the code 
back.


  tong


Re: [git] CFS-devel, group scheduler, fixes

2007-09-19 Thread Siddha, Suresh B
On Tue, Sep 18, 2007 at 11:03:59PM -0700, Tong Li wrote:
> This patch attempts to improve CFS's SMP global fairness based on the new 
> virtual time design.
> 
> Removed vruntime adjustment in set_task_cpu() as it skews global fairness.
> 
> Modified small_imbalance logic in find_busiest_group(). If there's small 
> imbalance, move tasks from busiest to local sched_group only if the local 
> group contains a CPU whose min_vruntime is the maximum among all CPUs in 
> the same sched_domain. This prevents any CPU from advancing too far ahead 
> in virtual time and avoids tasks thrashing between two CPUs without 
> utilizing other CPUs in the system. For example, for 10 tasks on 8 CPUs, 
> since the load is not evenly divisible by the number of CPUs, we want the 
> extra load to have a fair use of every CPU in the system.
> 
> Tested with a microbenchmark running 10 nice-0 tasks on 8 CPUs. Each task 

Just as an experiment, can you run 82 tasks on 8 CPUs? Current
imbalance_pct logic will not detect and fix the global fairness issue
even with this patch.

>   if (*imbalance < busiest_load_per_task) {
> - unsigned long tmp, pwr_now, pwr_move;
> - unsigned int imbn;
> -
>  small_imbalance:
> - pwr_move = pwr_now = 0;
> - imbn = 2;
> - if (this_nr_running) {
> - this_load_per_task /= this_nr_running;
> - if (busiest_load_per_task > this_load_per_task)
> - imbn = 1;
> - } else
> - this_load_per_task = SCHED_LOAD_SCALE;
> -
> - if (max_load - this_load + SCHED_LOAD_SCALE_FUZZ >=
> - busiest_load_per_task * imbn) {
> - *imbalance = busiest_load_per_task;
> - return busiest;
> - }

This patch removes quite a few lines and all this logic is not for
fairness :( Especially the above portion handles some of the HT/MC
optimizations.

> -
> - /*
> -  * OK, we don't have enough imbalance to justify moving tasks,
> -  * however we may be able to increase total CPU power used by
> -  * moving them.
> + /* 
> +  * When there's small imbalance, move tasks only if this
> +  * sched_group contains a CPU whose min_vruntime is the 
> +  * maximum among all CPUs in the same domain.
>*/
> -
> - pwr_now += busiest->__cpu_power *
> - min(busiest_load_per_task, max_load);
> - pwr_now += this->__cpu_power *
> - min(this_load_per_task, this_load);
> - pwr_now /= SCHED_LOAD_SCALE;
> -
> - /* Amount of load we'd subtract */
> - tmp = sg_div_cpu_power(busiest,
> - busiest_load_per_task * SCHED_LOAD_SCALE);
> - if (max_load > tmp)
> - pwr_move += busiest->__cpu_power *
> - min(busiest_load_per_task, max_load - tmp);
> -
> - /* Amount of load we'd add */
> - if (max_load * busiest->__cpu_power <
> - busiest_load_per_task * SCHED_LOAD_SCALE)
> - tmp = sg_div_cpu_power(this,
> - max_load * busiest->__cpu_power);
> - else
> - tmp = sg_div_cpu_power(this,
> - busiest_load_per_task * SCHED_LOAD_SCALE);
> - pwr_move += this->__cpu_power *
> - min(this_load_per_task, this_load + tmp);
> - pwr_move /= SCHED_LOAD_SCALE;
> -
> - /* Move if we gain throughput */
> - if (pwr_move > pwr_now)
> + if (max_vruntime_group == this)
>   *imbalance = busiest_load_per_task;
> + else
> + *imbalance = 0;

Not sure how this all interacts when some of the cpu's are idle. I have to
look more closely.

thanks,
suresh


Re: [git] CFS-devel, group scheduler, fixes

2007-09-19 Thread Tong Li

On Wed, 19 Sep 2007, Mike Galbraith wrote:

> On Wed, 2007-09-19 at 09:51 +0200, Mike Galbraith wrote:
> 
> > The scenario which was previously cured was this:
> > taskset -c 1 nice -n 0 ./massive_intr 2 &
> > taskset -c 1 nice -n 5 ./massive_intr 2 &
> > click link
> > (http://pages.cs.wisc.edu/~shubu/talks/cachescrub-prdc2004.ppt) to bring
> > up browser and OpenOffice Impress.
> > 
> > Xorg (at nice -5 + above scenario) latency samples:
> > se.wait_max  :57985337
> > se.wait_max  :25163510
> > se.wait_max  :37005538
> > se.wait_max  :66986511
> > se.wait_max  :53990868
> > se.wait_max  :80976761
> > se.wait_max  :96967501
> > se.wait_max  :80989254
> > se.wait_max  :53990897
> > se.wait_max  :   181963905
> > se.wait_max  :85985181
> 
> To be doubly sure of the effect on the pinned tasks + migrating Xorg
> scenario, I just ran the above test 10 times with virgin devel source.
> Maximum Xorg latency was 20ms.
> 
> -Mike

Yeah, the patch was a first attempt at getting better global fairness for 
unpinned tasks. I expected there'd be latency problems when unpinned and 
pinned tasks co-exist. The original code for vruntime adjustment in 
set_task_cpu() helped alleviate this problem, so removing it in my patch 
would definitely bring the problem back. On the other hand, leaving it 
there broke global fairness in my fairness benchmark. So we'd need a 
better compromise.

Were the experiments run on a 2-CPU system? When Xorg experiences large 
wait time, is it on the same CPU that has the two pinned tasks? If this is 
the case, then the problem could be X somehow advanced faster and got a 
larger vruntime than the two pinned tasks, so it had to wait for the 
pinned ones to catch up before it got a chance to be scheduled.


  tong


Re: [git] CFS-devel, group scheduler, fixes

2007-09-19 Thread Mike Galbraith
On Wed, 2007-09-19 at 09:51 +0200, Mike Galbraith wrote:

> The scenario which was previously cured was this:
> taskset -c 1 nice -n 0 ./massive_intr 2 &
> taskset -c 1 nice -n 5 ./massive_intr 2 &
> click link
> (http://pages.cs.wisc.edu/~shubu/talks/cachescrub-prdc2004.ppt) to bring
> up browser and OpenOffice Impress.
> 
> Xorg (at nice -5 + above scenario) latency samples:
> se.wait_max  :57985337
> se.wait_max  :25163510
> se.wait_max  :37005538
> se.wait_max  :66986511
> se.wait_max  :53990868
> se.wait_max  :80976761
> se.wait_max  :96967501
> se.wait_max  :80989254
> se.wait_max  :53990897
> se.wait_max  :   181963905
> se.wait_max  :85985181

To be doubly sure of the effect on the pinned tasks + migrating Xorg
scenario, I just ran the above test 10 times with virgin devel source.
Maximum Xorg latency was 20ms.

-Mike



Re: [git] CFS-devel, group scheduler, fixes

2007-09-19 Thread Mike Galbraith
On Wed, 2007-09-19 at 08:28 +0200, Mike Galbraith wrote:
> Greetings,
> 
> On Tue, 2007-09-18 at 23:03 -0700, Tong Li wrote:
> > This patch attempts to improve CFS's SMP global fairness based on the new 
> > virtual time design.
> > 
> > Removed vruntime adjustment in set_task_cpu() as it skews global fairness.
> 
> Since I'm (still) encountering Xorg latency issues (which go away if
> load is hefty instead of light) even with that migration adjustment and
> synchronization, and am having difficulty nailing it down to a specific
> event, I'll test this immediately.

(had to apply manually to freshly pulled tree)

Drat.  This didn't cure the latency hits with a Xorg at nice -5 running
with a make -j2 at nice 0, but seems to have reinstated a latency issue
which was previously cured.

Xorg 1 sec. max latency samples:
(trimmed to only show >20ms latencies)
se.wait_max  :23343582
se.wait_max  :20119460
se.wait_max  :20771573
se.wait_max  :21084567
se.wait_max  :31338500
se.wait_max  :35368148
se.wait_max  :39199642
se.wait_max  :22889062
se.wait_max  :40285501
se.wait_max  :21002720
se.wait_max  :21002266
se.wait_max  :21680578
se.wait_max  :22012913
se.wait_max  :94646331
se.wait_max  :29003693
se.wait_max  :20812613

(boot with maxcpus=1 or nail X+make to one cpu and these latencies are
gone, so it does seem to be the migration logic - why i was so
interested in testing your patch)

The scenario which was previously cured was this:
taskset -c 1 nice -n 0 ./massive_intr 2 &
taskset -c 1 nice -n 5 ./massive_intr 2 &
click link
(http://pages.cs.wisc.edu/~shubu/talks/cachescrub-prdc2004.ppt) to bring
up browser and OpenOffice Impress.

Xorg (at nice -5 + above scenario) latency samples:
se.wait_max  :57985337
se.wait_max  :25163510
se.wait_max  :37005538
se.wait_max  :66986511
se.wait_max  :53990868
se.wait_max  :80976761
se.wait_max  :96967501
se.wait_max  :80989254
se.wait_max  :53990897
se.wait_max  :   181963905
se.wait_max  :85985181

-Mike



Re: [git] CFS-devel, group scheduler, fixes

2007-09-19 Thread Mike Galbraith
Greetings,

On Tue, 2007-09-18 at 23:03 -0700, Tong Li wrote:
> This patch attempts to improve CFS's SMP global fairness based on the new 
> virtual time design.
> 
> Removed vruntime adjustment in set_task_cpu() as it skews global fairness.

Since I'm (still) encountering Xorg latency issues (which go away if
load is hefty instead of light) even with that migration adjustment and
synchronization, and am having difficulty nailing it down to a specific
event, I'll test this immediately.

-Mike



Re: [git] CFS-devel, group scheduler, fixes

2007-09-19 Thread Tong Li
This patch attempts to improve CFS's SMP global fairness based on the new 
virtual time design.


Removed vruntime adjustment in set_task_cpu() as it skews global fairness.

Modified small_imbalance logic in find_busiest_group(). If there's small 
imbalance, move tasks from busiest to local sched_group only if the local 
group contains a CPU whose min_vruntime is the maximum among all CPUs in 
the same sched_domain. This prevents any CPU from advancing too far ahead 
in virtual time and avoids tasks thrashing between two CPUs without 
utilizing other CPUs in the system. For example, for 10 tasks on 8 CPUs, 
since the load is not evenly divisible by the number of CPUs, we want the 
extra load to have a fair use of every CPU in the system.


Tested with a microbenchmark running 10 nice-0 tasks on 8 CPUs. Each task 
runs a trivial while (1) loop. The benchmark runs for 300 seconds and, at 
every T seconds, it samples for each task the following:


1. Actual CPU time the task received during the past 60 seconds.

2. Ideal CPU time it would receive under a perfect fair scheduler.

3. Lag = ideal time - actual time, where a positive lag means the task 
received less CPU time than its fair share and negative means it received 
more.


4. Error = lag / ideal time

The following shows the max and min errors among all samples for all tasks 
before and after applying the patch:


Before:

Sampling interval: 30 s
Max error: 100.00%
Min error: -25.00%

Sampling interval: 10 s
Max error: 27.62%
Min error: -25.00%

After:

Sampling interval: 30 s
Max error: 1.33%
Min error: -1.29%

Sampling interval: 10 s
Max error: 7.38%
Min error: -6.25%

The errors for the 10s sampling interval are still not as small as I had 
hoped for, but it looks like the patch does bring some improvement.


   tong

Signed-off-by: Tong Li <[EMAIL PROTECTED]>
---
--- linux-2.6-sched-devel-orig/kernel/sched.c   2007-09-15 22:00:48.0 -0700
+++ linux-2.6-sched-devel/kernel/sched.c   2007-09-18 22:10:52.0 -0700
@@ -1033,9 +1033,6 @@ void set_task_cpu(struct task_struct *p,
if (p->se.block_start)
p->se.block_start -= clock_offset;
 #endif
-   if (likely(new_rq->cfs.min_vruntime))
-   p->se.vruntime -= old_rq->cfs.min_vruntime -
-   new_rq->cfs.min_vruntime;

__set_task_cpu(p, new_cpu);
 }
@@ -1599,6 +1596,7 @@ static void __sched_fork(struct task_str
p->se.exec_start = 0;
p->se.sum_exec_runtime   = 0;
p->se.prev_sum_exec_runtime  = 0;
+   p->se.vruntime   = 0;

 #ifdef CONFIG_SCHEDSTATS
p->se.wait_start = 0;
@@ -2277,6 +2275,8 @@ find_busiest_group(struct sched_domain *
   int *sd_idle, cpumask_t *cpus, int *balance)
 {
struct sched_group *busiest = NULL, *this = NULL, *group = sd->groups;
+   struct sched_group *max_vruntime_group = NULL;
+   u64 max_vruntime = 0;
unsigned long max_load, avg_load, total_load, this_load, total_pwr;
unsigned long max_pull;
unsigned long busiest_load_per_task, busiest_nr_running;
@@ -2322,6 +2322,11 @@ find_busiest_group(struct sched_domain *

rq = cpu_rq(i);

+   if (rq->cfs.min_vruntime > max_vruntime) {
+   max_vruntime = rq->cfs.min_vruntime;
+   max_vruntime_group = group;
+   }
+
if (*sd_idle && rq->nr_running)
*sd_idle = 0;

@@ -2483,59 +2488,16 @@ group_next:
 * moved
 */
if (*imbalance < busiest_load_per_task) {
-   unsigned long tmp, pwr_now, pwr_move;
-   unsigned int imbn;
-
 small_imbalance:
-   pwr_move = pwr_now = 0;
-   imbn = 2;
-   if (this_nr_running) {
-   this_load_per_task /= this_nr_running;
-   if (busiest_load_per_task > this_load_per_task)
-   imbn = 1;
-   } else
-   this_load_per_task = SCHED_LOAD_SCALE;
-
-   if (max_load - this_load + SCHED_LOAD_SCALE_FUZZ >=
-   busiest_load_per_task * imbn) {
-   *imbalance = busiest_load_per_task;
-   return busiest;
-   }
-
-   /*
-* OK, we don't have enough imbalance to justify moving tasks,
-* however we may be able to increase total CPU power used by
-* moving them.
+		/* 
+		 * When there's small imbalance, move tasks only if this
+		 * sched_group contains a CPU whose min_vruntime is the 
+		 * maximum among all CPUs in the same domain.

 */
-
-   pwr_now += busiest->__cpu_power *
-   min(busiest_load_per_task, max_load);
-  

Re: [git] CFS-devel, group scheduler, fixes

2007-09-19 Thread Tong Li
This patch attempts to improve CFS's SMP global fairness based on the new 
virtual time design.


Removed vruntime adjustment in set_task_cpu() as it skews global fairness.

Modified small_imbalance logic in find_busiest_group(). If there's small 
imbalance, move tasks from busiest to local sched_group only if the local 
group contains a CPU whose min_vruntime is the maximum among all CPUs in 
the same sched_domain. This prevents any CPU from advancing too far ahead 
in virtual time and avoids tasks thrashing between two CPUs without 
utilizing other CPUs in the system. For example, for 10 tasks on 8 CPUs, 
since the load is not evenly divisible by the number of CPUs, we want the 
extra load to have a fair use of every CPU in the system.


Tested with a microbenchmark running 10 nice-0 tasks on 8 CPUs. Each task 
runs a trivial while (1) loop. The benchmark runs for 300 seconds and, at 
every T seconds, it samples for each task the following:


1. Actual CPU time the task received during the past 60 seconds.

2. Ideal CPU time it would receive under a perfect fair scheduler.

3. Lag = ideal time - actual time, where a positive lag means the task 
received less CPU time than its fair share and negative means it received 
more.


4. Error = lag / ideal time

The following shows the max and min errors among all samples for all tasks 
before and after applying the patch:


Before:

Sampling interval: 30 s
Max error: 100.00%
Min error: -25.00%

Sampling interval: 10 s
Max error: 27.62%
Min error: -25.00%

After:

Sampling interval: 30 s
Max error: 1.33%
Min error: -1.29%

Sampling interval: 10 s
Max error: 7.38%
Min error: -6.25%

The errors for the 10s sampling interval are still not as small as I had 
hoped for, but looks like it does have some improvement.


   tong

Signed-off-by: Tong Li [EMAIL PROTECTED]
---
--- linux-2.6-sched-devel-orig/kernel/sched.c   2007-09-15 22:00:48.0 
-0700
+++ linux-2.6-sched-devel/kernel/sched.c2007-09-18 22:10:52.0 
-0700
@@ -1033,9 +1033,6 @@ void set_task_cpu(struct task_struct *p,
if (p-se.block_start)
p-se.block_start -= clock_offset;
 #endif
-   if (likely(new_rq-cfs.min_vruntime))
-   p-se.vruntime -= old_rq-cfs.min_vruntime -
-   new_rq-cfs.min_vruntime;

__set_task_cpu(p, new_cpu);
 }
@@ -1599,6 +1596,7 @@ static void __sched_fork(struct task_str
p-se.exec_start = 0;
p-se.sum_exec_runtime   = 0;
p-se.prev_sum_exec_runtime  = 0;
+   p-se.vruntime   = 0;

 #ifdef CONFIG_SCHEDSTATS
p-se.wait_start = 0;
@@ -2277,6 +2275,8 @@ find_busiest_group(struct sched_domain *
   int *sd_idle, cpumask_t *cpus, int *balance)
 {
struct sched_group *busiest = NULL, *this = NULL, *group = sd-groups;
+   struct sched_group *max_vruntime_group = NULL;
+   u64 max_vruntime = 0;
unsigned long max_load, avg_load, total_load, this_load, total_pwr;
unsigned long max_pull;
unsigned long busiest_load_per_task, busiest_nr_running;
@@ -2322,6 +2322,11 @@ find_busiest_group(struct sched_domain *

rq = cpu_rq(i);

+   if (rq-cfs.min_vruntime  max_vruntime) {
+   max_vruntime = rq-cfs.min_vruntime;
+   max_vruntime_group = group;
+   }
+
if (*sd_idle  rq-nr_running)
*sd_idle = 0;

@@ -2483,59 +2488,16 @@ group_next:
 * moved
 */
if (*imbalance  busiest_load_per_task) {
-   unsigned long tmp, pwr_now, pwr_move;
-   unsigned int imbn;
-
 small_imbalance:
-   pwr_move = pwr_now = 0;
-   imbn = 2;
-   if (this_nr_running) {
-   this_load_per_task /= this_nr_running;
-   if (busiest_load_per_task  this_load_per_task)
-   imbn = 1;
-   } else
-   this_load_per_task = SCHED_LOAD_SCALE;
-
-   if (max_load - this_load + SCHED_LOAD_SCALE_FUZZ =
-   busiest_load_per_task * imbn) {
-   *imbalance = busiest_load_per_task;
-   return busiest;
-   }
-
-   /*
-* OK, we don't have enough imbalance to justify moving tasks,
-* however we may be able to increase total CPU power used by
-* moving them.
+		/* 
+		 * When there's small imbalance, move tasks only if this
+		 * sched_group contains a CPU whose min_vruntime is the 
+		 * maximum among all CPUs in the same domain.

 */
-
-   pwr_now += busiest-__cpu_power *
-   min(busiest_load_per_task, max_load);
-   pwr_now += 

Re: [git] CFS-devel, group scheduler, fixes

2007-09-19 Thread Mike Galbraith
Greetings,

On Tue, 2007-09-18 at 23:03 -0700, Tong Li wrote:
 This patch attempts to improve CFS's SMP global fairness based on the new 
 virtual time design.
 
 Removed vruntime adjustment in set_task_cpu() as it skews global fairness.

Since I'm (still) encountering Xorg latency issues (which go away if
load is hefty instead of light) even with that migration adjustment and
synchronization, and am having difficulty nailing it down to a specific
event, I'll test this immediately.

-Mike

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [git] CFS-devel, group scheduler, fixes

2007-09-19 Thread Mike Galbraith
On Wed, 2007-09-19 at 08:28 +0200, Mike Galbraith wrote:
 Greetings,
 
 On Tue, 2007-09-18 at 23:03 -0700, Tong Li wrote:
  This patch attempts to improve CFS's SMP global fairness based on the new 
  virtual time design.
  
  Removed vruntime adjustment in set_task_cpu() as it skews global fairness.
 
 Since I'm (still) encountering Xorg latency issues (which go away if
 load is hefty instead of light) even with that migration adjustment and
 synchronization, and am having difficulty nailing it down to a specific
 event, I'll test this immediately.

(had to apply manually to freshly pulled tree)

Drat.  This didn't cure the latency hits with a Xorg at nice -5 running
with a make -j2 at nice 0, but seems to have reinstated a latency issue
which was previously cured.

Xorg 1 sec. max latency samples:
(trimmed to only show 20ms latencies)
se.wait_max  :23343582
se.wait_max  :20119460
se.wait_max  :20771573
se.wait_max  :21084567
se.wait_max  :31338500
se.wait_max  :35368148
se.wait_max  :39199642
se.wait_max  :22889062
se.wait_max  :40285501
se.wait_max  :21002720
se.wait_max  :21002266
se.wait_max  :21680578
se.wait_max  :22012913
se.wait_max  :94646331
se.wait_max  :29003693
se.wait_max  :20812613

(boot with maxcpus=1 or nail X+make to one cpu and these latencies are
gone, so it does seem to be the migration logic - why i was so
interested in testing your patch)

The scenario which was previously cured was this:
taskset -c 1 nice -n 0 ./massive_intr 2 
taskset -c 1 nice -n 5 ./massive_intr 2 
click link
(http://pages.cs.wisc.edu/~shubu/talks/cachescrub-prdc2004.ppt) to bring
up browser and OpenOffice Impress.

Xorg (at nice -5 + above scenario) latency samples:
se.wait_max  :57985337
se.wait_max  :25163510
se.wait_max  :37005538
se.wait_max  :66986511
se.wait_max  :53990868
se.wait_max  :80976761
se.wait_max  :96967501
se.wait_max  :80989254
se.wait_max  :53990897
se.wait_max  :   181963905
se.wait_max  :85985181

-Mike

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [git] CFS-devel, group scheduler, fixes

2007-09-19 Thread Mike Galbraith
On Wed, 2007-09-19 at 09:51 +0200, Mike Galbraith wrote:

 The scenario which was previously cured was this:
 taskset -c 1 nice -n 0 ./massive_intr 2 
 taskset -c 1 nice -n 5 ./massive_intr 2 
 click link
 (http://pages.cs.wisc.edu/~shubu/talks/cachescrub-prdc2004.ppt) to bring
 up browser and OpenOffice Impress.
 
 Xorg (at nice -5 + above scenario) latency samples:
 se.wait_max  :57985337
 se.wait_max  :25163510
 se.wait_max  :37005538
 se.wait_max  :66986511
 se.wait_max  :53990868
 se.wait_max  :80976761
 se.wait_max  :96967501
 se.wait_max  :80989254
 se.wait_max  :53990897
 se.wait_max  :   181963905
 se.wait_max  :85985181

To be doubly sure of the effect on the pinned tasks + migrating Xorg
scenario, I just ran the above test 10 times with virgin devel source.
Maximum Xorg latency was 20ms.

-Mike

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [git] CFS-devel, group scheduler, fixes

2007-09-19 Thread Tong Li

On Wed, 19 Sep 2007, Mike Galbraith wrote:


On Wed, 2007-09-19 at 09:51 +0200, Mike Galbraith wrote:


The scenario which was previously cured was this:
taskset -c 1 nice -n 0 ./massive_intr 2 
taskset -c 1 nice -n 5 ./massive_intr 2 
click link
(http://pages.cs.wisc.edu/~shubu/talks/cachescrub-prdc2004.ppt) to bring
up browser and OpenOffice Impress.

Xorg (at nice -5 + above scenario) latency samples:
se.wait_max  :57985337
se.wait_max  :25163510
se.wait_max  :37005538
se.wait_max  :66986511
se.wait_max  :53990868
se.wait_max  :80976761
se.wait_max  :96967501
se.wait_max  :80989254
se.wait_max  :53990897
se.wait_max  :   181963905
se.wait_max  :85985181


To be doubly sure of the effect on the pinned tasks + migrating Xorg
scenario, I just ran the above test 10 times with virgin devel source.
Maximum Xorg latency was 20ms.

-Mike



Yeah, the patch was a first attempt at getting better global fairness for 
unpinned tasks. I expected there'd be latency problems when unpinned and 
pinned tasks co-exist. The original code for vruntime adjustment in 
set_task_cpu() helped alleviate this problem, so removing it in my patch 
would definitely bring the problem back. On the other hand, leaving it 
there broke global fairness in my fairness benchmark. So we'd need a 
better compromise.


Were the experiments run on a 2-CPU system? When Xorg experiences large 
wait time, is it on the same CPU that has the two pinned tasks? If this is 
the case, then the problem could be X somehow advanced faster and got a 
larger vruntime then the two pinned tasks, so it had to wait for the 
pinned ones to catch up before it got a chance to be scheduled.


  tong
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [git] CFS-devel, group scheduler, fixes

2007-09-19 Thread Siddha, Suresh B
On Tue, Sep 18, 2007 at 11:03:59PM -0700, Tong Li wrote:
 This patch attempts to improve CFS's SMP global fairness based on the new 
 virtual time design.
 
 Removed vruntime adjustment in set_task_cpu() as it skews global fairness.
 
 Modified small_imbalance logic in find_busiest_group(). If there's small 
 imbalance, move tasks from busiest to local sched_group only if the local 
 group contains a CPU whose min_vruntime is the maximum among all CPUs in 
 the same sched_domain. This prevents any CPU from advancing too far ahead 
 in virtual time and avoids tasks thrashing between two CPUs without 
 utilizing other CPUs in the system. For example, for 10 tasks on 8 CPUs, 
 since the load is not evenly divisible by the number of CPUs, we want the 
 extra load to have a fair use of every CPU in the system.
 
 Tested with a microbenchmark running 10 nice-0 tasks on 8 CPUs. Each task 

Just as an experiment, can you run 82 tasks on 8 CPUs? Current
imbalance_pct logic will not detect and fix the global fairness issue
even with this patch.
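
A back-of-the-envelope sketch of why imbalance_pct masks the 82-on-8 case (the imbalance_pct value of 125, i.e. a 25% threshold, is assumed from common sched_domain setups of that era, and needs_balance() is an invented helper, not a kernel function). 82 tasks settle as six CPUs with 10 tasks and two with 11, and find_busiest_group() treats the busiest group as imbalanced only when busiest_load * 100 exceeds this_load * imbalance_pct — 11 vs 10 is a 110% ratio, under the 125% threshold, so the extra tasks stay put indefinitely:

```c
#include <assert.h>

/* busiest_load and this_load are per-group load figures; the busiest
 * group is only rebalanced when it exceeds the local group's load by
 * more than the imbalance_pct threshold (in percent). */
static int needs_balance(unsigned long busiest_load, unsigned long this_load,
			 unsigned int imbalance_pct)
{
	return busiest_load * 100 > this_load * imbalance_pct;
}
```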

   if (*imbalance < busiest_load_per_task) {
 - unsigned long tmp, pwr_now, pwr_move;
 - unsigned int imbn;
 -
  small_imbalance:
 - pwr_move = pwr_now = 0;
 - imbn = 2;
 - if (this_nr_running) {
 - this_load_per_task /= this_nr_running;
 - if (busiest_load_per_task > this_load_per_task)
 - imbn = 1;
 - } else
 - this_load_per_task = SCHED_LOAD_SCALE;
 -
 - if (max_load - this_load + SCHED_LOAD_SCALE_FUZZ >=
 - busiest_load_per_task * imbn) {
 - *imbalance = busiest_load_per_task;
 - return busiest;
 - }

This patch removes quite a few lines, and all this logic is not for
fairness :( Especially the above portion handles some of the HT/MC
optimizations.

 -
 - /*
 -  * OK, we don't have enough imbalance to justify moving tasks,
 -  * however we may be able to increase total CPU power used by
 -  * moving them.
 + /* 
 +  * When there's small imbalance, move tasks only if this
 +  * sched_group contains a CPU whose min_vruntime is the 
 +  * maximum among all CPUs in the same domain.
 */
 -
 - pwr_now += busiest->__cpu_power *
 - min(busiest_load_per_task, max_load);
 - pwr_now += this->__cpu_power *
 - min(this_load_per_task, this_load);
 - pwr_now /= SCHED_LOAD_SCALE;
 -
 - /* Amount of load we'd subtract */
 - tmp = sg_div_cpu_power(busiest,
 - busiest_load_per_task * SCHED_LOAD_SCALE);
 - if (max_load > tmp)
 - pwr_move += busiest->__cpu_power *
 - min(busiest_load_per_task, max_load - tmp);
 -
 - /* Amount of load we'd add */
 - if (max_load * busiest->__cpu_power <
 - busiest_load_per_task * SCHED_LOAD_SCALE)
 - tmp = sg_div_cpu_power(this,
 - max_load * busiest->__cpu_power);
 - else
 - tmp = sg_div_cpu_power(this,
 - busiest_load_per_task * SCHED_LOAD_SCALE);
 - pwr_move += this->__cpu_power *
 - min(this_load_per_task, this_load + tmp);
 - pwr_move /= SCHED_LOAD_SCALE;
 -
 - /* Move if we gain throughput */
 - if (pwr_move > pwr_now)
 + if (max_vruntime_group == this)
   *imbalance = busiest_load_per_task;
 + else
 + *imbalance = 0;

Not sure how this all interacts when some of the cpu's are idle. I have to
look more closely.

thanks,
suresh


Re: [git] CFS-devel, group scheduler, fixes

2007-09-19 Thread Tong Li

On Wed, 19 Sep 2007, Siddha, Suresh B wrote:


On Tue, Sep 18, 2007 at 11:03:59PM -0700, Tong Li wrote:

This patch attempts to improve CFS's SMP global fairness based on the new
virtual time design.

Removed vruntime adjustment in set_task_cpu() as it skews global fairness.

Modified small_imbalance logic in find_busiest_group(). If there's small
imbalance, move tasks from busiest to local sched_group only if the local
group contains a CPU whose min_vruntime is the maximum among all CPUs in
the same sched_domain. This prevents any CPU from advancing too far ahead
in virtual time and avoids tasks thrashing between two CPUs without
utilizing other CPUs in the system. For example, for 10 tasks on 8 CPUs,
since the load is not evenly divisible by the number of CPUs, we want the
extra load to have a fair use of every CPU in the system.

Tested with a microbenchmark running 10 nice-0 tasks on 8 CPUs. Each task


Just as an experiment, can you run 82 tasks on 8 CPUs? Current
imbalance_pct logic will not detect and fix the global fairness issue
even with this patch.


Right. For 82 tasks on 8 CPUs, the max error is 7.07% and min is -14.12%. 
It could be fixed by removing the imbalance_pct check (i.e., making 
imbalance_pct effectively 1). The cost would be more migrations, so I 
wasn't sure if that was the approach people would agree on.
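
For reference, a simple way such error percentages can be derived (an assumed metric, not necessarily the benchmark's exact definition; fairness_error_pct() is an invented name): compare each task's actual CPU share with the ideal ncpus/ntasks share.

```c
#include <assert.h>

/* With ntasks equal-weight tasks on ncpus CPUs, each task ideally gets
 * ncpus/ntasks of a CPU; a task stuck on a CPU running tasks_on_cpu
 * tasks actually gets 1/tasks_on_cpu.  Relative error in percent. */
static double fairness_error_pct(int ncpus, int ntasks, int tasks_on_cpu)
{
	double ideal = (double)ncpus / ntasks;	/* e.g. 8/82 ~= 0.0976 */
	double actual = 1.0 / tasks_on_cpu;	/* e.g. 1/11 ~= 0.0909 */

	return (actual - ideal) / ideal * 100.0;
}
```

With 82 tasks on 8 CPUs and a stable 6x10 + 2x11 split, tasks on the 11-task CPUs run about 6.8% behind the ideal share and tasks on the 10-task CPUs about 2.5% ahead, the kind of systematic skew the measured errors reflect.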





if (*imbalance < busiest_load_per_task) {
-   unsigned long tmp, pwr_now, pwr_move;
-   unsigned int imbn;
-
 small_imbalance:
-   pwr_move = pwr_now = 0;
-   imbn = 2;
-   if (this_nr_running) {
-   this_load_per_task /= this_nr_running;
-   if (busiest_load_per_task > this_load_per_task)
-   imbn = 1;
-   } else
-   this_load_per_task = SCHED_LOAD_SCALE;
-
-   if (max_load - this_load + SCHED_LOAD_SCALE_FUZZ >=
-   busiest_load_per_task * imbn) {
-   *imbalance = busiest_load_per_task;
-   return busiest;
-   }


This patch removes quite a few lines, and all this logic is not for
fairness :( Especially the above portion handles some of the HT/MC
optimizations.


What are the optimizations? I was trying to simplify the code to use only 
vruntime to control balancing when there's small imbalance, but if it 
breaks non-fairness related optimizations, we can certainly add the code 
back.
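
For context, the deleted hunk implements a throughput heuristic: compute the load both groups can execute now (pwr_now) and after moving one busiest_load_per_task worth of load (pwr_move), and migrate only if total throughput grows. A simplified sketch (equal cpu_power assumed; SCHED_LOAD_SCALE scaling and sg_div_cpu_power() dropped for clarity; move_gains_throughput() and min_ul() are invented names) shows why it matters for an idle HT/MC sibling:

```c
#include <assert.h>

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/* pwr_now: load the two groups can execute before the move (each group
 * executes at most one load_per_task worth here, since cpu_power is
 * taken as equal and per-task).  pwr_move: the same after shifting one
 * load_per_task from busiest to this.  Migrating to an idle sibling
 * raises throughput even though the raw imbalance looks small. */
static int move_gains_throughput(unsigned long busiest_load,
				 unsigned long this_load,
				 unsigned long load_per_task)
{
	unsigned long pwr_now = 0, pwr_move = 0;
	unsigned long b_after, t_after;

	pwr_now += min_ul(load_per_task, busiest_load);
	pwr_now += min_ul(load_per_task, this_load);

	b_after = busiest_load > load_per_task ?
			busiest_load - load_per_task : 0;
	t_after = this_load + load_per_task;
	pwr_move += min_ul(load_per_task, b_after);
	pwr_move += min_ul(load_per_task, t_after);

	return pwr_move > pwr_now;	/* move only if we gain throughput */
}
```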


  tong


Re: [git] CFS-devel, group scheduler, fixes

2007-09-19 Thread Mike Galbraith
On Wed, 2007-09-19 at 10:06 -0700, Tong Li wrote:

 Were the experiments run on a 2-CPU system?

Yes.

  When Xorg experiences large 
 wait time, is it on the same CPU that has the two pinned tasks? If this is 
 the case, then the problem could be that X somehow advanced faster and got a 
 larger vruntime than the two pinned tasks, so it had to wait for the 
 pinned ones to catch up before it got a chance to be scheduled.

Good question.  I've not yet been able to capture any event where
vruntimes are that far apart in sched_debug.

-Mike



Re: [git] CFS-devel, group scheduler, fixes

2007-09-18 Thread Srivatsa Vaddagiri
On Tue, Sep 18, 2007 at 10:22:43PM +0200, Ingo Molnar wrote:
> (I have not tested the group scheduling bits but perhaps Srivatsa would 
> like to do that?)

Ingo,
I plan to test it today and send you any updates that may be
required.

-- 
Regards,
vatsa


Re: [git] CFS-devel, group scheduler, fixes

2007-09-18 Thread Ingo Molnar

* Dmitry Adamushko <[EMAIL PROTECTED]> wrote:

> (3)
> 
> rework enqueue/dequeue_entity() to get rid of 
> sched_class::set_curr_task(). This simplifies sched_setscheduler(), 
> rt_mutex_setprio() and sched_move_tasks().

ah. This makes your ready-queue patch a code size win. Applied.

Ingo


Re: [git] CFS-devel, group scheduler, fixes

2007-09-18 Thread Ingo Molnar

* dimm <[EMAIL PROTECTED]> wrote:

> (1)
> 
> due to the fact that we no longer keep the 'current' within the tree, 
> dequeue/enqueue_entity() is useless for the 'current' in 
> task_new_fair(). We are about to reschedule and 
> sched_class->put_prev_task() will put the 'current' back into the 
> tree, based on its new key.
> 
> Signed-off-by: Dmitry Adamushko <[EMAIL PROTECTED]>

thanks, applied.

Ingo


Re: [git] CFS-devel, group scheduler, fixes

2007-09-18 Thread Ingo Molnar

* Dmitry Adamushko <[EMAIL PROTECTED]> wrote:

> (2)
> 
> the 'p' (task_struct) parameter in the sched_class :: yield_task() is 
> redundant as the caller is always the 'current'. Get rid of it.
> 
> Signed-off-by: Dmitry Adamushko <[EMAIL PROTECTED]>

ah - good one! I completely forgot about that parameter and it's indeed 
unneeded. Applied.

Ingo


Re: [git] CFS-devel, group scheduler, fixes

2007-09-18 Thread Ingo Molnar

* dimm <[EMAIL PROTECTED]> wrote:

> [ well, don't expect to find here anything like RDCFS (no, 'D' does 
> not stand for 'dumb'!). I was focused on more prosaic things in the 
> mean time so just didn't have time for writing it.. ]
> 
> here is a few cleanup/simplification/optimization(s) based on the 
> recent modifications in the sched-dev tree.
> 
> (1) optimize task_new_fair()
> (2) simplify yield_task()
> (3) rework enqueue/dequeue_entity() to get rid of
> sched_class::set_curr_task()

the queue with your enhancements and simplifications applied looks good 
here, and it booted fine on two testboxes. I've updated the 
sched-devel.git tree:

   git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-devel.git

(below is the full shortlog over current upstream.)

(I have not tested the group scheduling bits but perhaps Srivatsa would 
like to do that?)

Ingo

-->
Dmitry Adamushko (9):
  sched: clean up struct load_stat
  sched: clean up schedstat block in dequeue_entity()
  sched: sched_setscheduler() fix
  sched: add set_curr_task() calls
  sched: do not keep current in the tree and get rid of 
sched_entity::fair_key
  sched: yield-workaround update
  sched: optimize task_new_fair()
  sched: simplify sched_class::yield_task()
  sched: rework enqueue/dequeue_entity() to get rid of set_curr_task()

Ingo Molnar (29):
  sched: fix new-task method
  sched: resched task in task_new_fair()
  sched: small sched_debug cleanup
  sched: debug: track maximum 'slice'
  sched: uniform tunings
  sched: use constants if !CONFIG_SCHED_DEBUG
  sched: remove stat_gran
  sched: remove precise CPU load
  sched: remove precise CPU load calculations #2
  sched: track cfs_rq->curr on !group-scheduling too
  sched: cleanup: simplify cfs_rq_curr() methods
  sched: uninline __enqueue_entity()/__dequeue_entity()
  sched: speed up update_load_add/_sub()
  sched: clean up calc_weighted()
  sched: introduce se->vruntime
  sched: move sched_feat() definitions
  sched: optimize vruntime based scheduling
  sched: simplify check_preempt() methods
  sched: wakeup granularity fix
  sched: add se->vruntime debugging
  sched: sync up ->min_vruntime when going idle
  sched: add more vruntime statistics
  sched: debug: update exec_clock only when SCHED_DEBUG
  sched: remove wait_runtime limit
  sched: remove wait_runtime fields and features
  sched: x86: allow single-depth wchan output
  sched: yield workaround
  sched: fix delay accounting performance regression
  sched: disable START_DEBIT

Matthias Kaehlcke (1):
  sched: use list_for_each_entry_safe() in __wake_up_common()

Mike Galbraith (1):
  sched: fix SMP migration latencies

Peter Zijlstra (7):
  sched: simplify SCHED_FEAT_* code
  sched: new task placement for vruntime
  sched: simplify adaptive latency
  sched: clean up new task placement
  sched: add tree based averages
  sched: handle vruntime overflow
  sched: better min_vruntime tracking

Srivatsa Vaddagiri (1):
  sched: group-scheduler core

 arch/i386/Kconfig     |   11 
 include/linux/sched.h |   24 -
 init/Kconfig          |    9 
 kernel/sched.c        |  544 
 kernel/sched_debug.c  |  100 ++
 kernel/sched_fair.c   |  752 --
 kernel/sched_rt.c     |    4 
 kernel/sched_stats.h  |    4 
 kernel/sysctl.c       |   35 +-
 9 files changed, 763 insertions(+), 720 deletions(-)


Re: [git] CFS-devel, group scheduler, fixes

2007-09-18 Thread Dmitry Adamushko

(3)

rework enqueue/dequeue_entity() to get rid of sched_class::set_curr_task().
This simplifies sched_setscheduler(), rt_mutex_setprio() and sched_move_tasks().

Signed-off-by : Dmitry Adamushko <[EMAIL PROTECTED]>
Signed-off-by : Srivatsa Vaddagiri <[EMAIL PROTECTED]>

---
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 3728cd6..1094804 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -870,7 +870,6 @@ struct sched_class {
struct sched_domain *sd, enum cpu_idle_type idle,
int *all_pinned, int *this_best_prio);
 
-   void (*set_curr_task) (struct rq *rq);
void (*task_tick) (struct rq *rq, struct task_struct *p);
void (*task_new) (struct rq *rq, struct task_struct *p);
 };
diff --git a/kernel/sched.c b/kernel/sched.c
index 361fad8..bc1a625 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3910,8 +3910,8 @@ EXPORT_SYMBOL(sleep_on_timeout);
  */
 void rt_mutex_setprio(struct task_struct *p, int prio)
 {
-   int oldprio, on_rq, running;
unsigned long flags;
+   int oldprio, on_rq;
struct rq *rq;
 
BUG_ON(prio < 0 || prio > MAX_PRIO);
@@ -3921,12 +3921,8 @@ void rt_mutex_setprio(struct task_struct *p, int prio)
 
oldprio = p->prio;
on_rq = p->se.on_rq;
-   running = task_running(rq, p);
-   if (on_rq) {
+   if (on_rq)
dequeue_task(rq, p, 0);
-   if (running)
-   p->sched_class->put_prev_task(rq, p);
-   }
 
if (rt_prio(prio))
p->sched_class = &rt_sched_class;
@@ -3936,15 +3932,13 @@ void rt_mutex_setprio(struct task_struct *p, int prio)
p->prio = prio;
 
if (on_rq) {
-   if (running)
-   p->sched_class->set_curr_task(rq);
enqueue_task(rq, p, 0);
/*
 * Reschedule if we are currently running on this runqueue and
 * our priority decreased, or if we are not currently running on
 * this runqueue and our priority is higher than the current's
 */
-   if (running) {
+   if (task_running(rq, p)) {
if (p->prio > oldprio)
resched_task(rq->curr);
} else {
@@ -4150,7 +4144,7 @@ __setscheduler(struct rq *rq, struct task_struct *p, int 
policy, int prio)
 int sched_setscheduler(struct task_struct *p, int policy,
   struct sched_param *param)
 {
-   int retval, oldprio, oldpolicy = -1, on_rq, running;
+   int retval, oldprio, oldpolicy = -1, on_rq;
unsigned long flags;
struct rq *rq;
 
@@ -4232,24 +4226,20 @@ recheck:
}
update_rq_clock(rq);
on_rq = p->se.on_rq;
-   running = task_running(rq, p);
-   if (on_rq) {
+   if (on_rq)
deactivate_task(rq, p, 0);
-   if (running)
-   p->sched_class->put_prev_task(rq, p);
-   }
+
oldprio = p->prio;
__setscheduler(rq, p, policy, param->sched_priority);
+
if (on_rq) {
-   if (running)
-   p->sched_class->set_curr_task(rq);
activate_task(rq, p, 0);
/*
 * Reschedule if we are currently running on this runqueue and
 * our priority decreased, or if we are not currently running on
 * this runqueue and our priority is higher than the current's
 */
-   if (running) {
+   if (task_running(rq, p)) {
if (p->prio > oldprio)
resched_task(rq->curr);
} else {
@@ -6853,19 +6843,13 @@ static void sched_move_task(struct container_subsys 
*ss, struct container *cont,
running = task_running(rq, tsk);
on_rq = tsk->se.on_rq;
 
-   if (on_rq) {
+   if (on_rq)
dequeue_task(rq, tsk, 0);
-   if (unlikely(running))
-   tsk->sched_class->put_prev_task(rq, tsk);
-   }
 
set_task_cfs_rq(tsk);
 
-   if (on_rq) {
-   if (unlikely(running))
-   tsk->sched_class->set_curr_task(rq);
+   if (on_rq)
enqueue_task(rq, tsk, 0);
-   }
 
 done:
task_rq_unlock(rq, &flags);
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index e65af8c..6539377 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -475,9 +475,20 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity 
*se, int initial)
 }
 
 static void
-enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int wakeup)
+enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
+   int wakeup, int set_curr)
 {
/*
+* In case of the 'current'.
+*/
+   if (unlikely(set_curr)) {
+   update_stats_curr_start(cfs_rq, se);

[git] CFS-devel, group scheduler, fixes

2007-09-18 Thread Dmitry Adamushko

(2)

the 'p' (task_struct) parameter in the sched_class :: yield_task()
is redundant as the caller is always the 'current'. Get rid of it.

Signed-off-by: Dmitry Adamushko <[EMAIL PROTECTED]>

---
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9fd936f..3728cd6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -857,7 +857,7 @@ struct sched_class {
 
void (*enqueue_task) (struct rq *rq, struct task_struct *p, int wakeup);
void (*dequeue_task) (struct rq *rq, struct task_struct *p, int sleep);
-   void (*yield_task) (struct rq *rq, struct task_struct *p);
+   void (*yield_task) (struct rq *rq);
 
void (*check_preempt_curr) (struct rq *rq, struct task_struct *p);
 
diff --git a/kernel/sched.c b/kernel/sched.c
index 046dae1..361fad8 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -4535,7 +4535,7 @@ asmlinkage long sys_sched_yield(void)
if (unlikely(rq->nr_running == 1))
schedstat_inc(rq, yld_act_empty);
else
-   current->sched_class->yield_task(rq, current);
+   current->sched_class->yield_task(rq);
 
/*
 * Since we are going to call schedule() anyway, there's
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 5a244e2..9b982ef 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -723,10 +723,12 @@ static void dequeue_task_fair(struct rq *rq, struct 
task_struct *p, int sleep)
 /*
  * sched_yield() support is very simple - we dequeue and enqueue
  */
-static void yield_task_fair(struct rq *rq, struct task_struct *p)
+static void yield_task_fair(struct rq *rq)
 {
+   struct task_struct *curr = rq->curr;
+   struct cfs_rq *cfs_rq = task_cfs_rq(curr);
+
if (!sysctl_sched_yield_bug_workaround) {
-   struct cfs_rq *cfs_rq = task_cfs_rq(p);
__update_rq_clock(rq);
 
/*
@@ -737,7 +739,6 @@ static void yield_task_fair(struct rq *rq, struct 
task_struct *p)
}
 
if (sysctl_sched_yield_bug_workaround == 1) {
-   struct cfs_rq *cfs_rq = task_cfs_rq(p);
struct sched_entity *next;
 
/*
@@ -757,7 +758,7 @@ static void yield_task_fair(struct rq *rq, struct 
task_struct *p)
/*
 * Minimally necessary key value to be the second in the tree:
 */
-   p->se.vruntime = next->vruntime +
+   curr->se.vruntime = next->vruntime +
sysctl_sched_yield_granularity;
 
/*
@@ -770,7 +771,7 @@ static void yield_task_fair(struct rq *rq, struct 
task_struct *p)
/*
 * Just reschedule, do nothing else:
 */
-   resched_task(p);
+   resched_task(curr);
 }
 
 /*
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 45b339f..b86944c 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -59,9 +59,9 @@ static void requeue_task_rt(struct rq *rq, struct task_struct 
*p)
 }
 
 static void
-yield_task_rt(struct rq *rq, struct task_struct *p)
+yield_task_rt(struct rq *rq)
 {
-   requeue_task_rt(rq, p);
+   requeue_task_rt(rq, rq->curr);
 }
 
 /*

---

Dmitry






Re: [git] CFS-devel, group scheduler, fixes

2007-09-18 Thread dimm

[ well, don't expect to find here anything like RDCFS
(no, 'D' does not stand for 'dumb'!). I was focused
on more prosaic things in the mean time so just
didn't have time for writing it.. ]

here is a few cleanup/simplification/optimization(s)
based on the recent modifications in the sched-dev tree.

(1) optimize task_new_fair()
(2) simplify yield_task()
(3) rework enqueue/dequeue_entity() to get rid of
sched_class::set_curr_task()

additionally, the changes somewhat decrease code size:

   text    data  bss     dec     hex filename
  43538    5398   48   48984    bf58 build/kernel/sched.o.before
  43250    5390   48   48688    be30 build/kernel/sched.o

(SMP + lots of debugging options but, I guess, in this case the diff
should remain visible for any combination).

---

(1)

due to the fact that we no longer keep the 'current' within the tree,
dequeue/enqueue_entity() is useless for the 'current' in task_new_fair().
We are about to reschedule and sched_class->put_prev_task() will put
the 'current' back into the tree, based on its new key.

Signed-off-by: Dmitry Adamushko <[EMAIL PROTECTED]>

---
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 6e52d5a..5a244e2 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -969,10 +969,11 @@ static void task_new_fair(struct rq *rq, struct 
task_struct *p)
 
if (sysctl_sched_child_runs_first &&
curr->vruntime < se->vruntime) {
-
-   dequeue_entity(cfs_rq, curr, 0);
+   /*
+* Upon rescheduling, sched_class::put_prev_task() will place
+* 'current' within the tree based on its new key value.
+*/
swap(curr->vruntime, se->vruntime);
-   enqueue_entity(cfs_rq, curr, 0);
}
 
update_stats_enqueue(cfs_rq, se);

---

Dmitry





Re: [git] CFS-devel, group scheduler, fixes

2007-09-18 Thread dimm

[ well, don't expect to find here anything like RDCFS
(no, 'D' does not stand for 'dumb'!). I was focused
on more prosaic things in the mean time so just
didn't have time for writing it.. ]

here is a few cleanup/simplification/optimization(s)
based on the recent modifications in the sched-dev tree.

(1) optimize task_new_fair()
(2) simplify yield_task()
(3) rework enqueue/dequeue_entity() to get rid of
sched_class::set_curr_task()

additionally, the changes somewhat decrease code size:

   textdata bss dec hex filename
  435385398  48   48984bf58 build/kernel/sched.o.before
  432505390  48   48688be30 build/kernel/sched.o

(SMP + lots of debugging options but, I guess, in this case the diff
should remain visible for any combination).

---

(1)

due to the fact that we no longer keep the 'current' within the tree,
dequeue/enqueue_entity() is useless for the 'current' in task_new_fair().
We are about to reschedule and sched_class-put_prev_task() will put
the 'current' back into the tree, based on its new key.

Signed-off-by: Dmitry Adamushko [EMAIL PROTECTED]

---
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 6e52d5a..5a244e2 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -969,10 +969,11 @@ static void task_new_fair(struct rq *rq, struct 
task_struct *p)
 
if (sysctl_sched_child_runs_first 
curr-vruntime  se-vruntime) {
-
-   dequeue_entity(cfs_rq, curr, 0);
+   /*
+* Upon rescheduling, sched_class::put_prev_task() will place
+* 'current' within the tree based on its new key value.
+*/
swap(curr-vruntime, se-vruntime);
-   enqueue_entity(cfs_rq, curr, 0);
}
 
update_stats_enqueue(cfs_rq, se);

---

Dmitry



-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[git] CFS-devel, group scheduler, fixes

2007-09-18 Thread Dmitry Adamushko

(2)

the 'p' (task_struct) parameter in the sched_class :: yield_task()
is redundant as the caller is always the 'current'. Get rid of it.

Signed-off-by: Dmitry Adamushko [EMAIL PROTECTED]

---
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9fd936f..3728cd6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -857,7 +857,7 @@ struct sched_class {
 
void (*enqueue_task) (struct rq *rq, struct task_struct *p, int wakeup);
void (*dequeue_task) (struct rq *rq, struct task_struct *p, int sleep);
-   void (*yield_task) (struct rq *rq, struct task_struct *p);
+   void (*yield_task) (struct rq *rq);
 
void (*check_preempt_curr) (struct rq *rq, struct task_struct *p);
 
diff --git a/kernel/sched.c b/kernel/sched.c
index 046dae1..361fad8 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -4535,7 +4535,7 @@ asmlinkage long sys_sched_yield(void)
if (unlikely(rq-nr_running == 1))
schedstat_inc(rq, yld_act_empty);
else
-   current-sched_class-yield_task(rq, current);
+   current-sched_class-yield_task(rq);
 
/*
 * Since we are going to call schedule() anyway, there's
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 5a244e2..9b982ef 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -723,10 +723,12 @@ static void dequeue_task_fair(struct rq *rq, struct 
task_struct *p, int sleep)
 /*
  * sched_yield() support is very simple - we dequeue and enqueue
  */
-static void yield_task_fair(struct rq *rq, struct task_struct *p)
+static void yield_task_fair(struct rq *rq)
 {
+   struct task_struct *curr = rq-curr;
+   struct cfs_rq *cfs_rq = task_cfs_rq(curr);
+
if (!sysctl_sched_yield_bug_workaround) {
-   struct cfs_rq *cfs_rq = task_cfs_rq(p);
__update_rq_clock(rq);
 
/*
@@ -737,7 +739,6 @@ static void yield_task_fair(struct rq *rq, struct 
task_struct *p)
}
 
if (sysctl_sched_yield_bug_workaround == 1) {
-   struct cfs_rq *cfs_rq = task_cfs_rq(p);
struct sched_entity *next;
 
/*
@@ -757,7 +758,7 @@ static void yield_task_fair(struct rq *rq, struct 
task_struct *p)
/*
 * Minimally necessary key value to be the second in the tree:
 */
-   p-se.vruntime = next-vruntime +
+   curr-se.vruntime = next-vruntime +
sysctl_sched_yield_granularity;
 
/*
@@ -770,7 +771,7 @@ static void yield_task_fair(struct rq *rq, struct 
task_struct *p)
/*
 * Just reschedule, do nothing else:
 */
-   resched_task(p);
+   resched_task(curr);
 }
 
 /*
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 45b339f..b86944c 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -59,9 +59,9 @@ static void requeue_task_rt(struct rq *rq, struct task_struct 
*p)
 }
 
 static void
-yield_task_rt(struct rq *rq, struct task_struct *p)
+yield_task_rt(struct rq *rq)
 {
-   requeue_task_rt(rq, p);
+   requeue_task_rt(rq, rq-curr);
 }
 
 /*

---

Dmitry




-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [git] CFS-devel, group scheduler, fixes

2007-09-18 Thread Dmitry Adamushko

(3)

rework enqueue/dequeue_entity() to get rid of sched_class::set_curr_task().
This simplifies sched_setscheduler(), rt_mutex_setprio() and sched_move_tasks().

Signed-off-by : Dmitry Adamushko [EMAIL PROTECTED]
Signed-off-by : Srivatsa Vaddagiri [EMAIL PROTECTED]

---
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 3728cd6..1094804 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -870,7 +870,6 @@ struct sched_class {
struct sched_domain *sd, enum cpu_idle_type idle,
int *all_pinned, int *this_best_prio);
 
-   void (*set_curr_task) (struct rq *rq);
void (*task_tick) (struct rq *rq, struct task_struct *p);
void (*task_new) (struct rq *rq, struct task_struct *p);
 };
diff --git a/kernel/sched.c b/kernel/sched.c
index 361fad8..bc1a625 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3910,8 +3910,8 @@ EXPORT_SYMBOL(sleep_on_timeout);
  */
 void rt_mutex_setprio(struct task_struct *p, int prio)
 {
-   int oldprio, on_rq, running;
unsigned long flags;
+   int oldprio, on_rq;
struct rq *rq;
 
BUG_ON(prio  0 || prio  MAX_PRIO);
@@ -3921,12 +3921,8 @@ void rt_mutex_setprio(struct task_struct *p, int prio)
 
oldprio = p-prio;
on_rq = p-se.on_rq;
-   running = task_running(rq, p);
-   if (on_rq) {
+   if (on_rq)
dequeue_task(rq, p, 0);
-   if (running)
-   p-sched_class-put_prev_task(rq, p);
-   }
 
if (rt_prio(prio))
p-sched_class = rt_sched_class;
@@ -3936,15 +3932,13 @@ void rt_mutex_setprio(struct task_struct *p, int prio)
p-prio = prio;
 
if (on_rq) {
-   if (running)
-   p-sched_class-set_curr_task(rq);
enqueue_task(rq, p, 0);
/*
 * Reschedule if we are currently running on this runqueue and
 * our priority decreased, or if we are not currently running on
 * this runqueue and our priority is higher than the current's
 */
-   if (running) {
+   if (task_running(rq, p)) {
if (p-prio  oldprio)
resched_task(rq-curr);
} else {
@@ -4150,7 +4144,7 @@ __setscheduler(struct rq *rq, struct task_struct *p, int 
policy, int prio)
 int sched_setscheduler(struct task_struct *p, int policy,
   struct sched_param *param)
 {
-   int retval, oldprio, oldpolicy = -1, on_rq, running;
+   int retval, oldprio, oldpolicy = -1, on_rq;
unsigned long flags;
struct rq *rq;
 
@@ -4232,24 +4226,20 @@ recheck:
}
update_rq_clock(rq);
on_rq = p-se.on_rq;
-   running = task_running(rq, p);
-   if (on_rq) {
+   if (on_rq)
deactivate_task(rq, p, 0);
-   if (running)
-   p-sched_class-put_prev_task(rq, p);
-   }
+
oldprio = p-prio;
__setscheduler(rq, p, policy, param-sched_priority);
+
if (on_rq) {
-   if (running)
-   p-sched_class-set_curr_task(rq);
activate_task(rq, p, 0);
/*
 * Reschedule if we are currently running on this runqueue and
 * our priority decreased, or if we are not currently running on
 * this runqueue and our priority is higher than the current's
 */
-   if (running) {
+   if (task_running(rq, p)) {
if (p-prio  oldprio)
resched_task(rq-curr);
} else {
@@ -6853,19 +6843,13 @@ static void sched_move_task(struct container_subsys 
*ss, struct container *cont,
running = task_running(rq, tsk);
on_rq = tsk-se.on_rq;
 
-   if (on_rq) {
+   if (on_rq)
dequeue_task(rq, tsk, 0);
-   if (unlikely(running))
-   tsk-sched_class-put_prev_task(rq, tsk);
-   }
 
set_task_cfs_rq(tsk);
 
-   if (on_rq) {
-   if (unlikely(running))
-   tsk-sched_class-set_curr_task(rq);
+   if (on_rq)
enqueue_task(rq, tsk, 0);
-   }
 
 done:
task_rq_unlock(rq, flags);
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index e65af8c..6539377 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -475,9 +475,20 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity 
*se, int initial)
 }
 
 static void
-enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int wakeup)
+enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
+   int wakeup, int set_curr)
 {
/*
+* In case of the 'current'.
+*/
+   if (unlikely(set_curr)) {
+   update_stats_curr_start(cfs_rq, se);
+   

Re: [git] CFS-devel, group scheduler, fixes

2007-09-18 Thread Ingo Molnar

* dimm [EMAIL PROTECTED] wrote:

 [ well, don't expect to find here anything like RDCFS (no, 'D' does 
 not stand for 'dumb'!). I was focused on more prosaic things in the 
 mean time so just didn't have time for writing it.. ]
 
 here is a few cleanup/simplification/optimization(s) based on the 
 recent modifications in the sched-dev tree.
 
 (1) optimize task_new_fair()
 (2) simplify yield_task()
 (3) rework enqueue/dequeue_entity() to get rid of
 sched_class::set_curr_task()

the queue with your enhancements and simplifications applied looks good 
here, and it booted fine on two testboxes. I've updated the 
sched-devel.git tree:

   git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-devel.git

(below is the full shortlog over current upstream.)

(I have not tested the group scheduling bits but perhaps Srivatsa would 
like to do that?)

Ingo

--
Dmitry Adamushko (9):
  sched: clean up struct load_stat
  sched: clean up schedstat block in dequeue_entity()
  sched: sched_setscheduler() fix
  sched: add set_curr_task() calls
  sched: do not keep current in the tree and get rid of sched_entity::fair_key
  sched: yield-workaround update
  sched: optimize task_new_fair()
  sched: simplify sched_class::yield_task()
  sched: rework enqueue/dequeue_entity() to get rid of set_curr_task()

Ingo Molnar (29):
  sched: fix new-task method
  sched: resched task in task_new_fair()
  sched: small sched_debug cleanup
  sched: debug: track maximum 'slice'
  sched: uniform tunings
  sched: use constants if !CONFIG_SCHED_DEBUG
  sched: remove stat_gran
  sched: remove precise CPU load
  sched: remove precise CPU load calculations #2
  sched: track cfs_rq->curr on !group-scheduling too
  sched: cleanup: simplify cfs_rq_curr() methods
  sched: uninline __enqueue_entity()/__dequeue_entity()
  sched: speed up update_load_add/_sub()
  sched: clean up calc_weighted()
  sched: introduce se->vruntime
  sched: move sched_feat() definitions
  sched: optimize vruntime based scheduling
  sched: simplify check_preempt() methods
  sched: wakeup granularity fix
  sched: add se->vruntime debugging
  sched: sync up ->min_vruntime when going idle
  sched: add more vruntime statistics
  sched: debug: update exec_clock only when SCHED_DEBUG
  sched: remove wait_runtime limit
  sched: remove wait_runtime fields and features
  sched: x86: allow single-depth wchan output
  sched: yield workaround
  sched: fix delay accounting performance regression
  sched: disable START_DEBIT

Matthias Kaehlcke (1):
  sched: use list_for_each_entry_safe() in __wake_up_common()

Mike Galbraith (1):
  sched: fix SMP migration latencies

Peter Zijlstra (7):
  sched: simplify SCHED_FEAT_* code
  sched: new task placement for vruntime
  sched: simplify adaptive latency
  sched: clean up new task placement
  sched: add tree based averages
  sched: handle vruntime overflow
  sched: better min_vruntime tracking

Srivatsa Vaddagiri (1):
  sched: group-scheduler core

 arch/i386/Kconfig |   11 
 include/linux/sched.h |   24 -
 init/Kconfig  |9 
 kernel/sched.c|  544 
 kernel/sched_debug.c  |  100 ++
 kernel/sched_fair.c   |  752 --
 kernel/sched_rt.c |4 
 kernel/sched_stats.h  |4 
 kernel/sysctl.c   |   35 +-
 9 files changed, 763 insertions(+), 720 deletions(-)
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [git] CFS-devel, group scheduler, fixes

2007-09-18 Thread Ingo Molnar

* Dmitry Adamushko <[EMAIL PROTECTED]> wrote:

> (2)
> 
> the 'p' (task_struct) parameter in the sched_class :: yield_task() is 
> redundant as the caller is always the 'current'. Get rid of it.
> 
> Signed-off-by: Dmitry Adamushko <[EMAIL PROTECTED]>

ah - good one! I completely forgot about that parameter and it's indeed 
unneeded. Applied.

Ingo


Re: [git] CFS-devel, group scheduler, fixes

2007-09-18 Thread Ingo Molnar

* dimm <[EMAIL PROTECTED]> wrote:

> (1)
> 
> due to the fact that we no longer keep the 'current' within the tree, 
> dequeue/enqueue_entity() is useless for the 'current' in 
> task_new_fair(). We are about to reschedule and 
> sched_class->put_prev_task() will put the 'current' back into the 
> tree, based on its new key.
> 
> Signed-off-by: Dmitry Adamushko <[EMAIL PROTECTED]>

thanks, applied.

Ingo
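A toy user-space model makes the invariant Dmitry relies on here concrete: the running entity lives outside the queue, and put_prev_task() is what re-inserts it at its updated key. In this sketch a sorted list stands in for the rbtree; all names are illustrative, not the kernel API:

```c
#include <stddef.h>
#include <stdint.h>

/* Toy model: a sorted list keyed on vruntime stands in for the rbtree. */
struct entity {
	uint64_t vruntime;
	struct entity *next;
};

struct rq {
	struct entity *head;   /* queued entities, sorted by vruntime */
	struct entity *curr;   /* running entity, NOT kept in the list */
};

void enqueue(struct rq *rq, struct entity *se)
{
	struct entity **p = &rq->head;

	/* insert in vruntime order */
	while (*p && (*p)->vruntime <= se->vruntime)
		p = &(*p)->next;
	se->next = *p;
	*p = se;
}

/* pick_next: remove the leftmost entity and make it current */
struct entity *pick_next(struct rq *rq)
{
	struct entity *se = rq->head;

	if (se) {
		rq->head = se->next;
		rq->curr = se;
	}
	return se;
}

/* put_prev: current goes back into the tree at its updated position */
void put_prev(struct rq *rq)
{
	if (rq->curr) {
		enqueue(rq, rq->curr);
		rq->curr = NULL;
	}
}
```

Because put_prev() alone restores curr at its new key, a path that is about to reschedule (as task_new_fair() is) has no need for an explicit dequeue/enqueue pair on the current task.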


Re: [git] CFS-devel, group scheduler, fixes

2007-09-18 Thread Ingo Molnar

* Dmitry Adamushko <[EMAIL PROTECTED]> wrote:

> (3)
> 
> rework enqueue/dequeue_entity() to get rid of 
> sched_class::set_curr_task(). This simplifies sched_setscheduler(), 
> rt_mutex_setprio() and sched_move_tasks().

ah. This makes your ready-queue patch a code size win. Applied.

Ingo


Re: [git] CFS-devel, group scheduler, fixes

2007-09-18 Thread Srivatsa Vaddagiri
On Tue, Sep 18, 2007 at 10:22:43PM +0200, Ingo Molnar wrote:
> (I have not tested the group scheduling bits but perhaps Srivatsa would 
> like to do that?)

Ingo,
I plan to test it today and send you any updates that may be
required.

-- 
Regards,
vatsa


[git] CFS-devel, group scheduler, fixes

2007-09-15 Thread Ingo Molnar

The latest sched-devel.git tree can be pulled from:

   git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-devel.git

people were busy sending patches, so there's lots of updates since the 
first announcement of the cfs-devel.git tree four days ago:

  include/linux/sched.h   |4 
  init/Kconfig|9 +
  kernel/sched.c  |  374 +++-
  kernel/sched_debug.c|   22 +-
  kernel/sched_fair.c |  191 
  kernel/sched_idletask.c |5 
  kernel/sched_rt.c   |5 
  kernel/sysctl.c |   19 ++
  8 files changed, 552 insertions(+), 77 deletions(-)

the most user-noticeable improvement should be the latency/interactivity 
fixes from Mike Galbraith and Peter Zijlstra. The biggest (and most 
complex) item is the new group scheduler code from Srivatsa Vaddagiri. 
There's also speedups from Dmitry Adamushko and various cleanups and 
fixes. Also added in the yield workaround that should address the 
regression reported by Antoine Martin.

Changes:

- the biggest item is the addition of the group scheduler from Srivatsa 
  Vaddagiri - this is not configurable yet, it depends on
  CONFIG_CONTAINERS. It causes no overhead on !CONFIG_CONTAINERS. This
  code clearly seems mature now, hopefully the container bits go
  upstream in 2.6.24 too. Srivatsa did lots of heavy lifting in the past 
  few months, and this final bit of code that moves in all the 
  infrastructure changes almost nothing in the core scheduler.

- a triplet of nice simplifications from Dmitry Adamushko. Most notably
  Dmitry got rid of se->fair_key which shaves 8 bytes of task_struct and
  gives further speedup. Thanks Dmitry!

- continued refinements to the SMP ->vruntime code and timeslicing by 
  Peter Zijstra and Mike Galbraith.

As usual, bugreports, fixes and suggestions are welcome and please 
holler if some patch went missing in action.

Ingo

-->
Dmitry Adamushko (3):
  sched: clean up struct load_stat
  sched: clean up schedstat block in dequeue_entity()
  sched: optimize away ->fair_key

Matthias Kaehlcke (1):
  sched: use list_for_each_entry_safe() in __wake_up_common()

Mike Galbraith (1):
  sched: fix SMP migration latencies

Peter Zijlstra (7):
  sched: simplify SCHED_FEAT_* code
  sched: new task placement for vruntime
  sched: simplify adaptive latency
  sched: clean up new task placement
  sched: add tree based averages
  sched: handle vruntime overflow
  sched: better min_vruntime tracking

Srivatsa Vaddagiri (1):
  sched: group-scheduler core

Ingo Molnar (27):
  sched: fix new-task method
  sched: resched task in task_new_fair()
  sched: small sched_debug cleanup
  sched: debug: track maximum 'slice'
  sched: uniform tunings
  sched: use constants if !CONFIG_SCHED_DEBUG
  sched: remove stat_gran
  sched: remove precise CPU load
  sched: remove precise CPU load calculations #2
  sched: track cfs_rq->curr on !group-scheduling too
  sched: cleanup: simplify cfs_rq_curr() methods
  sched: uninline __enqueue_entity()/__dequeue_entity()
  sched: speed up update_load_add/_sub()
  sched: clean up calc_weighted()
  sched: introduce se->vruntime
  sched: move sched_feat() definitions
  sched: optimize vruntime based scheduling
  sched: simplify check_preempt() methods
  sched: wakeup granularity fix
  sched: add se->vruntime debugging
  sched: sync up ->min_vruntime when going idle
  sched: add more vruntime statistics
  sched: debug: update exec_clock only when SCHED_DEBUG
  sched: remove wait_runtime limit
  sched: remove wait_runtime fields and features
  sched: x86: allow single-depth wchan output
  sched: yield workaround

 arch/i386/Kconfig   |   11 
 include/linux/sched.h   |   21 -
 init/Kconfig|9 
 kernel/sched.c  |  558 +
 kernel/sched_debug.c|  100 +++---
 kernel/sched_fair.c |  716 +++-
 kernel/sched_idletask.c |5 
 kernel/sched_rt.c   |5 
 kernel/sysctl.c |   33 +-
 9 files changed, 765 insertions(+), 693 deletions(-)

