[PATCH] sched: implement staircase deadline cpu scheduler improvements fix

2007-04-04 Thread Con Kolivas
On Wednesday 04 April 2007 09:31, Michal Piotrowski wrote:
> Con Kolivas wrote:
> > On Wednesday 04 April 2007 08:20, Michal Piotrowski wrote:
> > Michal Piotrowski wrote:
> > http://www.stardust.webpages.pl/files/tbf/bitis-gabonica/2.6.21-rc5-mm4/mm-oops
> > http://www.stardust.webpages.pl/files/tbf/bitis-gabonica/2.6.21-rc5-mm4/mm-config
>
> > Con, I think that your
> > sched-implement-staircase-deadline-cpu-scheduler-staircase-improvements.patch
> > is causing this oops.
>
> > Thanks for heads up!

Confirmed offline with Michal that the following patch fixes it. Thanks!

This should also make nice work better in the way the previous patch intended
it to.

---
Use of memset was bogus. Fix it.

Fix exiting recalc_task_prio without p->array being updated.

Microoptimisation courtesy of Dmitry Adamushko [EMAIL PROTECTED]

Signed-off-by: Con Kolivas [EMAIL PROTECTED]

---
 kernel/sched.c |   17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

Index: linux-2.6.21-rc5-mm4/kernel/sched.c
===
--- linux-2.6.21-rc5-mm4.orig/kernel/sched.c	2007-04-04 12:14:29.0 +1000
+++ linux-2.6.21-rc5-mm4/kernel/sched.c 2007-04-04 12:49:39.0 +1000
@@ -683,11 +683,13 @@ static void dequeue_task(struct task_str
  * The task is being queued on a fresh array so it has its entitlement
  * bitmap cleared.
  */
-static inline void task_new_array(struct task_struct *p, struct rq *rq)
+static void task_new_array(struct task_struct *p, struct rq *rq,
+  struct prio_array *array)
 {
bitmap_zero(p->bitmap, PRIO_RANGE);
p->rotation = rq->prio_rotation;
p->time_slice = p->quota;
+   p->array = array;
 }
 
 /* Find the first slot from the relevant prio_matrix entry */
@@ -709,6 +711,8 @@ static inline int next_entitled_slot(str
DECLARE_BITMAP(tmp, PRIO_RANGE);
int search_prio, uprio = USER_PRIO(p->static_prio);

+   if (!rq->prio_level[uprio])
+   rq->prio_level[uprio] = MAX_RT_PRIO;
/*
 * Only priorities equal to the prio_level and above for their
 * static_prio are acceptable, and only if it's not better than
@@ -736,11 +740,8 @@ static inline int next_entitled_slot(str
 
 static void queue_expired(struct task_struct *p, struct rq *rq)
 {
-   p->array = rq->expired;
-   task_new_array(p, rq);
+   task_new_array(p, rq, rq->expired);
p->prio = p->normal_prio = first_prio_slot(p);
-   p->time_slice = p->quota;
-   p->rotation = rq->prio_rotation;
 }
 
 #ifdef CONFIG_SMP
@@ -800,9 +801,9 @@ static void recalc_task_prio(struct task
queue_expired(p, rq);
return;
} else
-   task_new_array(p, rq);
+   task_new_array(p, rq, array);
} else
-   task_new_array(p, rq);
+   task_new_array(p, rq, array);
 
queue_prio = next_entitled_slot(p, rq);
if (queue_prio >= MAX_PRIO) {
@@ -3445,7 +3446,7 @@ EXPORT_SYMBOL(sub_preempt_count);
 
 static inline void reset_prio_levels(struct rq *rq)
 {
-   memset(rq->prio_level, MAX_RT_PRIO, ARRAY_SIZE(rq->prio_level));
+   memset(rq->prio_level, 0, sizeof(int) * PRIO_RANGE);
 }
 
 /*
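A side note on why the memset() hunk above matters: memset() counts bytes and replicates its fill value into every byte, while ARRAY_SIZE() counts elements, so the old call only touched the first few ints of prio_level and filled them with a byte pattern rather than MAX_RT_PRIO. A minimal user-space sketch of the failure (the constant values here are illustrative stand-ins, not taken from the kernel headers):

#include <stdio.h>
#include <string.h>

#define ARRAY_SIZE(x)	(sizeof(x) / sizeof((x)[0]))
#define MAX_RT_PRIO	100	/* stand-in values for illustration only */
#define PRIO_RANGE	40

int main(void)
{
	int prio_level[PRIO_RANGE] = { 0 };

	/* Buggy: ARRAY_SIZE() is 40 elements, but memset() counts bytes, and
	   the fill value is replicated into every byte of each int. */
	memset(prio_level, MAX_RT_PRIO, ARRAY_SIZE(prio_level));
	printf("buggy: [0]=%d [20]=%d\n", prio_level[0], prio_level[20]);

	/* The fix zeroes the whole array instead; a zero entry is then lazily
	   replaced with MAX_RT_PRIO by the new check in next_entitled_slot(). */
	memset(prio_level, 0, sizeof(int) * PRIO_RANGE);
	printf("fixed: [0]=%d [20]=%d\n", prio_level[0], prio_level[20]);
	return 0;
}

With 4-byte ints the buggy call leaves element 0 holding 0x64646464 and everything past element 9 untouched, which is why the reset had to change shape rather than just change its length argument.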

-- 
-ck
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.21-rc5-mm4

2007-04-04 Thread Con Kolivas
On Thursday 05 April 2007 08:10, Andrew Morton wrote:
> Thanks - that'll be the CPU scheduler changes.
>
> Con has produced a patch or two which might address this but afaik we don't
> yet have a definitive fix?
>
> I believe that reverting
> sched-implement-staircase-deadline-cpu-scheduler-staircase-improvements.patch
> will prevent it.

I posted a definitive fix which Michal tested for me offlist. Subject was:
 [PATCH] sched: implement staircase deadline cpu scheduler improvements fix

Sorry about relative noise prior to that. Akpm please pick it up.

Here again just in case.

---
Use of memset was bogus. Fix it.

Fix exiting recalc_task_prio without p->array being updated.

Microoptimisation courtesy of Dmitry Adamushko [EMAIL PROTECTED]

Signed-off-by: Con Kolivas [EMAIL PROTECTED]

---
 kernel/sched.c |   17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

Index: linux-2.6.21-rc5-mm4/kernel/sched.c
===
--- linux-2.6.21-rc5-mm4.orig/kernel/sched.c	2007-04-04 12:14:29.0 +1000
+++ linux-2.6.21-rc5-mm4/kernel/sched.c 2007-04-04 12:49:39.0 +1000
@@ -683,11 +683,13 @@ static void dequeue_task(struct task_str
  * The task is being queued on a fresh array so it has its entitlement
  * bitmap cleared.
  */
-static inline void task_new_array(struct task_struct *p, struct rq *rq)
+static void task_new_array(struct task_struct *p, struct rq *rq,
+  struct prio_array *array)
 {
bitmap_zero(p->bitmap, PRIO_RANGE);
p->rotation = rq->prio_rotation;
p->time_slice = p->quota;
+   p->array = array;
 }
 
 /* Find the first slot from the relevant prio_matrix entry */
@@ -709,6 +711,8 @@ static inline int next_entitled_slot(str
DECLARE_BITMAP(tmp, PRIO_RANGE);
int search_prio, uprio = USER_PRIO(p->static_prio);

+   if (!rq->prio_level[uprio])
+   rq->prio_level[uprio] = MAX_RT_PRIO;
/*
 * Only priorities equal to the prio_level and above for their
 * static_prio are acceptable, and only if it's not better than
@@ -736,11 +740,8 @@ static inline int next_entitled_slot(str
 
 static void queue_expired(struct task_struct *p, struct rq *rq)
 {
-   p->array = rq->expired;
-   task_new_array(p, rq);
+   task_new_array(p, rq, rq->expired);
p->prio = p->normal_prio = first_prio_slot(p);
-   p->time_slice = p->quota;
-   p->rotation = rq->prio_rotation;
 }
 
 #ifdef CONFIG_SMP
@@ -800,9 +801,9 @@ static void recalc_task_prio(struct task
queue_expired(p, rq);
return;
} else
-   task_new_array(p, rq);
+   task_new_array(p, rq, array);
} else
-   task_new_array(p, rq);
+   task_new_array(p, rq, array);
 
queue_prio = next_entitled_slot(p, rq);
if (queue_prio >= MAX_PRIO) {
@@ -3445,7 +3446,7 @@ EXPORT_SYMBOL(sub_preempt_count);
 
 static inline void reset_prio_levels(struct rq *rq)
 {
-   memset(rq->prio_level, MAX_RT_PRIO, ARRAY_SIZE(rq->prio_level));
+   memset(rq->prio_level, 0, sizeof(int) * PRIO_RANGE);
 }
 
 /*

-- 
-ck
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.21-rc5-mm4

2007-04-03 Thread Con Kolivas
On Wednesday 04 April 2007 08:20, Michal Piotrowski wrote:
> Michal Piotrowski wrote:
> > http://www.stardust.webpages.pl/files/tbf/bitis-gabonica/2.6.21-rc5-mm4/mm-oops
> > http://www.stardust.webpages.pl/files/tbf/bitis-gabonica/2.6.21-rc5-mm4/mm-config
>
> Sorry for a delay.

Never apologise! 

(I'm trying hard to stay offline for my own health so I may have huge delays).

> Con, I think that your
> sched-implement-staircase-deadline-cpu-scheduler-staircase-improvements.patch
> is causing this oops.

Thanks for heads up!

Try this patch please?
---
Fix exiting recalc_task_prio without p->array being updated.

Microoptimisation courtesy of Dmitry Adamushko <[EMAIL PROTECTED]>

Signed-off-by: Con Kolivas <[EMAIL PROTECTED]>

---
 kernel/sched.c |   13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

Index: linux-2.6.21-rc5-mm3/kernel/sched.c
===
--- linux-2.6.21-rc5-mm3.orig/kernel/sched.c	2007-04-04 08:23:45.0 +1000
+++ linux-2.6.21-rc5-mm3/kernel/sched.c 2007-04-04 08:25:39.0 +1000
@@ -683,11 +683,13 @@ static void dequeue_task(struct task_str
  * The task is being queued on a fresh array so it has its entitlement
  * bitmap cleared.
  */
-static inline void task_new_array(struct task_struct *p, struct rq *rq)
+static void task_new_array(struct task_struct *p, struct rq *rq,
+  struct prio_array *array)
 {
bitmap_zero(p->bitmap, PRIO_RANGE);
p->rotation = rq->prio_rotation;
p->time_slice = p->quota;
+   p->array = array;
 }
 
 /* Find the first slot from the relevant prio_matrix entry */
@@ -736,11 +738,8 @@ static inline int next_entitled_slot(str
 
 static void queue_expired(struct task_struct *p, struct rq *rq)
 {
-   p->array = rq->expired;
-   task_new_array(p, rq);
+   task_new_array(p, rq, rq->expired);
p->prio = p->normal_prio = first_prio_slot(p);
-   p->time_slice = p->quota;
-   p->rotation = rq->prio_rotation;
 }
 
 #ifdef CONFIG_SMP
@@ -800,9 +799,9 @@ static void recalc_task_prio(struct task
queue_expired(p, rq);
return;
} else
-   task_new_array(p, rq);
+   task_new_array(p, rq, array);
} else
-   task_new_array(p, rq);
+   task_new_array(p, rq, array);
 
queue_prio = next_entitled_slot(p, rq);
if (queue_prio >= MAX_PRIO) {

-- 
-ck
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] sched: staircase deadline misc fixes

2007-04-02 Thread Con Kolivas
On Thursday 29 March 2007 15:50, Mike Galbraith wrote:
> On Thu, 2007-03-29 at 09:44 +1000, Con Kolivas wrote:
> + * This contains a bitmap for each dynamic priority level with empty slots
> + * for the valid priorities each different nice level can have. It allows
> + * us to stagger the slots where differing priorities run in a way that
> + * keeps latency differences between different nice levels at a minimum.
> + * ie, where 0 means a slot for that priority, priority running from left to
> + * right:
> + * nice -20 0000000000000000000000000000000000000000
> + * nice -10 1001000100100010001001000100010010001000
> + * nice   0 0101010101010101010101010101010101010101
> + * nice   5 1101011010110101101011010110101101011011
> + * nice  10 0110111011011101110110111011101101110111
> + * nice  15 0101101101011011
> + * nice  19 1110

Try two instances of chew.c at _differing_ nice levels on one cpu on mainline, 
and then SD. This is why you can't renice X on mainline.

>   -Mike

-- 
-ck
/*
 * original idea by Chris Friesen.  Thanks.
 */

#include <stdio.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <sched.h>
#include <time.h>
#include <unistd.h>

#define THRESHOLD_USEC 2000

unsigned long long stamp()
{
struct timeval tv;
gettimeofday(&tv, 0);
return (unsigned long long) tv.tv_usec + ((unsigned long long) tv.tv_sec)*1000000;
}

int main()
{
unsigned long long thresh_ticks = THRESHOLD_USEC;
unsigned long long cur,last;
struct timespec ts;

sched_rr_get_interval(0, &ts);
printf("pid %d, prio %3d, interval of %d nsec\n", getpid(), getpriority(PRIO_PROCESS, 0), ts.tv_nsec);

last = stamp();
while(1) {
cur = stamp();
unsigned long long delta = cur-last;
if (delta > thresh_ticks) {
printf("pid %d, prio %3d, out for %4llu ms\n", getpid(), getpriority(PRIO_PROCESS, 0), delta/1000);
cur = stamp();
}
last = cur;
}

return 0;
}


Re: [test] hackbench.c interactivity results: vanilla versus SD/RSDL

2007-04-02 Thread Con Kolivas
On Saturday 31 March 2007 19:28, Xenofon Antidides wrote:
> For long time now I use windows to work 
> problems. I cannot play wine games with audio, I
> cannot sample video, I cannot use skype, I cannot play
> midi. And even linux only things I try do I cannot
> share my X, I cannot use more than one vmware. All
> those is fix for me with SD.

Any semblance of cpu bandwidth and latency guarantees is easily shot on 
mainline by a single process going wild (eg open tab in firefox).

> I sorry I answer kernel 
> email and go away now for good.

respected; dropped from cc

-- 
-ck
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] sched: staircase deadline misc fixes

2007-04-02 Thread Con Kolivas
On Thursday 29 March 2007 18:18, Mike Galbraith wrote:
> Rereading to make sure I wasn't unclear anywhere...
>
> On Thu, 2007-03-29 at 07:50 +0200, Mike Galbraith wrote:
> > I don't see what a < 95% load really means.
>
> Egad.  Here I'm pondering the numbers and light load as I'm typing, and
> my fingers (seemingly independent when mind wanders off) typed < 95% as
> in not fully committed, instead of "light".

95% of cases where load is less than 4; not 95% load.

-- 
-ck
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [test] hackbench.c interactivity results: vanilla versus SD/RSDL

2007-04-02 Thread Con Kolivas
On Thursday 29 March 2007 21:22, Ingo Molnar wrote:
> [ A quick guess: could SD's substandard interactivity in this test be
>   due to the SMP migration logic inconsistencies Mike noticed? This is
>   an SMP system and the hackbench workload is very scheduling intense
>   and tasks are frequently queued from one CPU to another. ]

I assume you put it in an endless loop since hackbench 10 runs for 0.5 seconds 
on my machine. Doubtful it's an SMP issue. update_if_moved should maintain 
cross cpu scheduling decisions. The same slowdown would happen on UP and is 
almost certainly due to the fact that hackbench 10 induces a load of _160_ on 
the machine.

-- 
-ck
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] sched: staircase deadline improvements

2007-04-02 Thread Con Kolivas
Staircase Deadline improvements.

Nice is better distributed for waking tasks with a per-static-prio prio_level.

SCHED_RR tasks were not being requeued on expiration.

Tighten up accounting.

Fix comment style.

Microoptimisation courtesy of Dmitry Adamushko <[EMAIL PROTECTED]>

Signed-off-by: Con Kolivas <[EMAIL PROTECTED]>

---
 kernel/sched.c |   97 +++--
 1 file changed, 60 insertions(+), 37 deletions(-)

Index: linux-2.6.21-rc5-mm3/kernel/sched.c
===
--- linux-2.6.21-rc5-mm3.orig/kernel/sched.c	2007-04-02 10:37:07.0 +1000
+++ linux-2.6.21-rc5-mm3/kernel/sched.c 2007-04-03 10:40:48.0 +1000
@@ -132,20 +132,20 @@ struct rq;
  * These are the runqueue data structures:
  */
 struct prio_array {
-   struct list_head queue[MAX_PRIO];
/* Tasks queued at each priority */
+   struct list_head queue[MAX_PRIO];
 
-   DECLARE_BITMAP(prio_bitmap, MAX_PRIO + 1);
/*
 * The bitmap of priorities queued for this array. While the expired
 * array will never have realtime tasks on it, it is simpler to have
 * equal sized bitmaps for a cheap array swap. Include 1 bit for
 * delimiter.
 */
+   DECLARE_BITMAP(prio_bitmap, MAX_PRIO + 1);
 
 #ifdef CONFIG_SMP
-   struct rq *rq;
/* For convenience looks back at rq */
+   struct rq *rq;
 #endif
 };
 
@@ -212,14 +212,14 @@ struct rq {
struct prio_array *active, *expired, arrays[2];
unsigned long *dyn_bitmap, *exp_bitmap;
 
-   int prio_level, best_static_prio;
/*
-* The current dynamic priority level this runqueue is at, and the
-* best static priority queued this major rotation.
+* The current dynamic priority level this runqueue is at per static
+* priority level, and the best static priority queued this rotation.
 */
+   int prio_level[PRIO_RANGE], best_static_prio;
 
-   unsigned long prio_rotation;
/* How many times we have rotated the priority queue */
+   unsigned long prio_rotation;
 
atomic_t nr_iowait;
 
@@ -707,19 +707,29 @@ static inline int first_prio_slot(struct
 static inline int next_entitled_slot(struct task_struct *p, struct rq *rq)
 {
DECLARE_BITMAP(tmp, PRIO_RANGE);
-   int search_prio;
+   int search_prio, uprio = USER_PRIO(p->static_prio);
 
-   if (p->static_prio < rq->best_static_prio)
+   /*
+* Only priorities equal to the prio_level and above for their
+* static_prio are acceptable, and only if it's not better than
+* a queued better static_prio's prio_level.
+*/
+   if (p->static_prio < rq->best_static_prio) {
search_prio = MAX_RT_PRIO;
-   else
-   search_prio = rq->prio_level;
+   if (likely(p->policy != SCHED_BATCH))
+   rq->best_static_prio = p->static_prio;
+   } else if (p->static_prio == rq->best_static_prio)
+   search_prio = rq->prio_level[uprio];
+   else {
+   search_prio = max(rq->prio_level[uprio],
+   rq->prio_level[USER_PRIO(rq->best_static_prio)]);
+   }
if (unlikely(p->policy == SCHED_BATCH)) {
search_prio = max(search_prio, p->static_prio);
return SCHED_PRIO(find_next_zero_bit(p->bitmap, PRIO_RANGE,
  USER_PRIO(search_prio)));
}
-   bitmap_or(tmp, p->bitmap, prio_matrix[USER_PRIO(p->static_prio)],
- PRIO_RANGE);
+   bitmap_or(tmp, p->bitmap, prio_matrix[uprio], PRIO_RANGE);
return SCHED_PRIO(find_next_zero_bit(tmp, PRIO_RANGE,
USER_PRIO(search_prio)));
 }
@@ -745,14 +755,18 @@ static void queue_expired(struct task_st
 
if (src_rq == rq)
return;
-   if (p->rotation == src_rq->prio_rotation)
+   /*
+* Only need to set p->array when p->rotation == rq->prio_rotation as
+* they will be set in recalc_task_prio when != rq->prio_rotation.
+*/
+   if (p->rotation == src_rq->prio_rotation) {
p->rotation = rq->prio_rotation;
-   else
+   if (p->array == src_rq->expired)
+   p->array = rq->expired;
+   else
+   p->array = rq->active;
+   } else
p->rotation = 0;
-   if (p->array == src_rq->expired)
-   p->array = rq->expired;
-   else
-   p->array = rq->active;
 }
 #else
 static inline void update_if_moved(struct task_struct *p, struct rq *rq)
@@ -1671,16 +1685,16 @@ void fastcall sched_fork(struct task_str
 * total amount of pending timeslices in the system doesn't change,
 


Re: [PATCH] sched: staircase deadline misc fixes

2007-03-28 Thread Con Kolivas
On Thursday 29 March 2007 02:37, Con Kolivas wrote:
> I'm cautiously optimistic that we're at the thin edge of the bugfix wedge
> now.

My neck condition got a lot worse today. I'm forced offline for a week and 
will be uncontactable.

-- 
-ck
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] sched: staircase deadline misc fixes

2007-03-28 Thread Con Kolivas
On Thursday 29 March 2007 04:48, Ingo Molnar wrote:
> hm, how about the questions Mike raised (there were a couple of cases of
> friction between 'the design as documented and announced' and 'the code
> as implemented')? As far as i saw they were still largely unanswered -
> but let me know if they are all answered and addressed:

I spent less time emailing and more time coding. I have been working on 
addressing whatever people brought up.

>  http://marc.info/?l=linux-kernel&m=117465220309006&w=2

Attended to.

>  http://marc.info/?l=linux-kernel&m=117489673929124&w=2

Attended to.

>  http://marc.info/?l=linux-kernel&m=117489831930240&w=2

Checked fine.

> and the numbers he posted:
>
>  http://marc.info/?l=linux-kernel&m=117448900626028&w=2

Attended to.

> his test conclusion was that under CPU load, RSDL (SD) generally does
> not hold up to mainline's interactivity.

There have been improvements since the earlier iterations but it's still a 
fairness based design. Mike's "sticking point" test case should be improved 
as well.

My call based on my own testing and feedback from users is: 

Under niced loads it is 99% in favour of SD.

Under light loads it is 95% in favour of SD.

Under Heavy loads it becomes proportionately in favour of mainline. The 
crossover is somewhere around a load of 4.

If the reluctance to renice X goes away I'd say it was 99% across the board 
and to much higher loads.

>   Ingo

-- 
-ck
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] sched: staircase deadline misc fixes

2007-03-28 Thread Con Kolivas
test.kernel.org found some idle time regressions in the latest update to the
staircase deadline scheduler and Andy Whitcroft helped me track down the 
offending problem which was present in all previous RSDL schedulers but
previously wouldn't be manifest without changes in nice. So here is a bugfix
for the set_load_weight being incorrectly set and a few other minor 
improvements. Thanks Andy!

I'm cautiously optimistic that we're at the thin edge of the bugfix wedge now.

---
set_load_weight() should be performed after p->quota is set. This fixes a
large SMP performance regression.

Make sure rr_interval is never set to less than one jiffy.

Some sanity checking in update_cpu_clock will prevent bogus sched_clock
values.

SCHED_BATCH tasks should not set the rq->best_static_prio field.

Correct sysctl rr_interval description to describe the value in milliseconds.

Style fixes.

Signed-off-by: Con Kolivas <[EMAIL PROTECTED]>

---
 Documentation/sysctl/kernel.txt |8 ++--
 kernel/sched.c  |   73 +---
 2 files changed, 58 insertions(+), 23 deletions(-)

Index: linux-2.6.21-rc5-mm2/kernel/sched.c
===
--- linux-2.6.21-rc5-mm2.orig/kernel/sched.c	2007-03-28 09:01:03.0 +1000
+++ linux-2.6.21-rc5-mm2/kernel/sched.c 2007-03-29 00:02:33.0 +1000
@@ -88,10 +88,13 @@ unsigned long long __attribute__((weak))
 #define MAX_USER_PRIO  (USER_PRIO(MAX_PRIO))
 #define SCHED_PRIO(p)  ((p)+MAX_RT_PRIO)
 
-/* Some helpers for converting to/from nanosecond timing */
+/* Some helpers for converting to/from various scales.*/
 #define NS_TO_JIFFIES(TIME)	((TIME) / (1000000000 / HZ))
-#define NS_TO_MS(TIME)	((TIME) / 1000000)
+#define JIFFIES_TO_NS(TIME)	((TIME) * (1000000000 / HZ))
 #define MS_TO_NS(TIME)	((TIME) * 1000000)
+/* Can return 0 */
+#define MS_TO_JIFFIES(TIME)((TIME) * HZ / 1000)
+#define JIFFIES_TO_MS(TIME)((TIME) * 1000 / HZ)
 
 #define TASK_PREEMPTS_CURR(p, curr)((p)->prio < (curr)->prio)
 
@@ -852,16 +855,15 @@ static void requeue_task(struct task_str
 
 /*
  * task_timeslice - the total duration a task can run during one major
- * rotation.
+ * rotation. Returns value in jiffies.
  */
 static inline int task_timeslice(struct task_struct *p)
 {
-   int slice, rr;
+   int slice;
 
-   slice = rr = p->quota;
+   slice = NS_TO_JIFFIES(p->quota);
if (!rt_task(p))
-   slice += (PRIO_RANGE - 1 - TASK_USER_PRIO(p)) * rr;
-   slice = NS_TO_JIFFIES(slice) ? : 1;
+   slice += (PRIO_RANGE - 1 - TASK_USER_PRIO(p)) * slice;
return slice;
 }
 
@@ -875,7 +877,7 @@ static inline int task_timeslice(struct 
(((lp) * SCHED_LOAD_SCALE) / TIME_SLICE_NICE_ZERO)
 #define TASK_LOAD_WEIGHT(p)LOAD_WEIGHT(task_timeslice(p))
 #define RTPRIO_TO_LOAD_WEIGHT(rp)  \
-   (LOAD_WEIGHT((rr_interval + 20 + (rp
+   (LOAD_WEIGHT((MS_TO_JIFFIES(rr_interval) + 20 + (rp
 
 static void set_load_weight(struct task_struct *p)
 {
@@ -973,11 +975,15 @@ static int effective_prio(struct task_st
  * tick still. Below nice 0 they get progressively larger.
  * ie nice -6..0 = rr_interval. nice -10 = 2.5 * rr_interval
  * nice -20 = 10 * rr_interval. nice 1-19 = rr_interval / 2.
+ * Value returned is in nanoseconds.
  */
 static unsigned int rr_quota(struct task_struct *p)
 {
int nice = TASK_NICE(p), rr = rr_interval;
 
+   /* Ensure that rr_interval is at least 1 tick */
+   if (unlikely(!MS_TO_JIFFIES(rr)))
+   rr = rr_interval = JIFFIES_TO_MS(1) ? : 1;
if (!rt_task(p)) {
if (nice < -6) {
rr *= nice * nice;
@@ -3198,13 +3204,34 @@ EXPORT_PER_CPU_SYMBOL(kstat);
 /*
  * This is called on clock ticks and on context switches.
  * Bank in p->sched_time the ns elapsed since the last tick or switch.
+ * CPU scheduler quota accounting is also performed here.
+ * The value returned from sched_clock() occasionally gives bogus values so
+ * some sanity checking is required.
  */
 static inline void
-update_cpu_clock(struct task_struct *p, struct rq *rq, unsigned long long now)
+update_cpu_clock(struct task_struct *p, struct rq *rq, unsigned long long now,
+int tick)
 {
cputime64_t time_diff = now - p->last_ran;
+   unsigned int min_diff = 1000;
 
-   /* cpu scheduler quota accounting is performed here */
+   if (tick) {
+   /*
+* Called from scheduler_tick() there should be less than two
+* jiffies worth, and not negative/overflow.
+*/
+   if (time_diff > JIFFIES_TO_NS(2) || time_diff < min_diff)
+   time_diff = JIFFIES_TO_NS(1);
+   } else {
+   /*
+* Called from context_switch there should be less than one
+* jiffy worth, and not negative/overflowed. In the case
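
The nice-to-quota mapping that rr_quota() implements above is easier to see with numbers. A small user-space sketch follows; note that the /40 divisor is inferred from the 2.5x and 10x figures quoted in the comment (the hunk is cut off before the divide), so treat it as an assumption rather than the patch's literal code:

#include <stdio.h>

/* Toy version of rr_quota(): quota handed out per priority level by nice. */
static unsigned int toy_rr_quota(int nice, unsigned int rr_interval)
{
	unsigned int rr = rr_interval;

	if (nice < -6) {
		int sq = nice * nice;		/* 49 .. 400 */

		rr = rr_interval * sq / 40;	/* divisor assumed, see above */
	} else if (nice > 0) {
		rr = rr_interval / 2;		/* smaller slices above nice 0 */
	}
	return rr;				/* nice -6..0 keep rr_interval */
}

int main(void)
{
	const unsigned int rr_interval = 8;	/* default 8ms per the SD docs */
	int nice;

	for (nice = -20; nice <= 19; nice += 13)
		printf("nice %3d -> quota %u ms\n", nice,
		       toy_rr_quota(nice, rr_interval));
	return 0;
}

With an 8ms rr_interval this prints 80ms at nice -20, 9ms at nice -7 and 4ms at positive nice, matching the 10x / roughly-1x / 0.5x ratios described in the comment.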


[PATCH][ 5/5] sched: document sd cpu scheduler

2007-03-26 Thread Con Kolivas
Add comprehensive documentation of the Staircase Deadline cpu scheduler design.

Signed-off-by: Con Kolivas <[EMAIL PROTECTED]>

---

 Documentation/sched-design.txt |  240 +++--
 1 file changed, 234 insertions(+), 6 deletions(-)

Index: linux-2.6.21-rc5-sd/Documentation/sched-design.txt
===
--- linux-2.6.21-rc5-sd.orig/Documentation/sched-design.txt	2006-11-30 11:30:31.0 +1100
+++ linux-2.6.21-rc5-sd/Documentation/sched-design.txt	2007-03-27 11:52:55.0 +1000
@@ -1,11 +1,14 @@
-  Goals, Design and Implementation of the
- new ultra-scalable O(1) scheduler
+ Goals, Design and Implementation of the ultra-scalable O(1) scheduler by
+ Ingo Molnar and the Staircase Deadline cpu scheduler policy designed by
+ Con Kolivas.
 
 
-  This is an edited version of an email Ingo Molnar sent to
-  lkml on 4 Jan 2002.  It describes the goals, design, and
-  implementation of Ingo's new ultra-scalable O(1) scheduler.
-  Last Updated: 18 April 2002.
+  This was originally an edited version of an email Ingo Molnar sent to
+  lkml on 4 Jan 2002.  It describes the goals, design, and implementation
+  of Ingo's ultra-scalable O(1) scheduler. It now contains a description
+  of the Staircase Deadline priority scheduler that was built on this
+  design.
+  Last Updated: Tue Mar 27 2007
 
 
 Goal
@@ -163,3 +166,228 @@ certain code paths and data constructs. 
 code is smaller than the old one.
 
Ingo
+
+
+Staircase Deadline cpu scheduler policy
+
+
+Design summary
+==
+
+A novel design which incorporates a foreground-background descending priority
+system (the staircase) via a bandwidth allocation matrix according to nice
+level.
+
+
+Features
+
+
+A starvation free, strict fairness O(1) scalable design with interactivity
+as good as the above restrictions can provide. There is no interactivity
+estimator, no sleep/run measurements and only simple fixed accounting.
+The design and accounting are strict enough that task behaviour can be
+modelled and maximum scheduling latencies can be predicted by the virtual
+deadline mechanism that manages runqueues. The prime concern
+in this design is to maintain fairness at all costs determined by nice level,
+yet to maintain as good interactivity as can be allowed within the
+constraints of strict fairness.
+
+
+Design description
+==
+
+SD works off the principle of providing each task a quota of runtime that it is
+allowed to run at a number of priority levels determined by its static priority
+(ie. its nice level). If the task uses up its quota it has its priority
+decremented to the next level determined by a priority matrix. Once every
+runtime quota has been consumed of every priority level, a task is queued on the
+"expired" array. When no other tasks exist with quota, the expired array is
+activated and fresh quotas are handed out. This is all done in O(1).
+
+Design details
+==
+
+Each task keeps a record of its own entitlement of cpu time. Most of the rest of
+these details apply to non-realtime tasks as rt task management is straight
+forward.
+
+Each runqueue keeps a record of what major epoch it is up to in the
+rq->prio_rotation field which is incremented on each major epoch. It also
+keeps a record of the current prio_level for each static priority task.
+
+Each task keeps a record of what major runqueue epoch it was last running
+on in p->rotation. It also keeps a record of what priority levels it has
+already been allocated quota from during this epoch in a bitmap p->bitmap.
+
+The only tunable that determines all other details is the RR_INTERVAL. This
+is set to 8ms, and is scaled gently upwards with more cpus. This value is
+tunable via a /proc interface.
+
+All tasks are initially given a quota based on RR_INTERVAL. This is equal to
+RR_INTERVAL between nice values of -6 and 0, half that size above nice 0, and
+progressively larger for nice values from -1 to -20. This is assigned to
+p->quota and only changes with changes in nice level.
+
+As a task is first queued, it checks in recalc_task_prio to see if it has run 
at
+this runqueue's current priority rotation. If it has not, it will have its
+p->prio level set according to the first slot in a "priority matrix" and will be
+given a p->time_slice equal to the p->quota, and has its allocation bitmap bit
+set in p->bitmap for this prio level. It is then queued on the current active
+priority array.
+
+If a task has already been running during this major epoch, and it has
+p->time_slice left and the rq->prio_quota for the task's p->prio still
+has quota, it will be placed back on the active array, but no more quota
+will be added.
+
+If a task has been running during this major epoch, but does
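
To make the epoch/quota mechanics described above concrete, here is a toy single-task, single-cpu model in plain C. It is illustrative only: the level count, the quota value and the linear descent are stand-ins, and the real scheduler picks slots from the prio_matrix and juggles many tasks per runqueue:

#include <stdio.h>

#define PRIO_LEVELS	8	/* stand-in for the task's entitled slots */
#define RR_INTERVAL	6	/* "ms" of quota handed out per level */

struct toy_task {
	int prio;			/* current dynamic priority level */
	int time_slice;			/* quota left at this level */
	unsigned char used[PRIO_LEVELS];/* bitmap: levels already drawn from */
	int expired;			/* queued on the expired array? */
};

/* charge 'ran' ms against the task and descend the staircase as needed */
static void account(struct toy_task *t, int ran)
{
	t->time_slice -= ran;
	while (t->time_slice <= 0 && !t->expired) {
		t->used[t->prio] = 1;
		if (t->prio == PRIO_LEVELS - 1) {
			/* every level's entitlement consumed: expire */
			t->expired = 1;
			return;
		}
		t->prio++;			/* next entitled slot */
		t->time_slice += RR_INTERVAL;	/* fresh quota for that level */
	}
}

int main(void)
{
	struct toy_task t = { .prio = 0, .time_slice = RR_INTERVAL };
	int ms;

	for (ms = 0; !t.expired; ms++) {
		int old = t.prio;

		account(&t, 1);		/* task runs flat out, 1 ms at a time */
		if (t.prio != old || t.expired)
			printf("after %3d ms: prio %d%s\n", ms + 1, t.prio,
			       t.expired ? " (expired)" : "");
	}
	return 0;
}

The task descends one priority level every RR_INTERVAL of cpu it burns and ends up on the expired array once every level's quota is gone; a new major epoch (prio_rotation) then hands out fresh quotas, which is what bounds how long any one task can starve another of its nice-determined bandwidth.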

[PATCH][ 4/5] sched: remove noninteractive flag

2007-03-26 Thread Con Kolivas
Remove the TASK_NONINTERACTIVE flag as it will no longer be used.

Signed-off-by: Con Kolivas <[EMAIL PROTECTED]>

---

 fs/pipe.c |7 +--
 include/linux/sched.h |3 +--
 2 files changed, 2 insertions(+), 8 deletions(-)

Index: linux-2.6.21-rc5-sd/fs/pipe.c
===
--- linux-2.6.21-rc5-sd.orig/fs/pipe.c  2007-03-26 11:03:31.0 +1000
+++ linux-2.6.21-rc5-sd/fs/pipe.c   2007-03-27 11:52:55.0 +1000
@@ -41,12 +41,7 @@ void pipe_wait(struct pipe_inode_info *p
 {
DEFINE_WAIT(wait);
 
-   /*
-* Pipes are system-local resources, so sleeping on them
-* is considered a noninteractive wait:
-*/
-   prepare_to_wait(&pipe->wait, &wait,
-   TASK_INTERRUPTIBLE | TASK_NONINTERACTIVE);
+   prepare_to_wait(&pipe->wait, &wait, TASK_INTERRUPTIBLE);
if (pipe->inode)
mutex_unlock(&pipe->inode->i_mutex);
schedule();
Index: linux-2.6.21-rc5-sd/include/linux/sched.h
===
--- linux-2.6.21-rc5-sd.orig/include/linux/sched.h  2007-03-27 11:52:55.0 +1000
+++ linux-2.6.21-rc5-sd/include/linux/sched.h   2007-03-27 11:52:55.0 +1000
@@ -149,8 +149,7 @@ extern unsigned long weighted_cpuload(co
 #define EXIT_ZOMBIE	16
 #define EXIT_DEAD	32
 /* in tsk->state again */
-#define TASK_NONINTERACTIVE	64
-#define TASK_DEAD	128
+#define TASK_DEAD	64
 
 #define __set_task_state(tsk, state_value) \
do { (tsk)->state = (state_value); } while (0)

-- 
-ck
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH][ 2/5] sched: remove sleepavg from proc

2007-03-26 Thread Con Kolivas
Remove the sleep_avg field from proc output as it will be removed from the
task_struct.

Signed-off-by: Con Kolivas <[EMAIL PROTECTED]>
---

 fs/proc/array.c |2 --
 1 file changed, 2 deletions(-)

Index: linux-2.6.21-rc5-sd/fs/proc/array.c
===
--- linux-2.6.21-rc5-sd.orig/fs/proc/array.c	2007-03-26 11:03:31.0 +1000
+++ linux-2.6.21-rc5-sd/fs/proc/array.c 2007-03-27 11:52:55.0 +1000
@@ -165,7 +165,6 @@ static inline char * task_state(struct t
rcu_read_lock();
buffer += sprintf(buffer,
"State:\t%s\n"
-   "SleepAVG:\t%lu%%\n"
"Tgid:\t%d\n"
"Pid:\t%d\n"
"PPid:\t%d\n"
@@ -173,7 +172,6 @@ static inline char * task_state(struct t
"Uid:\t%d\t%d\t%d\t%d\n"
"Gid:\t%d\t%d\t%d\t%d\n",
get_task_state(p),
-   (p->sleep_avg/1024)*100/(102000/1024),
p->tgid, p->pid,
pid_alive(p) ? rcu_dereference(p->real_parent)->tgid : 0,
pid_alive(p) && p->ptrace ? rcu_dereference(p->parent)->pid : 0,

-- 
-ck
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH][ 1/5] sched: dont renice kernel threads

2007-03-26 Thread Con Kolivas
The practice of renicing kernel threads to negative nice values is of
questionable benefit at best, and at worst leads to larger latencies when
kernel threads are busy on behalf of other tasks.

Signed-off-by: Con Kolivas <[EMAIL PROTECTED]>
---

 kernel/workqueue.c |2 --
 1 file changed, 2 deletions(-)

Index: linux-2.6.21-rc5-sd/kernel/workqueue.c
===
--- linux-2.6.21-rc5-sd.orig/kernel/workqueue.c 2007-03-26 11:03:31.0 +1000
+++ linux-2.6.21-rc5-sd/kernel/workqueue.c  2007-03-27 11:52:54.0 +1000
@@ -355,8 +355,6 @@ static int worker_thread(void *__cwq)
if (!cwq->freezeable)
current->flags |= PF_NOFREEZE;
 
-   set_user_nice(current, -5);
-
/* Block and flush all signals */
sigfillset(&blocked);
sigprocmask(SIG_BLOCK, &blocked, NULL);

-- 
-ck
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH][ 0/5] Staircase deadline v0.36

2007-03-26 Thread Con Kolivas
What follows is a clean major iteration of the (now) Staircase Deadline cpu 
scheduler.

Changes from RSDL v0.33:
- All accounting is moved to tasks in nanosecond resolution removing 
requirement for Rotation component entirely
- list_splice_tail is no longer required; dropped
- Nicer nice with smaller rr_intervals if HZ tolerates it
- Reworked SCHED_BATCH to keep same cpu bandwidth but lax latency
- Updated documentation

Patches that follow are for 2.6.21-rc5

Andrew please use these to replace the RSDL patches.

-- 
-ck
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: debug rsdl 0.33

2007-03-26 Thread Con Kolivas
On Tuesday 27 March 2007 01:28, Andy Whitcroft wrote:
> Andy Whitcroft wrote:
> Subsequent to that Con suggested testing a refactored RSDL patch.  That
> patch seemed to work on the machine at hand, so tests have been
> submitted for all the affected machines.
>
> http://ck.kolivas.org/patches/staircase-deadline/2.6.21-rc4-mm1-rsdl-0.34-test.patch
>
> ...
>
> Ok, the preliminary results are in and we seem to have good boots in the
> three machines I was hitting early boot oops.  So I think we can say
> that the new stack is a lot better than the old.
>
> Con, have a Tested-by:
> :/
>
> -apw

Well thank you very much indeed. I'm pleased that the code I decided to rip 
out of the next update also took whatever bug was there with it. Fortunately 
it also is not dependent on the buggy sched: accurate user accounting patch 
that I gave up on so here is an incremental from the current -mm queue to 
this code without the "accurate user accounting patch" component for anyone 
who's trying to track just what I'm planning on moving forward with.

http://ck.kolivas.org/patches/staircase-deadline/2.6.21-rc4-mm1/sched-rsdl-sd-0.35-test.patch

Summary:
 3 files changed, 86 insertions(+), 249 deletions(-)

It also makes lists-add_list_splice_tail.patch unnecessary

-- 
-ck
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] sched: accurate user accounting

2007-03-26 Thread Con Kolivas
On Monday 26 March 2007 15:11, Al Boldi wrote:
> Con Kolivas wrote:
> > Ok this one is heavily tested. Please try it when you find the time.
>
> It's better, but still skewed.  Try two chew.c's; they account 80% each.
>
> > ---
> > Currently we only do cpu accounting to userspace based on what is
> > actually happening precisely on each tick. The accuracy of that
> > accounting gets progressively worse the lower HZ is. As we already keep
> > accounting of nanosecond resolution we can accurately track user cpu,
> > nice cpu and idle cpu if we move the accounting to update_cpu_clock with
> > a nanosecond cpu_usage_stat entry.
>
> That's great and much needed, but this is still probed; so what's wrong
> with doing it in-lined?
>
> > This increases overhead slightly but
> > avoids the problem of tick aliasing errors making accounting unreliable.
>
> Higher scheduling accuracy may actually offset any overhead incurred, so
> it's well worth it; and if it's in-lined it should mean even less overhead.
>
> > +   /* Sanity check. It should never go backwards or ruin accounting */
> > +   if (unlikely(now < p->last_ran))
> > +   goto out_set;
>
> If sched_clock() goes backwards, why not fix it, instead of hacking around
> it?
>
>
> Thanks!

Actually I'm going to give up this idea as not worth my effort given the 
sched_clock fsckage that seems to cause so much grief. If someone else wants 
to take up the challenge feel free to.

Andrew please drop this patch. It's still broken and I have too much on my 
plate to try and debug it sorry.

-- 
-ck
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
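
The "tick aliasing" problem mentioned in the quoted changelog above is easy to reproduce in miniature. In this toy model (illustrative only, not kernel code) a task uses 40% of the cpu but always sleeps across the tick boundary, so whole-tick sampling credits it nothing while delta-based accounting gets it right:

#include <stdio.h>

int main(void)
{
	const long tick_us = 1000, run_us = 400, ticks = 1000;
	long tick_accounted = 0, ns_accounted = 0;
	long t;

	for (t = 0; t < ticks; t++) {
		/* task runs in the first 400us of every tick window ... */
		ns_accounted += run_us;
		/* ... but is never on the cpu when the tick fires at 1000us,
		   so tick-based sampling charges it nothing at all */
		tick_accounted += 0;
	}
	printf("tick-sampled : %ld%% cpu\n",
	       100 * tick_accounted / (ticks * tick_us));
	printf("delta-based  : %ld%% cpu\n",
	       100 * ns_accounted / (ticks * tick_us));
	return 0;
}

The lower HZ is, the coarser the sampling window and the worse the aliasing, which is what the (ultimately dropped) patch tried to avoid by charging the actual nanosecond deltas in update_cpu_clock().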


Re: rSDl cpu scheduler version 0.34-test patch

2007-03-26 Thread Con Kolivas
On Monday 26 March 2007 15:00, Mike Galbraith wrote:
> On Mon, 2007-03-26 at 11:00 +1000, Con Kolivas wrote:
> > This is just for testing at the moment! The reason is the size of this
> > patch.
>
> (no testing done yet, but I have a couple comments)
>
> > In the interest of evolution, I've taken the RSDL cpu scheduler and
> > increased the resolution of the task timekeeping to nanosecond
> > resolution.
>
> + /* All the userspace visible cpu accounting is done here */
> + time_diff = now - p->last_ran;
> ...
> + /* cpu scheduler quota accounting is performed here */
> + if (p->policy != SCHED_FIFO)
> + p->time_slice -= time_diff;
>
> If we still have any jiffies resolution clocks out there, this could be
> a bit problematic.

Works fine with jiffy only resolution. sched_clock just returns the change 
when it happens. This leaves us with the accuracy of the previous code on 
hardware that doesn't give higher resolution time from sched_clock.

> +static inline void enqueue_pulled_task(struct rq *src_rq, struct rq *rq,
> +struct task_struct *p)
> +{
> + int queue_prio;
> +
> + p->array = rq->active; <== set
> + if (!rt_task(p)) {
> + if (p->rotation == src_rq->prio_rotation) {
> + if (p->array == src_rq->expired) { <== evaluate

I don't see a problem.

> + queue_expired(p, rq);
> + goto out_queue;
> + }
> + if (p->time_slice < 0)
> + task_new_array(p, rq);
> + } else
> + task_new_array(p, rq);
> + }
> + queue_prio = next_entitled_slot(p, rq);
>
> (bug aside, this special function really shouldn't exist imho, because
> there's nothing special going on.  we didn't need it before to do the
> same thing, so we shouldn't need it now.)

As the comment says, the likelihood that both runqueues happen to be at the 
same priority_level is very low so the exact position cannot be transposed in 
my opinion. I'll see if I can simplify it though.

> +static void recalc_task_prio(struct task_struct *p, struct rq *rq)
> +{
> + struct prio_array *array = rq->active;
> + int queue_prio;
> +
> + if (p->rotation == rq->prio_rotation) {
> + if (p->array == array) {
> + if (p->time_slice > 0)
> + return;
> + p->time_slice = p->quota;
> + } else if (p->array == rq->expired) {
> + queue_expired(p, rq);
> + return;
> + } else
> + task_new_array(p, rq);
> + } else
>
> Dequeueing a task still leaves a stale p->array laying around to be
> possibly evaluated later.

I don't see quite why that's a problem. If there's memory of the last dequeue 
and it enqueues at a different rotation it gets ignored. If it enqueues 
during the same rotation then that memory proves useful for ensuring it 
doesn't get a new full quota. Either way the array is always updated on 
enqueue so it wont be trying to add it to the wrong runlist.

> try_to_wake_up() doesn't currently evaluate 
> and set p->rotation (but should per design doc),

try_to_wake_up->activate_task->enqueue_task->recalc_task_prio which updates 
p->rotation

> so when you get here, a 
> cross-cpu waking task won't continue it's rotation.  If it did evaluate
> and set, recalc_task_prio() would evaluate the guaranteed to fail these
> tests array pointer, so the task will still not continue it's rotation.

> Stale pointers are evil.

I prefer to use the array value as a memory in case it wakes up on the same 
rotation and runqueue.

>
>   -Mike

Thanks.

-- 
-ck
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: debug rsdl 0.33

2007-03-26 Thread Con Kolivas
On Tuesday 27 March 2007 01:28, Andy Whitcroft wrote:
> Andy Whitcroft wrote:
> > Subsequent to that Con suggested testing a refactored RSDL patch.  That
> > patch seemed to work on the machine at hand, so tests have been
> > submitted for all the affected machines.
> >
> > http://ck.kolivas.org/patches/staircase-deadline/2.6.21-rc4-mm1-rsdl-0.34-test.patch
>
> ...
>
> Ok, the preliminary results are in and we seem to have good boots in the
> three machines I was hitting early boot oops.  So I think we can say
> that the new stack is a lot better than the old.
>
> Con, have a Tested-by:
> :/
>
> -apw

Well thank you very much indeed. I'm pleased that the code I decided to rip 
out of the next update also took whatever bug was there with it. Fortunately 
it also is not dependent on the buggy sched: accurate user accounting patch 
that I gave up on, so here is an incremental from the current -mm queue to 
this code without the accurate user accounting patch component, for anyone 
who's trying to track just what I'm planning on moving forward with.

http://ck.kolivas.org/patches/staircase-deadline/2.6.21-rc4-mm1/sched-rsdl-sd-0.35-test.patch

Summary:
 3 files changed, 86 insertions(+), 249 deletions(-)

It also makes lists-add_list_splice_tail.patch unnecessary

-- 
-ck
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH][ 0/5] Staircase deadline v0.36

2007-03-26 Thread Con Kolivas
What follows is a clean major iteration of the (now) Staircase Deadline cpu 
scheduler.

Changes from RSDL v0.33:
- All accounting is moved to tasks in nanosecond resolution, removing the 
requirement for the rotation component entirely
- list_splice_tail is no longer required; dropped
- Nicer nice with smaller rr_intervals if HZ tolerates it
- Reworked SCHED_BATCH to keep the same cpu bandwidth but lax latency
- Updated documentation

Patches that follow are for 2.6.21-rc5

Andrew please use these to replace the RSDL patches.

-- 
-ck
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH][ 1/5] sched: dont renice kernel threads

2007-03-26 Thread Con Kolivas
The practice of renicing kernel threads to negative nice values is of
questionable benefit at best, and at worst leads to larger latencies when
kernel threads are busy on behalf of other tasks.

Signed-off-by: Con Kolivas [EMAIL PROTECTED]
---

 kernel/workqueue.c |2 --
 1 file changed, 2 deletions(-)

Index: linux-2.6.21-rc5-sd/kernel/workqueue.c
===
--- linux-2.6.21-rc5-sd.orig/kernel/workqueue.c 2007-03-26 11:03:31.0 
+1000
+++ linux-2.6.21-rc5-sd/kernel/workqueue.c  2007-03-27 11:52:54.0 
+1000
@@ -355,8 +355,6 @@ static int worker_thread(void *__cwq)
 	if (!cwq->freezeable)
 		current->flags |= PF_NOFREEZE;
 
-	set_user_nice(current, -5);
-
 	/* Block and flush all signals */
 	sigfillset(&blocked);
 	sigprocmask(SIG_BLOCK, &blocked, NULL);

-- 
-ck
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH][ 2/5] sched: remove sleepavg from proc

2007-03-26 Thread Con Kolivas
Remove the sleep_avg field from proc output as it will be removed from the
task_struct.

Signed-off-by: Con Kolivas [EMAIL PROTECTED]
---

 fs/proc/array.c |2 --
 1 file changed, 2 deletions(-)

Index: linux-2.6.21-rc5-sd/fs/proc/array.c
===
--- linux-2.6.21-rc5-sd.orig/fs/proc/array.c2007-03-26 11:03:31.0 
+1000
+++ linux-2.6.21-rc5-sd/fs/proc/array.c 2007-03-27 11:52:55.0 +1000
@@ -165,7 +165,6 @@ static inline char * task_state(struct t
rcu_read_lock();
buffer += sprintf(buffer,
 		"State:\t%s\n"
-		"SleepAVG:\t%lu%%\n"
 		"Tgid:\t%d\n"
 		"Pid:\t%d\n"
 		"PPid:\t%d\n"
@@ -173,7 +172,6 @@ static inline char * task_state(struct t
 		"Uid:\t%d\t%d\t%d\t%d\n"
 		"Gid:\t%d\t%d\t%d\t%d\n",
 		get_task_state(p),
-		(p->sleep_avg/1024)*100/(1020000000/1024),
 		p->tgid, p->pid,
 		pid_alive(p) ? rcu_dereference(p->real_parent)->tgid : 0,
 		pid_alive(p) && p->ptrace ? rcu_dereference(p->parent)->pid : 0,

-- 
-ck
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH][ 4/5] sched: remove noninteractive flag

2007-03-26 Thread Con Kolivas
Remove the TASK_NONINTERACTIVE flag as it will no longer be used.

Signed-off-by: Con Kolivas [EMAIL PROTECTED]

---

 fs/pipe.c |7 +--
 include/linux/sched.h |3 +--
 2 files changed, 2 insertions(+), 8 deletions(-)

Index: linux-2.6.21-rc5-sd/fs/pipe.c
===
--- linux-2.6.21-rc5-sd.orig/fs/pipe.c  2007-03-26 11:03:31.0 +1000
+++ linux-2.6.21-rc5-sd/fs/pipe.c   2007-03-27 11:52:55.0 +1000
@@ -41,12 +41,7 @@ void pipe_wait(struct pipe_inode_info *p
 {
DEFINE_WAIT(wait);
 
-   /*
-* Pipes are system-local resources, so sleeping on them
-* is considered a noninteractive wait:
-*/
-	prepare_to_wait(&pipe->wait, &wait,
-			TASK_INTERRUPTIBLE | TASK_NONINTERACTIVE);
+	prepare_to_wait(&pipe->wait, &wait, TASK_INTERRUPTIBLE);
 	if (pipe->inode)
 		mutex_unlock(&pipe->inode->i_mutex);
schedule();
Index: linux-2.6.21-rc5-sd/include/linux/sched.h
===
--- linux-2.6.21-rc5-sd.orig/include/linux/sched.h  2007-03-27 
11:52:55.0 +1000
+++ linux-2.6.21-rc5-sd/include/linux/sched.h   2007-03-27 11:52:55.0 
+1000
@@ -149,8 +149,7 @@ extern unsigned long weighted_cpuload(co
 #define EXIT_ZOMBIE		16
 #define EXIT_DEAD		32
 /* in tsk->state again */
-#define TASK_NONINTERACTIVE	64
-#define TASK_DEAD		128
+#define TASK_DEAD		64
 
 #define __set_task_state(tsk, state_value) \
	do { (tsk)->state = (state_value); } while (0)

-- 
-ck
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH][ 5/5] sched: document sd cpu scheduler

2007-03-26 Thread Con Kolivas
Add comprehensive documentation of the Staircase Deadline cpu scheduler design.

Signed-off-by: Con Kolivas [EMAIL PROTECTED]

---

 Documentation/sched-design.txt |  240 +++--
 1 file changed, 234 insertions(+), 6 deletions(-)

Index: linux-2.6.21-rc5-sd/Documentation/sched-design.txt
===
--- linux-2.6.21-rc5-sd.orig/Documentation/sched-design.txt 2006-11-30 
11:30:31.0 +1100
+++ linux-2.6.21-rc5-sd/Documentation/sched-design.txt  2007-03-27 
11:52:55.0 +1000
@@ -1,11 +1,14 @@
-  Goals, Design and Implementation of the
- new ultra-scalable O(1) scheduler
+ Goals, Design and Implementation of the ultra-scalable O(1) scheduler by
+ Ingo Molnar and the Staircase Deadline cpu scheduler policy designed by
+ Con Kolivas.
 
 
-  This is an edited version of an email Ingo Molnar sent to
-  lkml on 4 Jan 2002.  It describes the goals, design, and
-  implementation of Ingo's new ultra-scalable O(1) scheduler.
-  Last Updated: 18 April 2002.
+  This was originally an edited version of an email Ingo Molnar sent to
+  lkml on 4 Jan 2002.  It describes the goals, design, and implementation
+  of Ingo's ultra-scalable O(1) scheduler. It now contains a description
+  of the Staircase Deadline priority scheduler that was built on this
+  design.
+  Last Updated: Tue Mar 27 2007
 
 
 Goal
@@ -163,3 +166,228 @@ certain code paths and data constructs. 
 code is smaller than the old one.
 
Ingo
+
+
+Staircase Deadline cpu scheduler policy
+
+
+Design summary
+==
+
+A novel design which incorporates a foreground-background descending priority
+system (the staircase) via a bandwidth allocation matrix according to nice
+level.
+
+
+Features
+
+
+A starvation free, strict fairness O(1) scalable design with interactivity
+as good as the above restrictions can provide. There is no interactivity
+estimator, no sleep/run measurements and only simple fixed accounting.
+The design and accounting are strict enough that task behaviour
+can be modelled and maximum scheduling latencies can be predicted by
+the virtual deadline mechanism that manages runqueues. The prime concern
+in this design is to maintain fairness at all costs determined by nice level,
+yet to maintain as good interactivity as can be allowed within the
+constraints of strict fairness.
+
+
+Design description
+==
+
+SD works off the principle of providing each task a quota of runtime that it is
+allowed to run at a number of priority levels determined by its static priority
+(ie. its nice level). If the task uses up its quota it has its priority
+decremented to the next level determined by a priority matrix. Once every
+runtime quota has been consumed of every priority level, a task is queued on the
+expired array. When no other tasks exist with quota, the expired array is
+activated and fresh quotas are handed out. This is all done in O(1).
+
+Design details
+==
+
+Each task keeps a record of its own entitlement of cpu time. Most of the rest of
+these details apply to non-realtime tasks as rt task management is
+straightforward.
+
+Each runqueue keeps a record of what major epoch it is up to in the
+rq->prio_rotation field which is incremented on each major epoch. It also
+keeps a record of the current prio_level for each static priority task.
+
+Each task keeps a record of what major runqueue epoch it was last running
+on in p->rotation. It also keeps a record of what priority levels it has
+already been allocated quota from during this epoch in a bitmap p->bitmap.
+
+The only tunable that determines all other details is the RR_INTERVAL. This
+is set to 8ms, and is scaled gently upwards with more cpus. This value is
+tunable via a /proc interface.
+
+All tasks are initially given a quota based on RR_INTERVAL. This is equal to
+RR_INTERVAL between nice values of -6 and 0, half that size above nice 0, and
+progressively larger for nice values from -1 to -20. This is assigned to
+p->quota and only changes with changes in nice level.
+
+As a task is first queued, it checks in recalc_task_prio to see if it has run at
+this runqueue's current priority rotation. If it has not, it will have its
+p->prio level set according to the first slot in a priority matrix and will be
+given a p->time_slice equal to the p->quota, and has its allocation bitmap bit
+set in p->bitmap for this prio level. It is then queued on the current active
+priority array.
+
+If a task has already been running during this major epoch, and it has
+p->time_slice left and the rq->prio_quota for the task's p->prio still
+has quota, it will be placed back on the active array, but no more quota
+will be added.
+
+If a task has been running during this major epoch, but does not have
+p->time_slice left, it will find the next lowest priority in its bitmap

Re: RSDL 0.31 causes slowdown

2007-03-25 Thread Con Kolivas
On Saturday 24 March 2007 04:57, Tim Chen wrote:
> On Fri, 2007-03-23 at 13:40 +1100, Con Kolivas wrote:
> > Volanomark is a purely yield() semantic dependant workload (as
> > discussed many times previously). In the earlier form of RSDL I
> > softened the effect of sched_yield but other changes since then have
> > made that softness bordering on a noop. Obviously when sched_yield is
> > relied upon that will not be enough. Extending the rr interval simply
> > makes the yield slightly more effective and is not the proper
> > workaround. Since expiration of arrays is a regular frequent
> > occurrence in RSDL then changing yield semantics back to expiration
> > should cause a massive improvement in these values, without making the
> > yields as long as in mainline. It's impossible to know exactly what
> > the final result will be since java uses this timing sensitive yield
> > for locking but we can improve it drastically from this. I'll make a
> > patch soon to change yield again.
>
> Con,
>
> The new RSDL 0.33 has fully recovered the loss in performance for
> Volanomark.  The throughput for Volanomark is at the same level as
> mainline 2.6.21-rc4 kernel.
>
> Tim

Thanks very much for testing. I'm quite happy with the yield semantics staying 
the way they are in rSDl 0.33+.

-- 
-ck
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


rSDl cpu scheduler version 0.34-test patch

2007-03-25 Thread Con Kolivas
This is just for testing at the moment! The reason is the size of this patch.

In the interest of evolution, I've taken the RSDL cpu scheduler and increased 
the resolution of the task timekeeping to nanosecond resolution. This removes 
the need for the runqueue rotation component entirely out of RSDL. The design 
basically is mostly unchanged, minus over 150 lines of code for the rotation, 
yet should be slightly better performing. It should be indistinguishable in 
usage from v0.33.
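
A rough before-and-after sketch of what the extra resolution buys (purely 
illustrative, using the field names from the patch rather than its actual 
code):

	/* Old, tick based: whoever owns the timer tick is charged a whole
	 * jiffy, however little of it the task actually ran. */
	p->time_slice--;

	/* New, nanosecond based: each task is charged exactly what it ran,
	 * accounted at the scheduler tick and at every context switch. */
	time_diff = now - p->last_ran;
	if (p->policy != SCHED_FIFO)
		p->time_slice -= time_diff;
	p->last_ran = now;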

Other changes from v0.33:
-rr interval was not being properly scaled with HZ
-fix possible race in checking task_queued in task_running_tick
-scale down rr interval for niced tasks if HZ can tolerate it
-cull list_splice_tail

What does this mean for the next version of RSDL?

Assuming all works as expected on these test patches, it will be cleanest to 
submit a new series of patches for -mm with the renamed Staircase-Deadline 
scheduler and new documentation (when it's done).


So for testing here are full rollups for 2.6.20.4 and 2.6.21-rc4:
http://ck.kolivas.org/patches/staircase-deadline/2.6.20.4-sd-0.34-test.patch
http://ck.kolivas.org/patches/staircase-deadline/2.6.21-rc4-sd-0.34-test.patch

The patches available also include a rollup of sched: accurate user accounting 
as this code touches the same area and it is most convenient to include them 
together.

(incrementals in each subdir of staircase-deadline/ for those interested).

Thanks Mike for continuing to attempt to use the cluebat on me on this one. 
From the start I wasn't sure if this was necessary or not but it ends up being 
less code than RSDL.

While I'm still far from being well, luckily I am in much better shape to be 
able to spend the time at the pc to have done this. Thanks to all those who 
expressed their concern.

-- 
-ck
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] sched: accurate user accounting

2007-03-25 Thread Con Kolivas
On Monday 26 March 2007 09:01, Con Kolivas wrote:
> On Monday 26 March 2007 03:14, malc wrote:
> > On Mon, 26 Mar 2007, Con Kolivas wrote:
> > > On Monday 26 March 2007 01:19, malc wrote:
> > Erm... i just looked at the code and suddenly it stopped making any sense
> > at all:
> >
> >  p->last_ran = rq->most_recent_timestamp = now;
> >  /* Sanity check. It should never go backwards or ruin accounting */
> >  if (unlikely(now < p->last_ran))
> >  return;
> >  time_diff = now - p->last_ran;
> >
> > First `now' is assigned to `p->last_ran' and the very next line
> > compares those two values, and then the difference is taken.. I quite
> > frankly am either very tired or fail to see the point.. time_diff is
> > either always zero or there's always a race here.
>
> Bah major thinko error on my part! That will teach me to post patches
> untested at 1:30 am. I'll try again shortly sorry.

Ok this one is heavily tested. Please try it when you find the time.

---
Currently we only do cpu accounting to userspace based on what is 
actually happening precisely on each tick. The accuracy of that 
accounting gets progressively worse the lower HZ is. As we already keep 
accounting of nanosecond resolution we can accurately track user cpu, 
nice cpu and idle cpu if we move the accounting to update_cpu_clock with 
a nanosecond cpu_usage_stat entry. This increases overhead slightly but 
avoids the problem of tick aliasing errors making accounting unreliable.

Remove the now defunct Documentation/cpu-load.txt file.

Signed-off-by: Con Kolivas <[EMAIL PROTECTED]>

---
 Documentation/cpu-load.txt  |  113 
 include/linux/kernel_stat.h |3 +
 include/linux/sched.h   |2 
 kernel/sched.c  |   58 +-
 kernel/timer.c  |5 -
 5 files changed, 60 insertions(+), 121 deletions(-)

Index: linux-2.6.21-rc4-acct/include/linux/kernel_stat.h
===
--- linux-2.6.21-rc4-acct.orig/include/linux/kernel_stat.h  2007-03-26 
00:56:05.0 +1000
+++ linux-2.6.21-rc4-acct/include/linux/kernel_stat.h   2007-03-26 
00:56:25.0 +1000
@@ -16,11 +16,14 @@
 
 struct cpu_usage_stat {
cputime64_t user;
+   cputime64_t user_ns;
cputime64_t nice;
+   cputime64_t nice_ns;
cputime64_t system;
cputime64_t softirq;
cputime64_t irq;
cputime64_t idle;
+   cputime64_t idle_ns;
cputime64_t iowait;
cputime64_t steal;
 };
Index: linux-2.6.21-rc4-acct/include/linux/sched.h
===
--- linux-2.6.21-rc4-acct.orig/include/linux/sched.h2007-03-26 
00:56:05.0 +1000
+++ linux-2.6.21-rc4-acct/include/linux/sched.h 2007-03-26 00:57:01.0 
+1000
@@ -882,7 +882,7 @@ struct task_struct {
int __user *clear_child_tid;/* CLONE_CHILD_CLEARTID */
 
unsigned long rt_priority;
-   cputime_t utime, stime;
+   cputime_t utime, utime_ns, stime;
unsigned long nvcsw, nivcsw; /* context switch counts */
struct timespec start_time;
 /* mm fault and swap info: this can arguably be seen as either mm-specific or 
thread-specific */
Index: linux-2.6.21-rc4-acct/kernel/sched.c
===
--- linux-2.6.21-rc4-acct.orig/kernel/sched.c   2007-03-26 00:56:05.0 
+1000
+++ linux-2.6.21-rc4-acct/kernel/sched.c2007-03-26 09:38:50.0 
+1000
@@ -89,6 +89,7 @@ unsigned long long __attribute__((weak))
  */
 #define NS_TO_JIFFIES(TIME)	((TIME) / (1000000000 / HZ))
 #define JIFFIES_TO_NS(TIME)	((TIME) * (1000000000 / HZ))
+#define JIFFY_NS   JIFFIES_TO_NS(1)
 
 /*
  * These are the 'tuning knobs' of the scheduler:
@@ -3017,8 +3018,59 @@ EXPORT_PER_CPU_SYMBOL(kstat);
 static inline void
 update_cpu_clock(struct task_struct *p, struct rq *rq, unsigned long long now)
 {
-   p->sched_time += now - p->last_ran;
-   p->last_ran = rq->most_recent_timestamp = now;
+   struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
+   cputime64_t time_diff;
+
+   /* Sanity check. It should never go backwards or ruin accounting */
+   if (unlikely(now < p->last_ran))
+   goto out_set;
+   /* All the userspace visible cpu accounting is done here */
+   time_diff = now - p->last_ran;
+   p->sched_time += time_diff;
+   if (p != rq->idle) {
+   cputime_t utime_diff = time_diff;
+
+   if (TASK_NICE(p) > 0) {
+   cpustat->nice_ns = cputime64_add(cpustat->nice_ns,
+time_diff);
+  

Re: [patch] sched: accurate user accounting

2007-03-25 Thread Con Kolivas
On Monday 26 March 2007 03:14, malc wrote:
> On Mon, 26 Mar 2007, Con Kolivas wrote:
> > On Monday 26 March 2007 01:19, malc wrote:
> >> On Mon, 26 Mar 2007, Con Kolivas wrote:
> >>> So before we go any further with this patch, can you try the following
> >>> one and see if this simple sanity check is enough?
> >>
> >> Sure (compiling the kernel now), too bad old axiom that testing can not
> >> confirm absence of bugs holds.
> >>
> >> I have one nit and one request from clarification. Question first (i
> >> admit i haven't looked at the surroundings of the patch maybe things
> >> would have been are self evident if i did):
> >>
> >> What this patch amounts to is that the accounting logic is moved from
> >> timer interrupt to the place where scheduler switches task (or something
> >> to that effect)?
> >
> > Both the scheduler tick and context switch now. So yes it adds overhead
> > as I said, although we already do update_cpu_clock on context switch, but
> > it's not this complex.
> >
> >> [..snip..]
> >>
> >>>  * These are the 'tuning knobs' of the scheduler:
> >>> @@ -3017,8 +3018,53 @@ EXPORT_PER_CPU_SYMBOL(kstat);
> >>> static inline void
> >>> update_cpu_clock(struct task_struct *p, struct rq *rq, unsigned long
> >>> long now) {
> >>> - p->sched_time += now - p->last_ran;
> >>> + struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
> >>> + cputime64_t time_diff;
> >>> +
> >>>   p->last_ran = rq->most_recent_timestamp = now;
> >>> + /* Sanity check. It should never go backwards or ruin accounting */
> >>> + if (unlikely(now < p->last_ran))
> >>> + return;
> >>> + time_diff = now - p->last_ran;
> >>
> >> A nit. Anything wrong with:
> >>
> >> time_diff = now - p->last_ran;
> >> if (unlikeley (LESS_THAN_ZERO (time_diff))
> >>  return;
> >
> > Does LESS_THAN_ZERO work on a cputime64_t on all arches? I can't figure
> > that out just by looking myself which is why I did it the other way.
>
> I have no idea what type cputime64_t really is, so used this imaginary
> LESS_THAN_ZERO thing.
>
> Erm... i just looked at the code and suddenly it stopped making any sense
> at all:
>
>  p->last_ran = rq->most_recent_timestamp = now;
>  /* Sanity check. It should never go backwards or ruin accounting */
>  if (unlikely(now < p->last_ran))
>  return;
>  time_diff = now - p->last_ran;
>
> First `now' is assigned to `p->last_ran' and the very next line
> compares those two values, and then the difference is taken.. I quite
> frankly am either very tired or fail to see the point.. time_diff is
> either always zero or there's always a race here.

Bah major thinko error on my part! That will teach me to post patches untested 
at 1:30 am. I'll try again shortly sorry.

-- 
-ck
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: debug rsdl 0.33

2007-03-25 Thread Con Kolivas
On Monday 26 March 2007 08:49, Con Kolivas wrote:
> On Monday 26 March 2007 04:28, Torsten Kaiser wrote:
> > On 3/24/07, Con Kolivas <[EMAIL PROTECTED]> wrote:
> > >  kernel/sched.c |   51
> > > +++ 1 file changed, 51
> > > insertions(+)
> >
> > 2.6.21-rc4-mm1 also fails for me.
> >
> > I tried pure 2.6.21-rc4-mm1, +hotfixes, +hotfixes+rsdl33 and at last
> > also added above debug patch.
>
> Thank you very much for the effort!
>
> > The oops from with the debug-patch added:
> > [   65.426126] Freeing unused kernel memory: 312k freed
> > (on the console the system is starting up, getting until "Letting udev
> > process events ...")
> > [   66.665611] Unable to handle kernel NULL pointer dereference at
> > 0020 RIP:
> > [   66.682030]  [] __sched_text_start+0x4dc/0xa0e
>
> The debug patch didn't do anything. This means it is not an unset bitmap
> problem at all otherwise it should have self corrected itself.
>
> > The system in x86_64, two 2218 on a MCP55 nvidia chipset.
> >
> > 2.6.21-rc3-mm1 works fine.
> >
> > (gdb) list *0x8026167c
> > 0x8026167c is in schedule (kernel/sched.c:3619).
>
>   next = list_entry(queue->next, struct task_struct, run_list);
>   rq->prio_level = idx;
>
> > 3614		/*
> > 3615		 * When the task is chosen it is checked to see if its quota has been
> > 3616		 * added to this runqueue level which is only performed once per
> > 3617		 * level per major rotation for each running task.
> > 3618		 */
> > 3619		if (next->rotation != rq->prio_rotation) {
>
> Urgh. Dereferencing there? That can only be next that's dereferencing, meaning
> the run_list entry is bogus. That should only ever be done under runqueue
> lock so I have a race somewhere where it's not. Time for more looking.

This is about the only place I can see the run_list is looked at unlocked. Can
you see if this simple patch helps? The debug patch is unnecessary now.

Thanks!

--
Ensure checking task_queued() is only done under runqueue lock.

Signed-off-by: Con Kolivas <[EMAIL PROTECTED]>

---
 kernel/sched.c |   10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

Index: linux-2.6.21-rc4-mm1/kernel/sched.c
===
--- linux-2.6.21-rc4-mm1.orig/kernel/sched.c2007-03-26 08:54:15.0 
+1000
+++ linux-2.6.21-rc4-mm1/kernel/sched.c 2007-03-26 08:55:21.0 +1000
@@ -3421,16 +3421,16 @@ static inline void rotate_runqueue_prior
 
 static void task_running_tick(struct rq *rq, struct task_struct *p, int tick)
 {
-   if (unlikely(!task_queued(p))) {
-   /* Task has expired but was not scheduled yet */
-   set_tsk_need_resched(p);
-   return;
-   }
/* SCHED_FIFO tasks never run out of timeslice. */
if (unlikely(p->policy == SCHED_FIFO))
return;
 
	spin_lock(&rq->lock);
+   if (unlikely(!task_queued(p))) {
+   /* Task has expired but was not scheduled off yet */
+   set_tsk_need_resched(p);
+   goto out_unlock;
+   }
/*
 * Accounting is performed by both the task and the runqueue. This
 * allows frequently sleeping tasks to get their proper quota of


-- 
-ck
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: debug rsdl 0.33

2007-03-25 Thread Con Kolivas
On Monday 26 March 2007 04:28, Torsten Kaiser wrote:
> On 3/24/07, Con Kolivas <[EMAIL PROTECTED]> wrote:
> >  kernel/sched.c |   51
> > +++ 1 file changed, 51
> > insertions(+)
>
> 2.6.21-rc4-mm1 also fails for me.
>
> I tried pure 2.6.21-rc4-mm1, +hotfixes, +hotfixes+rsdl33 and at last
> also added above debug patch.

Thank you very much for the effort!
>
> The oops from with the debug-patch added:
> [   65.426126] Freeing unused kernel memory: 312k freed
> (on the console the system is starting up, getting until "Letting udev
> process events ...")
> [   66.665611] Unable to handle kernel NULL pointer dereference at
> 0020 RIP:
> [   66.682030]  [] __sched_text_start+0x4dc/0xa0e

The debug patch didn't do anything. This means it is not an unset bitmap 
problem at all otherwise it should have self corrected itself.

> The system in x86_64, two 2218 on a MCP55 nvidia chipset.
>
> 2.6.21-rc3-mm1 works fine.
>
> (gdb) list *0x8026167c
> 0x8026167c is in schedule (kernel/sched.c:3619).

next = list_entry(queue->next, struct task_struct, run_list);
rq->prio_level = idx;

> 3614		/*
> 3615		 * When the task is chosen it is checked to see if its quota has been
> 3616		 * added to this runqueue level which is only performed once per
> 3617		 * level per major rotation for each running task.
> 3618		 */
> 3619		if (next->rotation != rq->prio_rotation) {

Urgh. Dereferencing there? That can only be next that's dereferencing, meaning 
the run_list entry is bogus. That should only ever be done under runqueue 
lock so I have a race somewhere where it's not. Time for more looking.

> Torsten

Thanks!

-- 
-ck
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] sched: accurate user accounting

2007-03-25 Thread Con Kolivas
On Monday 26 March 2007 01:19, malc wrote:
> On Mon, 26 Mar 2007, Con Kolivas wrote:
> > So before we go any further with this patch, can you try the following
> > one and see if this simple sanity check is enough?
>
> Sure (compiling the kernel now), too bad old axiom that testing can not
> confirm absence of bugs holds.
>
> I have one nit and one request from clarification. Question first (i
> admit i haven't looked at the surroundings of the patch maybe things
> would have been are self evident if i did):
>
> What this patch amounts to is that the accounting logic is moved from
> timer interrupt to the place where scheduler switches task (or something
> to that effect)?

Both the scheduler tick and context switch now. So yes it adds overhead as I 
said, although we already do update_cpu_clock on context switch, but it's not 
this complex.

> [..snip..]
>
> >  * These are the 'tuning knobs' of the scheduler:
> > @@ -3017,8 +3018,53 @@ EXPORT_PER_CPU_SYMBOL(kstat);
> > static inline void
> > update_cpu_clock(struct task_struct *p, struct rq *rq, unsigned long long
> > now) {
> > -   p->sched_time += now - p->last_ran;
> > +   struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
> > +   cputime64_t time_diff;
> > +
> > p->last_ran = rq->most_recent_timestamp = now;
> > +   /* Sanity check. It should never go backwards or ruin accounting */
> > +   if (unlikely(now < p->last_ran))
> > +   return;
> > +   time_diff = now - p->last_ran;
>
> A nit. Anything wrong with:
>
> time_diff = now - p->last_ran;
> if (unlikeley (LESS_THAN_ZERO (time_diff))
>  return;

Does LESS_THAN_ZERO work on a cputime64_t on all arches? I can't figure that 
out just by looking myself which is why I did it the other way.
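
To illustrate the point, a tiny sketch (not from any of the patches) of why 
the check is written against the raw timestamps rather than the difference, 
assuming cputime64_t is an unsigned 64-bit quantity:

	/* With an unsigned type a "negative" interval just wraps to a huge
	 * positive number, so test the inputs, not the result. */
	if (unlikely(now < p->last_ran))	/* clock appears to run backwards */
		return;
	time_diff = now - p->last_ran;		/* guaranteed non-negative here */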

-- 
-ck
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] sched: accurate user accounting

2007-03-25 Thread Con Kolivas
On Monday 26 March 2007 00:57, malc wrote:
> On Mon, 26 Mar 2007, Con Kolivas wrote:
> > On Sunday 25 March 2007 23:06, malc wrote:
> >> On Sun, 25 Mar 2007, Con Kolivas wrote:
> >>> On Sunday 25 March 2007 21:46, Con Kolivas wrote:
> >>>> On Sunday 25 March 2007 21:34, malc wrote:
> >>>>> On Sun, 25 Mar 2007, Ingo Molnar wrote:
> >>>>>> * Con Kolivas <[EMAIL PROTECTED]> wrote:
> >>>>>>> For an rsdl 0.33 patched kernel. Comments? Overhead worth it?
> >>
> >> [..snip..]
> >>
> >>> ---
> >>> Currently we only do cpu accounting to userspace based on what is
> >>> actually happening precisely on each tick. The accuracy of that
> >>> accounting gets progressively worse the lower HZ is. As we already keep
> >>> accounting of nanosecond resolution we can accurately track user cpu,
> >>> nice cpu and idle cpu if we move the accounting to update_cpu_clock
> >>> with a nanosecond cpu_usage_stat entry. This increases overhead
> >>> slightly but avoids the problem of tick aliasing errors making
> >>> accounting unreliable.
> >>>
> >>> Signed-off-by: Con Kolivas <[EMAIL PROTECTED]>
> >>> Signed-off-by: Ingo Molnar <[EMAIL PROTECTED]>
> >>
> >> [..snip..]
> >>
> >> Forgot to mention. Given that this goes into the kernel, shouldn't
> >> Documentation/cpu-load.txt be amended/removed?
> >
> > Yes that's a good idea. Also there should be a sanity check because
> > sometimes for some reason noone's been able to explain to me sched_clock
> > gives a value which doesn't make sense (time appears to have gone
> > backwards) and that will completely ruin the accounting from then on.
>
> After running this new kernel for a while i guess i have hit this issue:
> http://www.boblycat.org/~malc/apc/bad-load.png
>
> Top and icewm's monitor do show incredibly huge load while in reality
> nothing like that is really happening. Both ad-hoc and `/proc/stat' (idle)
> show normal CPU utilization (7% since i'm doing some A/V stuff in the
> background)

Yes I'd say you hit the problem I described earlier. When playing with
sched_clock() I found it gave some "interesting" results fairly infrequently.
They could lead to ridiculous accounting mistakes.

So before we go any further with this patch, can you try the following one and 
see if this simple sanity check is enough?

Thanks!

---
Currently we only do cpu accounting to userspace based on what is 
actually happening precisely on each tick. The accuracy of that 
accounting gets progressively worse the lower HZ is. As we already keep 
accounting of nanosecond resolution we can accurately track user cpu, 
nice cpu and idle cpu if we move the accounting to update_cpu_clock with 
a nanosecond cpu_usage_stat entry. This increases overhead slightly but 
avoids the problem of tick aliasing errors making accounting unreliable.

Remove the now defunct Documentation/cpu-load.txt file.

Signed-off-by: Con Kolivas <[EMAIL PROTECTED]>
---
 Documentation/cpu-load.txt  |  113 
 include/linux/kernel_stat.h |3 +
 include/linux/sched.h   |2 
 kernel/sched.c  |   50 ++-
 kernel/timer.c  |5 -
 5 files changed, 53 insertions(+), 120 deletions(-)

Index: linux-2.6.21-rc4-acct/include/linux/kernel_stat.h
===
--- linux-2.6.21-rc4-acct.orig/include/linux/kernel_stat.h  2007-03-26 
00:56:05.0 +1000
+++ linux-2.6.21-rc4-acct/include/linux/kernel_stat.h   2007-03-26 
00:56:25.0 +1000
@@ -16,11 +16,14 @@
 
 struct cpu_usage_stat {
cputime64_t user;
+   cputime64_t user_ns;
cputime64_t nice;
+   cputime64_t nice_ns;
cputime64_t system;
cputime64_t softirq;
cputime64_t irq;
cputime64_t idle;
+   cputime64_t idle_ns;
cputime64_t iowait;
cputime64_t steal;
 };
Index: linux-2.6.21-rc4-acct/include/linux/sched.h
===
--- linux-2.6.21-rc4-acct.orig/include/linux/sched.h2007-03-26 
00:56:05.0 +1000
+++ linux-2.6.21-rc4-acct/include/linux/sched.h 2007-03-26 00:57:01.0 
+1000
@@ -882,7 +882,7 @@ struct task_struct {
int __user *clear_child_tid;/* CLONE_CHILD_CLEARTID */
 
unsigned long rt_priority;
-   cputime_t utime, stime;
+   cputime_t utime, utime_ns, stime;
unsigned long nvcsw, nivcsw; /* context switch counts */
struct timespec start_time;
 /* mm fault and swap info: this can arguably be seen as either mm-specific or 
thread-spec

Re: [patch] sched: accurate user accounting

2007-03-25 Thread Con Kolivas
On Sunday 25 March 2007 23:06, malc wrote:
> On Sun, 25 Mar 2007, Con Kolivas wrote:
> > On Sunday 25 March 2007 21:46, Con Kolivas wrote:
> >> On Sunday 25 March 2007 21:34, malc wrote:
> >>> On Sun, 25 Mar 2007, Ingo Molnar wrote:
> >>>> * Con Kolivas <[EMAIL PROTECTED]> wrote:
> >>>>> For an rsdl 0.33 patched kernel. Comments? Overhead worth it?
>
> [..snip..]
>
> > ---
> > Currently we only do cpu accounting to userspace based on what is
> > actually happening precisely on each tick. The accuracy of that
> > accounting gets progressively worse the lower HZ is. As we already keep
> > accounting of nanosecond resolution we can accurately track user cpu,
> > nice cpu and idle cpu if we move the accounting to update_cpu_clock with
> > a nanosecond cpu_usage_stat entry. This increases overhead slightly but
> > avoids the problem of tick aliasing errors making accounting unreliable.
> >
> > Signed-off-by: Con Kolivas <[EMAIL PROTECTED]>
> > Signed-off-by: Ingo Molnar <[EMAIL PROTECTED]>
>
> [..snip..]
>
> Forgot to mention. Given that this goes into the kernel, shouldn't
> Documentation/cpu-load.txt be amended/removed?

Yes that's a good idea. Also there should be a sanity check because sometimes, 
for some reason no one has been able to explain to me, sched_clock gives a value 
which doesn't make sense (time appears to have gone backwards) and that will 
completely ruin the accounting from then on. 

-- 
-ck
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] sched: accurate user accounting

2007-03-25 Thread Con Kolivas
On Sunday 25 March 2007 22:32, Gene Heskett wrote:
> On Sunday 25 March 2007, Con Kolivas wrote:
> >On Sunday 25 March 2007 21:46, Con Kolivas wrote:
> >> On Sunday 25 March 2007 21:34, malc wrote:
> >> > On Sun, 25 Mar 2007, Ingo Molnar wrote:
> >> > > * Con Kolivas <[EMAIL PROTECTED]> wrote:
> >> > >> For an rsdl 0.33 patched kernel. Comments? Overhead worth it?
> >> > >
> >> > > we want to do this - and we should do this to the vanilla
> >> > > scheduler first and check the results. I've back-merged the patch
> >> > > to before RSDL and have tested it - find the patch below. Vale,
> >> > > could you try this patch against a 2.6.21-rc4-ish kernel and
> >> > > re-test your testcase?
> >> >
> >> > [..snip..]
> >> >
> >> > Compilation failed with:
> >> > kernel/built-in.o(.sched.text+0x564): more undefined references to
> >> > `__udivdi3' follow
> >> >
> >> > $ gcc --version | head -1
> >> > gcc (GCC) 3.4.6
> >> >
> >> > $ cat /proc/cpuinfo | grep cpu
> >> > cpu : 7447A, altivec supported
> >> >
> >> > Can't say i really understand why 64bit arithmetics suddenly became
> >> > an issue here.
> >>
> >> Probably due to use of:
> >>
> >> #define NS_TO_JIFFIES(TIME)	((TIME) / (1000000000 / HZ))
> >> #define JIFFIES_TO_NS(TIME)	((TIME) * (1000000000 / HZ))
> >>
> >> Excuse our 64bit world while we strive to correct our 32bit blindness
> >> and fix this bug.
> >
> >Please try this (akpm please don't include till we confirm it builds on
> > ppc, sorry). For 2.6.21-rc4
> >
> >---
> >Currently we only do cpu accounting to userspace based on what is
> >actually happening precisely on each tick. The accuracy of that
> >accounting gets progressively worse the lower HZ is. As we already keep
> >accounting of nanosecond resolution we can accurately track user cpu,
> >nice cpu and idle cpu if we move the accounting to update_cpu_clock with
> >a nanosecond cpu_usage_stat entry. This increases overhead slightly but
> >avoids the problem of tick aliasing errors making accounting unreliable.

>
> I'm playing again because the final 2.6.20.4 does NOT break amanda, where
> 2.6.20.4-rc1 did.

Yes only the original version I posted on this email thread was for an RSDL 
0.33 patched kernel. That original patch should build fine on i386 and x86_64 
(where I tried it). This version I sent out following Ingo's lead has 
2.6.21-rc4 in mind (without rsdl).

-- 
-ck
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] sched: accurate user accounting

2007-03-25 Thread Con Kolivas
On Sunday 25 March 2007 21:46, Con Kolivas wrote:
> On Sunday 25 March 2007 21:34, malc wrote:
> > On Sun, 25 Mar 2007, Ingo Molnar wrote:
> > > * Con Kolivas <[EMAIL PROTECTED]> wrote:
> > >> For an rsdl 0.33 patched kernel. Comments? Overhead worth it?
> > >
> > > we want to do this - and we should do this to the vanilla scheduler
> > > first and check the results. I've back-merged the patch to before RSDL
> > > and have tested it - find the patch below. Vale, could you try this
> > > patch against a 2.6.21-rc4-ish kernel and re-test your testcase?
> >
> > [..snip..]
> >
> > Compilation failed with:
> > kernel/built-in.o(.sched.text+0x564): more undefined references to
> > `__udivdi3' follow
> >
> > $ gcc --version | head -1
> > gcc (GCC) 3.4.6
> >
> > $ cat /proc/cpuinfo | grep cpu
> > cpu : 7447A, altivec supported
> >
> > Can't say i really understand why 64bit arithmetics suddenly became an
> > issue here.
>
> Probably due to use of:
>
> #define NS_TO_JIFFIES(TIME)	((TIME) / (1000000000 / HZ))
> #define JIFFIES_TO_NS(TIME)	((TIME) * (1000000000 / HZ))
>
> Excuse our 64bit world while we strive to correct our 32bit blindness and
> fix this bug.

Please try this (akpm please don't include till we confirm it builds on ppc,
sorry). For 2.6.21-rc4

---
Currently we only do cpu accounting to userspace based on what is 
actually happening precisely on each tick. The accuracy of that 
accounting gets progressively worse the lower HZ is. As we already keep 
accounting of nanosecond resolution we can accurately track user cpu, 
nice cpu and idle cpu if we move the accounting to update_cpu_clock with 
a nanosecond cpu_usage_stat entry. This increases overhead slightly but 
avoids the problem of tick aliasing errors making accounting unreliable.

Signed-off-by: Con Kolivas <[EMAIL PROTECTED]>
Signed-off-by: Ingo Molnar <[EMAIL PROTECTED]>
---
 include/linux/kernel_stat.h |3 ++
 include/linux/sched.h   |2 -
 kernel/sched.c  |   46 +---
 kernel/timer.c  |5 +---
 4 files changed, 49 insertions(+), 7 deletions(-)

Index: linux-2.6.21-rc4-acct/include/linux/kernel_stat.h
===
--- linux-2.6.21-rc4-acct.orig/include/linux/kernel_stat.h  2006-09-21 
19:54:58.0 +1000
+++ linux-2.6.21-rc4-acct/include/linux/kernel_stat.h   2007-03-25 
21:51:49.0 +1000
@@ -16,11 +16,14 @@
 
 struct cpu_usage_stat {
cputime64_t user;
+   cputime64_t user_ns;
cputime64_t nice;
+   cputime64_t nice_ns;
cputime64_t system;
cputime64_t softirq;
cputime64_t irq;
cputime64_t idle;
+   cputime64_t idle_ns;
cputime64_t iowait;
cputime64_t steal;
 };
Index: linux-2.6.21-rc4-acct/include/linux/sched.h
===
--- linux-2.6.21-rc4-acct.orig/include/linux/sched.h2007-03-21 
12:53:00.0 +1100
+++ linux-2.6.21-rc4-acct/include/linux/sched.h 2007-03-25 21:51:49.0 
+1000
@@ -882,7 +882,7 @@ struct task_struct {
int __user *clear_child_tid;/* CLONE_CHILD_CLEARTID */
 
unsigned long rt_priority;
-   cputime_t utime, stime;
+   cputime_t utime, utime_ns, stime;
unsigned long nvcsw, nivcsw; /* context switch counts */
struct timespec start_time;
 /* mm fault and swap info: this can arguably be seen as either mm-specific or 
thread-specific */
Index: linux-2.6.21-rc4-acct/kernel/sched.c
===
--- linux-2.6.21-rc4-acct.orig/kernel/sched.c   2007-03-21 12:53:00.0 
+1100
+++ linux-2.6.21-rc4-acct/kernel/sched.c2007-03-25 21:58:27.0 
+1000
@@ -89,6 +89,7 @@ unsigned long long __attribute__((weak))
  */
 #define NS_TO_JIFFIES(TIME)	((TIME) / (1000000000 / HZ))
 #define JIFFIES_TO_NS(TIME)	((TIME) * (1000000000 / HZ))
+#define JIFFY_NS   JIFFIES_TO_NS(1)
 
 /*
  * These are the 'tuning knobs' of the scheduler:
@@ -3017,8 +3018,49 @@ EXPORT_PER_CPU_SYMBOL(kstat);
 static inline void
 update_cpu_clock(struct task_struct *p, struct rq *rq, unsigned long long now)
 {
-   p->sched_time += now - p->last_ran;
+   struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
+   cputime64_t time_diff = now - p->last_ran;
+
+   p->sched_time += time_diff;
p->last_ran = rq->most_recent_timestamp = now;
+   if (p != rq->idle) {
+   cputime_t utime_diff = time_diff;
+
+   if (TASK_NICE(p) > 0) {
+   cpustat->nice_ns = cputime64_add(cpustat->nice_ns,
+   

Re: [patch] sched: accurate user accounting

2007-03-25 Thread Con Kolivas
On Sunday 25 March 2007 21:34, malc wrote:
> On Sun, 25 Mar 2007, Ingo Molnar wrote:
> > * Con Kolivas <[EMAIL PROTECTED]> wrote:
> >> For an rsdl 0.33 patched kernel. Comments? Overhead worth it?
> >
> > we want to do this - and we should do this to the vanilla scheduler
> > first and check the results. I've back-merged the patch to before RSDL
> > and have tested it - find the patch below. Vale, could you try this
> > patch against a 2.6.21-rc4-ish kernel and re-test your testcase?
>
> [..snip..]
>
> Compilation failed with:
> kernel/built-in.o(.sched.text+0x564): more undefined references to
> `__udivdi3' follow
>
> $ gcc --version | head -1
> gcc (GCC) 3.4.6
>
> $ cat /proc/cpuinfo | grep cpu
> cpu : 7447A, altivec supported
>
> Can't say i really understand why 64bit arithmetics suddenly became an
> issue here.

Probably due to use of:

#define NS_TO_JIFFIES(TIME)	((TIME) / (1000000000 / HZ))
#define JIFFIES_TO_NS(TIME)	((TIME) * (1000000000 / HZ))

Excuse our 64bit world while we strive to correct our 32bit blindness and fix 
this bug.
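
For reference, a minimal sketch of how such a divide is normally kept 
buildable on 32-bit architectures with the kernel's do_div() helper; the 
helper name ns_to_jiffies_example is made up for illustration:

	#include <asm/div64.h>

	/* gcc turns a plain 64-bit division into a call to __udivdi3 on
	 * 32-bit targets, which the kernel does not provide; do_div()
	 * divides the 64-bit value in place and returns the remainder. */
	static inline unsigned long ns_to_jiffies_example(unsigned long long ns)
	{
		do_div(ns, 1000000000 / HZ);	/* ns now holds the quotient */
		return (unsigned long)ns;
	}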

>
> Am i supposed to run the testcase and see if numbers in `/proc/stat'
> now match the reality closer? To be really accurate `/proc/stat'
> should be left alone methinks, because no matter how good you try the
> fundamential fact that time(and consequently load percentage) is not
> really devided in USER_HZ intervals will interfere with ones quest for
> accurate statistics. (Wonder what this patch will do to slightly modified
> hog that produced this: http://www.boblycat.org/~malc/apc/load-c2d-hog.png
> but this will have to wait till i get to the PC at work)

It should far more accurately represent the cpu usage without any userspace 
changes.

-- 
-ck
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] sched: accurate user accounting

2007-03-25 Thread Con Kolivas
On Sunday 25 March 2007 17:51, Ingo Molnar wrote:
> * Con Kolivas <[EMAIL PROTECTED]> wrote:
> > For an rsdl 0.33 patched kernel. Comments? Overhead worth it?
>
> we want to do this - and we should do this to the vanilla scheduler
> first and check the results. I've back-merged the patch to before RSDL
> and have tested it - find the patch below. Vale, could you try this
> patch against a 2.6.21-rc4-ish kernel and re-test your testcase?

Great. That should fix a lot of misconceptions about cpu usage and HZ.

However-

> +/*
> + * Some helpers for converting nanosecond timing to jiffy resolution
> + */
> +#define NS_TO_JIFFIES(TIME)	((TIME) / (1000000000 / HZ))
> +#define JIFFIES_TO_NS(TIME)	((TIME) * (1000000000 / HZ))
> +

This hunk is already in mainline so it will be double defined now.

Thanks.

-- 
-ck
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] sched: accurate user accounting

2007-03-25 Thread Con Kolivas
On Sunday 25 March 2007 17:51, Ingo Molnar wrote:
 * Con Kolivas [EMAIL PROTECTED] wrote:
  For an rsdl 0.33 patched kernel. Comments? Overhead worth it?

 we want to do this - and we should do this to the vanilla scheduler
 first and check the results. I've back-merged the patch to before RSDL
 and have tested it - find the patch below. Vale, could you try this
 patch against a 2.6.21-rc4-ish kernel and re-test your testcase?

Great. That should fix a lot of misconceptions about cpu usage and HZ.

However-

 +/*
 + * Some helpers for converting nanosecond timing to jiffy resolution
 + */
 +#define NS_TO_JIFFIES(TIME)   ((TIME) / (10 / HZ))
 +#define JIFFIES_TO_NS(TIME)   ((TIME) * (10 / HZ))
 +

This hunk is already in mainline so it will be double defined now.

Thanks.

-- 
-ck
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] sched: accurate user accounting

2007-03-25 Thread Con Kolivas
On Sunday 25 March 2007 21:34, malc wrote:
 On Sun, 25 Mar 2007, Ingo Molnar wrote:
  * Con Kolivas [EMAIL PROTECTED] wrote:
  For an rsdl 0.33 patched kernel. Comments? Overhead worth it?
 
  we want to do this - and we should do this to the vanilla scheduler
  first and check the results. I've back-merged the patch to before RSDL
  and have tested it - find the patch below. Vale, could you try this
  patch against a 2.6.21-rc4-ish kernel and re-test your testcase?

 [..snip..]

 Compilation failed with:
 kernel/built-in.o(.sched.text+0x564): more undefined references to
 `__udivdi3' follow

 $ gcc --version | head -1
 gcc (GCC) 3.4.6

 $ cat /proc/cpuinfo | grep cpu
 cpu : 7447A, altivec supported

 Can't say i really understand why 64bit arithmetics suddenly became an
 issue here.

Probably due to use of:

#define NS_TO_JIFFIES(TIME) ((TIME) / (10 / HZ))
#define JIFFIES_TO_NS(TIME) ((TIME) * (10 / HZ))

Excuse our 64bit world while we strive to correct our 32bit blindness and fix 
this bug.


 Am i supposed to run the testcase and see if numbers in `/proc/stat'
 now match the reality closer? To be really accurate `/proc/stat'
 should be left alone methinks, because no matter how good you try the
 fundamential fact that time(and consequently load percentage) is not
 really devided in USER_HZ intervals will interfere with ones quest for
 accurate statistics. (Wonder what this patch will do to slightly modified
 hog that produced this: http://www.boblycat.org/~malc/apc/load-c2d-hog.png
 but this will have to wait till i get to the PC at work)

It should far more accurately represent the cpu usage without any userspace 
changes.

-- 
-ck
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] sched: accurate user accounting

2007-03-25 Thread Con Kolivas
On Sunday 25 March 2007 21:46, Con Kolivas wrote:
 On Sunday 25 March 2007 21:34, malc wrote:
  On Sun, 25 Mar 2007, Ingo Molnar wrote:
   * Con Kolivas [EMAIL PROTECTED] wrote:
   For an rsdl 0.33 patched kernel. Comments? Overhead worth it?
  
   we want to do this - and we should do this to the vanilla scheduler
   first and check the results. I've back-merged the patch to before RSDL
   and have tested it - find the patch below. Vale, could you try this
   patch against a 2.6.21-rc4-ish kernel and re-test your testcase?
 
  [..snip..]
 
  Compilation failed with:
  kernel/built-in.o(.sched.text+0x564): more undefined references to
  `__udivdi3' follow
 
  $ gcc --version | head -1
  gcc (GCC) 3.4.6
 
  $ cat /proc/cpuinfo | grep cpu
  cpu : 7447A, altivec supported
 
  Can't say i really understand why 64bit arithmetics suddenly became an
  issue here.

 Probably due to use of:

 #define NS_TO_JIFFIES(TIME)   ((TIME) / (10 / HZ))
 #define JIFFIES_TO_NS(TIME)   ((TIME) * (10 / HZ))

 Excuse our 64bit world while we strive to correct our 32bit blindness and
 fix this bug.

Please try this (akpm please don't include till we confirm it builds on ppc,
sorry). For 2.6.21-rc4

---
Currently we only do cpu accounting to userspace based on what is 
actually happening precisely on each tick. The accuracy of that 
accounting gets progressively worse the lower HZ is. As we already keep 
accounting of nanosecond resolution we can accurately track user cpu, 
nice cpu and idle cpu if we move the accounting to update_cpu_clock with 
a nanosecond cpu_usage_stat entry. This increases overhead slightly but 
avoids the problem of tick aliasing errors making accounting unreliable.

Signed-off-by: Con Kolivas [EMAIL PROTECTED]
Signed-off-by: Ingo Molnar [EMAIL PROTECTED]
---
 include/linux/kernel_stat.h |3 ++
 include/linux/sched.h   |2 -
 kernel/sched.c  |   46 +---
 kernel/timer.c  |5 +---
 4 files changed, 49 insertions(+), 7 deletions(-)

Index: linux-2.6.21-rc4-acct/include/linux/kernel_stat.h
===
--- linux-2.6.21-rc4-acct.orig/include/linux/kernel_stat.h  2006-09-21 
19:54:58.0 +1000
+++ linux-2.6.21-rc4-acct/include/linux/kernel_stat.h   2007-03-25 
21:51:49.0 +1000
@@ -16,11 +16,14 @@
 
 struct cpu_usage_stat {
cputime64_t user;
+   cputime64_t user_ns;
cputime64_t nice;
+   cputime64_t nice_ns;
cputime64_t system;
cputime64_t softirq;
cputime64_t irq;
cputime64_t idle;
+   cputime64_t idle_ns;
cputime64_t iowait;
cputime64_t steal;
 };
Index: linux-2.6.21-rc4-acct/include/linux/sched.h
===
--- linux-2.6.21-rc4-acct.orig/include/linux/sched.h2007-03-21 
12:53:00.0 +1100
+++ linux-2.6.21-rc4-acct/include/linux/sched.h 2007-03-25 21:51:49.0 
+1000
@@ -882,7 +882,7 @@ struct task_struct {
int __user *clear_child_tid;/* CLONE_CHILD_CLEARTID */
 
unsigned long rt_priority;
-   cputime_t utime, stime;
+   cputime_t utime, utime_ns, stime;
unsigned long nvcsw, nivcsw; /* context switch counts */
struct timespec start_time;
 /* mm fault and swap info: this can arguably be seen as either mm-specific or 
thread-specific */
Index: linux-2.6.21-rc4-acct/kernel/sched.c
===
--- linux-2.6.21-rc4-acct.orig/kernel/sched.c   2007-03-21 12:53:00.0 
+1100
+++ linux-2.6.21-rc4-acct/kernel/sched.c2007-03-25 21:58:27.0 
+1000
@@ -89,6 +89,7 @@ unsigned long long __attribute__((weak))
  */
 #define NS_TO_JIFFIES(TIME)	((TIME) / (1000000000 / HZ))
 #define JIFFIES_TO_NS(TIME)	((TIME) * (1000000000 / HZ))
+#define JIFFY_NS   JIFFIES_TO_NS(1)
 
 /*
  * These are the 'tuning knobs' of the scheduler:
@@ -3017,8 +3018,49 @@ EXPORT_PER_CPU_SYMBOL(kstat);
 static inline void
 update_cpu_clock(struct task_struct *p, struct rq *rq, unsigned long long now)
 {
-	p->sched_time += now - p->last_ran;
+	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
+	cputime64_t time_diff = now - p->last_ran;
+
+	p->sched_time += time_diff;
 	p->last_ran = rq->most_recent_timestamp = now;
+	if (p != rq->idle) {
+		cputime_t utime_diff = time_diff;
+
+		if (TASK_NICE(p) > 0) {
+			cpustat->nice_ns = cputime64_add(cpustat->nice_ns,
+							 time_diff);
+			if (cpustat->nice_ns > JIFFY_NS) {
+				cpustat->nice_ns =
+					cputime64_sub(cpustat->nice_ns,
+					JIFFY_NS);
+				cpustat->nice

Re: [patch] sched: accurate user accounting

2007-03-25 Thread Con Kolivas
On Sunday 25 March 2007 22:32, Gene Heskett wrote:
 On Sunday 25 March 2007, Con Kolivas wrote:
 On Sunday 25 March 2007 21:46, Con Kolivas wrote:
  On Sunday 25 March 2007 21:34, malc wrote:
   On Sun, 25 Mar 2007, Ingo Molnar wrote:
* Con Kolivas [EMAIL PROTECTED] wrote:
For an rsdl 0.33 patched kernel. Comments? Overhead worth it?
   
we want to do this - and we should do this to the vanilla
scheduler first and check the results. I've back-merged the patch
to before RSDL and have tested it - find the patch below. Vale,
could you try this patch against a 2.6.21-rc4-ish kernel and
re-test your testcase?
  
   [..snip..]
  
   Compilation failed with:
   kernel/built-in.o(.sched.text+0x564): more undefined references to
   `__udivdi3' follow
  
   $ gcc --version | head -1
   gcc (GCC) 3.4.6
  
   $ cat /proc/cpuinfo | grep cpu
   cpu : 7447A, altivec supported
  
   Can't say i really understand why 64bit arithmetics suddenly became
   an issue here.
 
  Probably due to use of:
 
  #define NS_TO_JIFFIES(TIME)	((TIME) / (1000000000 / HZ))
  #define JIFFIES_TO_NS(TIME)	((TIME) * (1000000000 / HZ))
 
  Excuse our 64bit world while we strive to correct our 32bit blindness
  and fix this bug.
 
 Please try this (akpm please don't include till we confirm it builds on
  ppc, sorry). For 2.6.21-rc4
 
 ---
 Currently we only do cpu accounting to userspace based on what is
 actually happening precisely on each tick. The accuracy of that
 accounting gets progressively worse the lower HZ is. As we already keep
 accounting of nanosecond resolution we can accurately track user cpu,
 nice cpu and idle cpu if we move the accounting to update_cpu_clock with
 a nanosecond cpu_usage_stat entry. This increases overhead slightly but
 avoids the problem of tick aliasing errors making accounting unreliable.


 I'm playing again because the final 2.6.20.4 does NOT break amanda, where
 2.6.20.4-rc1 did.

Yes only the original version I posted on this email thread was for an RSDL 
0.33 patched kernel. That original patch should build fine on i386 and x86_64 
(where I tried it). This version I sent out following Ingo's lead has 
2.6.21-rc4 in mind (without rsdl).

-- 
-ck
-


Re: [patch] sched: accurate user accounting

2007-03-25 Thread Con Kolivas
On Sunday 25 March 2007 23:06, malc wrote:
 On Sun, 25 Mar 2007, Con Kolivas wrote:
  On Sunday 25 March 2007 21:46, Con Kolivas wrote:
  On Sunday 25 March 2007 21:34, malc wrote:
  On Sun, 25 Mar 2007, Ingo Molnar wrote:
  * Con Kolivas [EMAIL PROTECTED] wrote:
  For an rsdl 0.33 patched kernel. Comments? Overhead worth it?

 [..snip..]

  ---
  Currently we only do cpu accounting to userspace based on what is
  actually happening precisely on each tick. The accuracy of that
  accounting gets progressively worse the lower HZ is. As we already keep
  accounting of nanosecond resolution we can accurately track user cpu,
  nice cpu and idle cpu if we move the accounting to update_cpu_clock with
  a nanosecond cpu_usage_stat entry. This increases overhead slightly but
  avoids the problem of tick aliasing errors making accounting unreliable.
 
  Signed-off-by: Con Kolivas [EMAIL PROTECTED]
  Signed-off-by: Ingo Molnar [EMAIL PROTECTED]

 [..snip..]

 Forgot to mention. Given that this goes into the kernel, shouldn't
 Documentation/cpu-load.txt be amended/removed?

Yes, that's a good idea. There should also be a sanity check, because sometimes,
for a reason no one has been able to explain to me, sched_clock gives a value
which doesn't make sense (time appears to have gone backwards), and that will
completely ruin the accounting from then on.
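
A rough sketch of the kind of guard meant here (wording mine; the real check appears in the follow-up patch later in this thread): skip the nanosecond bookkeeping for a sample whose timestamp appears to have gone backwards, so one bad sched_clock() reading cannot poison the accumulated counters:

	/* assumed context: p and now as in update_cpu_clock() */
	cputime64_t time_diff;

	if (unlikely(now < p->last_ran)) {
		p->last_ran = now;	/* resynchronise and drop the bogus sample */
		return;
	}
	time_diff = now - p->last_ran;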

-- 
-ck
-


Re: [patch] sched: accurate user accounting

2007-03-25 Thread Con Kolivas
On Monday 26 March 2007 00:57, malc wrote:
 On Mon, 26 Mar 2007, Con Kolivas wrote:
  On Sunday 25 March 2007 23:06, malc wrote:
  On Sun, 25 Mar 2007, Con Kolivas wrote:
  On Sunday 25 March 2007 21:46, Con Kolivas wrote:
  On Sunday 25 March 2007 21:34, malc wrote:
  On Sun, 25 Mar 2007, Ingo Molnar wrote:
  * Con Kolivas [EMAIL PROTECTED] wrote:
  For an rsdl 0.33 patched kernel. Comments? Overhead worth it?
 
  [..snip..]
 
  ---
  Currently we only do cpu accounting to userspace based on what is
  actually happening precisely on each tick. The accuracy of that
  accounting gets progressively worse the lower HZ is. As we already keep
  accounting of nanosecond resolution we can accurately track user cpu,
  nice cpu and idle cpu if we move the accounting to update_cpu_clock
  with a nanosecond cpu_usage_stat entry. This increases overhead
  slightly but avoids the problem of tick aliasing errors making
  accounting unreliable.
 
  Signed-off-by: Con Kolivas [EMAIL PROTECTED]
  Signed-off-by: Ingo Molnar [EMAIL PROTECTED]
 
  [..snip..]
 
  Forgot to mention. Given that this goes into the kernel, shouldn't
  Documentation/cpu-load.txt be amended/removed?
 
  Yes that's a good idea. Also there should be a sanity check because
  sometimes for some reason noone's been able to explain to me sched_clock
  gives a value which doesn't make sense (time appears to have gone
  backwards) and that will completely ruin the accounting from then on.

 After running this new kernel for a while i guess i have hit this issue:
 http://www.boblycat.org/~malc/apc/bad-load.png

 Top and icewm's monitor do show incredibly huge load while in reality
 nothing like that is really happening. Both ad-hoc and `/proc/stat' (idle)
 show normal CPU utilization (7% since i'm doing some A/V stuff in the
 background)

Yes I'd say you hit the problem I described earlier. When playing with
sched_clock() I found it gave some interesting results fairly infrequently.
They could lead to ridiculous accounting mistakes.

So before we go any further with this patch, can you try the following one and 
see if this simple sanity check is enough?

Thanks!

---
Currently we only do cpu accounting to userspace based on what is 
actually happening precisely on each tick. The accuracy of that 
accounting gets progressively worse the lower HZ is. As we already keep 
accounting of nanosecond resolution we can accurately track user cpu, 
nice cpu and idle cpu if we move the accounting to update_cpu_clock with 
a nanosecond cpu_usage_stat entry. This increases overhead slightly but 
avoids the problem of tick aliasing errors making accounting unreliable.

Remove the now defunct Documentation/cpu-load.txt file.

Signed-off-by: Con Kolivas [EMAIL PROTECTED]
---
 Documentation/cpu-load.txt  |  113 
 include/linux/kernel_stat.h |3 +
 include/linux/sched.h   |2 
 kernel/sched.c  |   50 ++-
 kernel/timer.c  |5 -
 5 files changed, 53 insertions(+), 120 deletions(-)

Index: linux-2.6.21-rc4-acct/include/linux/kernel_stat.h
===
--- linux-2.6.21-rc4-acct.orig/include/linux/kernel_stat.h  2007-03-26 
00:56:05.0 +1000
+++ linux-2.6.21-rc4-acct/include/linux/kernel_stat.h   2007-03-26 
00:56:25.0 +1000
@@ -16,11 +16,14 @@
 
 struct cpu_usage_stat {
cputime64_t user;
+   cputime64_t user_ns;
cputime64_t nice;
+   cputime64_t nice_ns;
cputime64_t system;
cputime64_t softirq;
cputime64_t irq;
cputime64_t idle;
+   cputime64_t idle_ns;
cputime64_t iowait;
cputime64_t steal;
 };
Index: linux-2.6.21-rc4-acct/include/linux/sched.h
===
--- linux-2.6.21-rc4-acct.orig/include/linux/sched.h2007-03-26 
00:56:05.0 +1000
+++ linux-2.6.21-rc4-acct/include/linux/sched.h 2007-03-26 00:57:01.0 
+1000
@@ -882,7 +882,7 @@ struct task_struct {
int __user *clear_child_tid;/* CLONE_CHILD_CLEARTID */
 
unsigned long rt_priority;
-   cputime_t utime, stime;
+   cputime_t utime, utime_ns, stime;
unsigned long nvcsw, nivcsw; /* context switch counts */
struct timespec start_time;
 /* mm fault and swap info: this can arguably be seen as either mm-specific or 
thread-specific */
Index: linux-2.6.21-rc4-acct/kernel/sched.c
===
--- linux-2.6.21-rc4-acct.orig/kernel/sched.c   2007-03-26 00:56:05.0 
+1000
+++ linux-2.6.21-rc4-acct/kernel/sched.c2007-03-26 01:01:22.0 
+1000
@@ -89,6 +89,7 @@ unsigned long long __attribute__((weak))
  */
 #define NS_TO_JIFFIES(TIME)	((TIME) / (1000000000 / HZ))
 #define JIFFIES_TO_NS(TIME)	((TIME) * (1000000000 / HZ))
+#define JIFFY_NS	JIFFIES_TO_NS(1)

Re: [patch] sched: accurate user accounting

2007-03-25 Thread Con Kolivas
On Monday 26 March 2007 01:19, malc wrote:
 On Mon, 26 Mar 2007, Con Kolivas wrote:
  So before we go any further with this patch, can you try the following
  one and see if this simple sanity check is enough?

 Sure (compiling the kernel now); too bad the old axiom that testing cannot
 confirm the absence of bugs holds.

 I have one nit and one request for clarification. Question first (i
 admit i haven't looked at the surroundings of the patch; maybe things
 would have been self-evident if i had):

 What this patch amounts to is that the accounting logic is moved from
 timer interrupt to the place where scheduler switches task (or something
 to that effect)?

Both the scheduler tick and context switch now. So yes it adds overhead as I 
said, although we already do update_cpu_clock on context switch, but it's not 
this complex.

 [..snip..]

   * These are the 'tuning knobs' of the scheduler:
  @@ -3017,8 +3018,53 @@ EXPORT_PER_CPU_SYMBOL(kstat);
  static inline void
  update_cpu_clock(struct task_struct *p, struct rq *rq, unsigned long long
  now) {
  -	p->sched_time += now - p->last_ran;
  +	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
  +	cputime64_t time_diff;
  +
  	p->last_ran = rq->most_recent_timestamp = now;
  +	/* Sanity check. It should never go backwards or ruin accounting */
  +	if (unlikely(now < p->last_ran))
  +		return;
  +	time_diff = now - p->last_ran;

 A nit. Anything wrong with:

  time_diff = now - p->last_ran;
  if (unlikely (LESS_THAN_ZERO (time_diff)))
  	return;

Does LESS_THAN_ZERO work on a cputime64_t on all arches? I can't figure that 
out just by looking myself which is why I did it the other way.
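
For context (my reading of the generic headers, not something spelled out in the thread): with asm-generic/cputime.h, cputime64_t is a plain unsigned 64-bit integer, so a "negative" difference simply wraps around to a huge positive value and a LESS_THAN_ZERO()-style test on the result cannot work portably; the comparison has to be made on the operands before subtracting:

	u64 now = 1000, last_ran = 2000;
	u64 diff = now - last_ran;	/* wraps to 0xfffffffffffffc18, never "negative" */

	if (now < last_ran)		/* so compare the operands instead */
		return;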

-- 
-ck
-


Re: debug rsdl 0.33

2007-03-25 Thread Con Kolivas
On Monday 26 March 2007 04:28, Torsten Kaiser wrote:
 On 3/24/07, Con Kolivas [EMAIL PROTECTED] wrote:
   kernel/sched.c |   51
  +++ 1 file changed, 51
  insertions(+)

 2.6.21-rc4-mm1 also fails for me.

 I tried pure 2.6.21-rc4-mm1, +hotfixes, +hotfixes+rsdl33 and at last
 also added above debug patch.

Thank you very much for the effort!

 The oops from with the debug-patch added:
 [   65.426126] Freeing unused kernel memory: 312k freed
 (on the console the system is starting up, getting as far as "Letting udev
 process events ...")
 [   66.665611] Unable to handle kernel NULL pointer dereference at
 0020 RIP:
 [   66.682030]  [8026167c] __sched_text_start+0x4dc/0xa0e

The debug patch didn't report anything. This means it is not an unset bitmap 
problem at all, otherwise the debug check would have re-set the missing bit and 
the problem would have corrected itself.

 The system is x86_64, two 2218s on an MCP55 nvidia chipset.

 2.6.21-rc3-mm1 works fine.

 (gdb) list *0x8026167c
 0x8026167c is in schedule (kernel/sched.c:3619).

 	next = list_entry(queue->next, struct task_struct, run_list);
 	rq->prio_level = idx;

 3614	/*
 3615	 * When the task is chosen it is checked to see if its quota has been
 3616	 * added to this runqueue level which is only performed once per
 3617	 * level per major rotation for each running task.
 3618	 */
 3619	if (next->rotation != rq->prio_rotation) {

Urgh. Dereferencing there? That can only be next that's dereferencing, meaning 
the run_list entry is bogus. That should only ever be done under the runqueue 
lock, so I have a race somewhere where it's not. Time for more looking.

 Torsten

Thanks!

-- 
-ck
-


Re: debug rsdl 0.33

2007-03-25 Thread Con Kolivas
On Monday 26 March 2007 08:49, Con Kolivas wrote:
 On Monday 26 March 2007 04:28, Torsten Kaiser wrote:
  On 3/24/07, Con Kolivas [EMAIL PROTECTED] wrote:
kernel/sched.c |   51
   +++ 1 file changed, 51
   insertions(+)
 
  2.6.21-rc4-mm1 also fails for me.
 
  I tried pure 2.6.21-rc4-mm1, +hotfixes, +hotfixes+rsdl33 and at last
  also added above debug patch.

 Thank you very much for the effort!

  The oops from with the debug-patch added:
  [   65.426126] Freeing unused kernel memory: 312k freed
  (on the console the system is starting up, getting until Letting udev
  process events ...)
  [   66.665611] Unable to handle kernel NULL pointer dereference at
  0020 RIP:
  [   66.682030]  [8026167c] __sched_text_start+0x4dc/0xa0e

 The debug patch didn't do anything. This means it is not an unset bitmap
 problem at all otherwise it should have self corrected itself.

  The system in x86_64, two 2218 on a MCP55 nvidia chipset.
 
  2.6.21-rc3-mm1 works fine.
 
  (gdb) list *0x8026167c
  0x8026167c is in schedule (kernel/sched.c:3619).

   	next = list_entry(queue->next, struct task_struct, run_list);
   	rq->prio_level = idx;

  3614	/*
  3615	 * When the task is chosen it is checked to see if its quota has been
  3616	 * added to this runqueue level which is only performed once per
  3617	 * level per major rotation for each running task.
  3618	 */
  3619	if (next->rotation != rq->prio_rotation) {

 Urgh. Dereferencing there? That can only be next that's dereferencing, meaning
 the run_list entry is bogus. That should only ever be done under the runqueue
 lock, so I have a race somewhere where it's not. Time for more looking.

This is about the only place I can see the run_list is looked at unlocked. Can
you see if this simple patch helps? The debug patch is unnecessary now.

Thanks!

--
Ensure checking task_queued() is only done under runqueue lock.

Signed-off-by: Con Kolivas [EMAIL PROTECTED]

---
 kernel/sched.c |   10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

Index: linux-2.6.21-rc4-mm1/kernel/sched.c
===
--- linux-2.6.21-rc4-mm1.orig/kernel/sched.c2007-03-26 08:54:15.0 
+1000
+++ linux-2.6.21-rc4-mm1/kernel/sched.c 2007-03-26 08:55:21.0 +1000
@@ -3421,16 +3421,16 @@ static inline void rotate_runqueue_prior
 
 static void task_running_tick(struct rq *rq, struct task_struct *p, int tick)
 {
-   if (unlikely(!task_queued(p))) {
-   /* Task has expired but was not scheduled yet */
-   set_tsk_need_resched(p);
-   return;
-   }
/* SCHED_FIFO tasks never run out of timeslice. */
 	if (unlikely(p->policy == SCHED_FIFO))
 		return;

 	spin_lock(&rq->lock);
+   if (unlikely(!task_queued(p))) {
+   /* Task has expired but was not scheduled off yet */
+   set_tsk_need_resched(p);
+   goto out_unlock;
+   }
/*
 * Accounting is performed by both the task and the runqueue. This
 * allows frequently sleeping tasks to get their proper quota of


-- 
-ck
-


Re: [patch] sched: accurate user accounting

2007-03-25 Thread Con Kolivas
On Monday 26 March 2007 03:14, malc wrote:
 On Mon, 26 Mar 2007, Con Kolivas wrote:
  On Monday 26 March 2007 01:19, malc wrote:
  On Mon, 26 Mar 2007, Con Kolivas wrote:
  So before we go any further with this patch, can you try the following
  one and see if this simple sanity check is enough?
 
  Sure (compiling the kernel now), too bad old axiom that testing can not
  confirm absence of bugs holds.
 
  I have one nit and one request from clarification. Question first (i
  admit i haven't looked at the surroundings of the patch maybe things
  would have been are self evident if i did):
 
  What this patch amounts to is that the accounting logic is moved from
  timer interrupt to the place where scheduler switches task (or something
  to that effect)?
 
  Both the scheduler tick and context switch now. So yes it adds overhead
  as I said, although we already do update_cpu_clock on context switch, but
  it's not this complex.
 
  [..snip..]
 
   * These are the 'tuning knobs' of the scheduler:
  @@ -3017,8 +3018,53 @@ EXPORT_PER_CPU_SYMBOL(kstat);
  static inline void
  update_cpu_clock(struct task_struct *p, struct rq *rq, unsigned long
  long now) {
  -	p->sched_time += now - p->last_ran;
  +	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
  +	cputime64_t time_diff;
  +
  	p->last_ran = rq->most_recent_timestamp = now;
  +	/* Sanity check. It should never go backwards or ruin accounting */
  +	if (unlikely(now < p->last_ran))
  +		return;
  +	time_diff = now - p->last_ran;
 
  A nit. Anything wrong with:
 
   time_diff = now - p->last_ran;
   if (unlikely (LESS_THAN_ZERO (time_diff)))
   	return;
 
  Does LESS_THAN_ZERO work on a cputime64_t on all arches? I can't figure
  that out just by looking myself which is why I did it the other way.

 I have no idea what type cputime64_t really is, so used this imaginary
 LESS_THAN_ZERO thing.

 Erm... i just looked at the code and suddenly it stopped making any sense
 at all:

  	p->last_ran = rq->most_recent_timestamp = now;
  	/* Sanity check. It should never go backwards or ruin accounting */
  	if (unlikely(now < p->last_ran))
  		return;
  	time_diff = now - p->last_ran;

 First `now' is assigned to `p->last_ran' and the very next line
 compares those two values, and then the difference is taken.. I quite
 frankly am either very tired or fail to see the point.. time_diff is
 either always zero or there's always a race here.

Bah major thinko error on my part! That will teach me to post patches untested 
at 1:30 am. I'll try again shortly sorry.
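
The fix, as the corrected patch below shows, is simply to do the backwards check and take the difference before last_ran is overwritten; roughly (out_set here is my shorthand for the tail of the function that stamps the new timestamps):

	if (unlikely(now < p->last_ran))
		goto out_set;
	time_diff = now - p->last_ran;
	/* ... charge time_diff to the right buckets ... */
out_set:
	p->last_ran = rq->most_recent_timestamp = now;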

-- 
-ck
-


Re: [patch] sched: accurate user accounting

2007-03-25 Thread Con Kolivas
On Monday 26 March 2007 09:01, Con Kolivas wrote:
 On Monday 26 March 2007 03:14, malc wrote:
  On Mon, 26 Mar 2007, Con Kolivas wrote:
   On Monday 26 March 2007 01:19, malc wrote:
  Erm... i just looked at the code and suddenly it stopped making any sense
  at all:
 
    	p->last_ran = rq->most_recent_timestamp = now;
    	/* Sanity check. It should never go backwards or ruin accounting */
    	if (unlikely(now < p->last_ran))
    		return;
    	time_diff = now - p->last_ran;
 
   First `now' is assigned to `p->last_ran' and the very next line
  compares those two values, and then the difference is taken.. I quite
  frankly am either very tired or fail to see the point.. time_diff is
  either always zero or there's always a race here.

 Bah major thinko error on my part! That will teach me to post patches
 untested at 1:30 am. I'll try again shortly sorry.

Ok this one is heavily tested. Please try it when you find the time.

---
Currently we only do cpu accounting to userspace based on what is 
actually happening precisely on each tick. The accuracy of that 
accounting gets progressively worse the lower HZ is. As we already keep 
accounting of nanosecond resolution we can accurately track user cpu, 
nice cpu and idle cpu if we move the accounting to update_cpu_clock with 
a nanosecond cpu_usage_stat entry. This increases overhead slightly but 
avoids the problem of tick aliasing errors making accounting unreliable.

Remove the now defunct Documentation/cpu-load.txt file.

Signed-off-by: Con Kolivas [EMAIL PROTECTED]

---
 Documentation/cpu-load.txt  |  113 
 include/linux/kernel_stat.h |3 +
 include/linux/sched.h   |2 
 kernel/sched.c  |   58 +-
 kernel/timer.c  |5 -
 5 files changed, 60 insertions(+), 121 deletions(-)

Index: linux-2.6.21-rc4-acct/include/linux/kernel_stat.h
===
--- linux-2.6.21-rc4-acct.orig/include/linux/kernel_stat.h  2007-03-26 
00:56:05.0 +1000
+++ linux-2.6.21-rc4-acct/include/linux/kernel_stat.h   2007-03-26 
00:56:25.0 +1000
@@ -16,11 +16,14 @@
 
 struct cpu_usage_stat {
cputime64_t user;
+   cputime64_t user_ns;
cputime64_t nice;
+   cputime64_t nice_ns;
cputime64_t system;
cputime64_t softirq;
cputime64_t irq;
cputime64_t idle;
+   cputime64_t idle_ns;
cputime64_t iowait;
cputime64_t steal;
 };
Index: linux-2.6.21-rc4-acct/include/linux/sched.h
===
--- linux-2.6.21-rc4-acct.orig/include/linux/sched.h2007-03-26 
00:56:05.0 +1000
+++ linux-2.6.21-rc4-acct/include/linux/sched.h 2007-03-26 00:57:01.0 
+1000
@@ -882,7 +882,7 @@ struct task_struct {
int __user *clear_child_tid;/* CLONE_CHILD_CLEARTID */
 
unsigned long rt_priority;
-   cputime_t utime, stime;
+   cputime_t utime, utime_ns, stime;
unsigned long nvcsw, nivcsw; /* context switch counts */
struct timespec start_time;
 /* mm fault and swap info: this can arguably be seen as either mm-specific or 
thread-specific */
Index: linux-2.6.21-rc4-acct/kernel/sched.c
===
--- linux-2.6.21-rc4-acct.orig/kernel/sched.c   2007-03-26 00:56:05.0 
+1000
+++ linux-2.6.21-rc4-acct/kernel/sched.c2007-03-26 09:38:50.0 
+1000
@@ -89,6 +89,7 @@ unsigned long long __attribute__((weak))
  */
 #define NS_TO_JIFFIES(TIME)	((TIME) / (1000000000 / HZ))
 #define JIFFIES_TO_NS(TIME)	((TIME) * (1000000000 / HZ))
+#define JIFFY_NS   JIFFIES_TO_NS(1)
 
 /*
  * These are the 'tuning knobs' of the scheduler:
@@ -3017,8 +3018,59 @@ EXPORT_PER_CPU_SYMBOL(kstat);
 static inline void
 update_cpu_clock(struct task_struct *p, struct rq *rq, unsigned long long now)
 {
-	p->sched_time += now - p->last_ran;
-	p->last_ran = rq->most_recent_timestamp = now;
+	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
+	cputime64_t time_diff;
+
+	/* Sanity check. It should never go backwards or ruin accounting */
+	if (unlikely(now < p->last_ran))
+		goto out_set;
+	/* All the userspace visible cpu accounting is done here */
+	time_diff = now - p->last_ran;
+	p->sched_time += time_diff;
+	if (p != rq->idle) {
+		cputime_t utime_diff = time_diff;
+
+		if (TASK_NICE(p) > 0) {
+			cpustat->nice_ns = cputime64_add(cpustat->nice_ns,
+							 time_diff);
+			if (cpustat->nice_ns > JIFFY_NS) {
+				cpustat->nice_ns =
+					cputime64_sub(cpustat->nice_ns,
+					JIFFY_NS

rSDl cpu scheduler version 0.34-test patch

2007-03-25 Thread Con Kolivas
This is just for testing at the moment! The reason is the size of this patch.

In the interest of evolution, I've taken the RSDL cpu scheduler and increased 
the resolution of the task timekeeping to nanosecond resolution. This removes 
the need for the runqueue rotation component entirely out of RSDL. The design 
basically is mostly unchanged, minus over 150 lines of code for the rotation, 
yet should be slightly better performing. It should be indistinguishable in 
usage from v0.33.

Other changes from v0.33:
-rr interval was not being properly scaled with HZ (see the sketch below this list)
-fix possible race in checking task_queued in task_running_tick
-scale down rr interval for niced tasks if HZ can tolerate it
-cull list_splice_tail
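
On the rr interval scaling noted in the first item, a sketch of what it means in practice (the helper name and the millisecond figure are mine, not taken from the patch): an interval specified in milliseconds has to be converted to jiffies with HZ and must not round down to zero ticks on low-HZ configurations:

#define RR_INTERVAL_MS	6	/* hypothetical default, in milliseconds */

static inline int rr_interval_jiffies(void)
{
	int rr = RR_INTERVAL_MS * HZ / 1000;

	return rr ? rr : 1;	/* never less than one tick when HZ is low */
}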

What does this mean for the next version of RSDL?

Assuming all works as expected on these test patches, it will be cleanest to 
submit a new series of patches for -mm with the renamed Staircase-Deadline 
scheduler and new documentation (when it's done).


So for testing here are full rollups for 2.6.20.4 and 2.6.21-rc4:
http://ck.kolivas.org/patches/staircase-deadline/2.6.20.4-sd-0.34-test.patch
http://ck.kolivas.org/patches/staircase-deadline/2.6.21-rc4-sd-0.34-test.patch

The patches available also include a rollup of sched: accurate user accounting 
as this code touches the same area and it is most convenient to include them 
together.

(incrementals in each subdir of staircase-deadline/ for those interested).

Thanks Mike for continuing to attempt to use the cluebat on me on this one. 
From the start I wasn't sure whether this was necessary or not, but it ends up 
being less code than RSDL.

While I'm still far from being well, luckily I am in much better shape to be 
able to spend the time at the pc to have done this. Thanks to all those who 
expressed their concern.

-- 
-ck
-


Re: RSDL 0.31 causes slowdown

2007-03-25 Thread Con Kolivas
On Saturday 24 March 2007 04:57, Tim Chen wrote:
 On Fri, 2007-03-23 at 13:40 +1100, Con Kolivas wrote:
  Volanomark is a purely yield() semantic dependant workload (as
  discussed many times previously). In the earlier form of RSDL I
  softened the effect of sched_yield but other changes since then have
  made that softness bordering on a noop. Obviously when sched_yield is
  relied upon that will not be enough. Extending the rr interval simply
  makes the yield slightly more effective and is not the proper
  workaround. Since expiration of arrays is a regular frequent
  occurrence in RSDL then changing yield semantics back to expiration
  should cause a massive improvement in these values, without making the
  yields as long as in mainline. It's impossible to know exactly what
  the final result will be since java uses this timing sensitive yield
  for locking but we can improve it drastically from this. I'll make a
  patch soon to change yield again.

 Con,

 The new RSDL 0.33 has fully recovered the loss in performance for
 Volanomark.  The throughput for Volanomark is at the same level as
 mainline 2.6.21-rc4 kernel.

 Tim

Thanks very much for testing. I'm quite happy with the yield semantics staying 
the way they are in rSDl 0.33+.
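
For readers following the yield discussion quoted above, a rough sketch (not the actual RSDL code; the helper bodies are assumed from the rest of this thread) of the "yield means expire" semantic that was weighed against the current softer requeue: the yielding task is treated as though its timeslice ran out, so it moves to the expired array and will not run again until the arrays swap:

static void yield_task(struct task_struct *p, struct rq *rq)
{
	dequeue_task(p, rq);		/* off its current queue */
	p->time_slice = p->quota;	/* fresh quota for the next rotation */
	p->array = rq->expired;		/* queue it as if its slice expired */
	enqueue_task(p, rq);
}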

-- 
-ck
-


Re: [PATCH] [RFC] sched: accurate user accounting

2007-03-24 Thread Con Kolivas
On Sunday 25 March 2007 11:59, Con Kolivas wrote:
> For an rsdl 0.33 patched kernel. Comments? Overhead worth it?
>
> ---
> Currently we only do cpu accounting to userspace based on what is actually
> happening precisely on each tick. The accuracy of that accounting gets
> progressively worse the lower HZ is. As we already keep accounting of
> nanosecond resolution we can accurately track user cpu, nice cpu and idle
> cpu if we move the accounting to update_cpu_clock with a nanosecond
> cpu_usage_stat entry. This increases overhead slightly but avoids the
> problem of tick aliasing errors making accounting unreliable.

Vale, this fixes your testcase you sent. Attached below for reference. 

P.S. Sorry about one of the cc email addresses in the first email; I succumbed 
to a silly practical joke unwittingly so you'll have to remove it when 
replying to all.

/* gcc -o hog smallhog.c */
#include <time.h>
#include <limits.h>
#include <signal.h>
#include <sys/time.h>

#define HIST 10

static sig_atomic_t stop;

static void sighandler (int signr)
{
 (void) signr;
 stop = 1;
}

static unsigned long hog (unsigned long niters)
{
 stop = 0;
 while (!stop && --niters);
 return niters;
}

int main (void)
{
 int i;
 struct itimerval it = { .it_interval = { .tv_sec = 0, .tv_usec = 1 },
 .it_value = { .tv_sec = 0, .tv_usec = 1 } };
 sigset_t set;
 unsigned long v[HIST];
 double tmp = 0.0;
 unsigned long n;

 signal (SIGALRM, sighandler);
 setitimer (ITIMER_REAL, &it, NULL);

 for (i = 0; i < HIST; ++i) v[i] = ULONG_MAX - hog (ULONG_MAX);
 for (i = 0; i < HIST; ++i) tmp += v[i];
 tmp /= HIST;
 n = tmp - (tmp / 3.0);

 sigemptyset (&set);
 sigaddset (&set, SIGALRM);

 for (;;) {
 hog (n);
 sigwait (&set, &i);
 }
 return 0;
}
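
In short, the hog above first calibrates how much spinning fits in one timer period, then each iteration burns roughly two thirds of a period and sleeps in sigwait() until the next SIGALRM, so it is rarely on the CPU when the accounting tick samples it; exactly the kind of load the old tick-based /proc/stat accounting missed.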

-- 
-ck
-


[PATCH] [RFC] sched: accurate user accounting

2007-03-24 Thread Con Kolivas
For an rsdl 0.33 patched kernel. Comments? Overhead worth it?

---
Currently we only do cpu accounting to userspace based on what is actually
happening precisely on each tick. The accuracy of that accounting gets
progressively worse the lower HZ is. As we already keep accounting of
nanosecond resolution we can accurately track user cpu, nice cpu and idle cpu
if we move the accounting to update_cpu_clock with a nanosecond cpu_usage_stat
entry. This increases overhead slightly but avoids the problem of tick
aliasing errors making accounting unreliable.

Signed-off-by: Con Kolivas <[EMAIL PROTECTED]>

---
 include/linux/kernel_stat.h |3 ++
 include/linux/sched.h   |2 -
 kernel/sched.c  |   51 +---
 kernel/timer.c  |5 +---
 4 files changed, 54 insertions(+), 7 deletions(-)

Index: linux-2.6.20.4-ck1/include/linux/kernel_stat.h
===
--- linux-2.6.20.4-ck1.orig/include/linux/kernel_stat.h 2007-03-25 
09:47:52.0 +1000
+++ linux-2.6.20.4-ck1/include/linux/kernel_stat.h  2007-03-25 
11:31:29.0 +1000
@@ -16,11 +16,14 @@
 
 struct cpu_usage_stat {
cputime64_t user;
+   cputime64_t user_ns;
cputime64_t nice;
+   cputime64_t nice_ns;
cputime64_t system;
cputime64_t softirq;
cputime64_t irq;
cputime64_t idle;
+   cputime64_t idle_ns;
cputime64_t iowait;
cputime64_t steal;
 };
Index: linux-2.6.20.4-ck1/kernel/sched.c
===
--- linux-2.6.20.4-ck1.orig/kernel/sched.c  2007-03-25 09:47:56.0 
+1000
+++ linux-2.6.20.4-ck1/kernel/sched.c   2007-03-25 11:42:28.0 +1000
@@ -77,6 +77,11 @@
 #define MAX_USER_PRIO  (USER_PRIO(MAX_PRIO))
 #define SCHED_PRIO(p)  ((p)+MAX_RT_PRIO)
 
+/*
+ * Some helpers for converting nanosecond timing to jiffy resolution
+ */
+#define NS_TO_JIFFIES(TIME)	((TIME) / (1000000000 / HZ))
+#define JIFFIES_TO_NS(TIME)	((TIME) * (1000000000 / HZ))
 #define TASK_PREEMPTS_CURR(p, curr)((p)->prio < (curr)->prio)
 
 /*
@@ -2993,8 +2998,50 @@ EXPORT_PER_CPU_SYMBOL(kstat);
 static inline void
 update_cpu_clock(struct task_struct *p, struct rq *rq, unsigned long long now)
 {
-   p->sched_time += now - p->last_ran;
+	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
+   cputime64_t time_diff = now - p->last_ran;
+
+   p->sched_time += time_diff;
p->last_ran = rq->most_recent_timestamp = now;
+   if (p != rq->idle) {
+   cputime_t utime_diff = time_diff;
+
+   if (TASK_NICE(p) > 0) {
+   cpustat->nice_ns = cputime64_add(cpustat->nice_ns,
+time_diff);
+   if (NS_TO_JIFFIES(cpustat->nice_ns) > 1) {
+   cpustat->nice_ns =
+   cputime64_sub(cpustat->nice_ns,
+   JIFFIES_TO_NS(1));
+   cpustat->nice =
+   cputime64_add(cpustat->nice, 1);
+   }
+   } else {
+   cpustat->user_ns = cputime64_add(cpustat->user_ns,
+   time_diff);
+   if (NS_TO_JIFFIES(cpustat->user_ns) > 1) {
+   cpustat->user_ns =
+   cputime64_sub(cpustat->user_ns,
+   JIFFIES_TO_NS(1));
+   cpustat ->user =
+   cputime64_add(cpustat->user, 1);
+   }
+   }
+   p->utime_ns = cputime_add(p->utime_ns, utime_diff);
+   if (NS_TO_JIFFIES(p->utime_ns) > 1) {
+   p->utime_ns = cputime_sub(p->utime_ns,
+ JIFFIES_TO_NS(1));
+   p->utime = cputime_add(p->utime,
+  jiffies_to_cputime(1));
+   }
+   } else {
+   cpustat->idle_ns = cputime64_add(cpustat->idle_ns, time_diff);
+   if (NS_TO_JIFFIES(cpustat->idle_ns) > 1) {
+   cpustat->idle_ns = cputime64_sub(cpustat->idle_ns,
+JIFFIES_TO_NS(1));
+   cpustat->idle = cputime64_add(cpustat->idle, 1);
+   }
+   }
 }
 
 /*
@@ -3059,8 +3106,6 @@ void account_system_time(struct task_str
 		cpustat->system = cputime64_add(cpustat->system, tmp);
 	else if (atomic_read(&rq->nr_iowait) > 0)
 		cpustat->iowait = cputime64_add(cpustat->iowait, tmp);
-	else
-		cpustat->idle = cputime64_add(cpustat->idle, tmp);
 	/* Account for system time used


debug rsdl 0.33

2007-03-23 Thread Con Kolivas
On Saturday 24 March 2007 08:45, Con Kolivas wrote:
> On Friday 23 March 2007 23:28, Andy Whitcroft wrote:
> > Andy Whitcroft wrote:
> > > Con Kolivas wrote:
> > >> On Friday 23 March 2007 05:17, Andy Whitcroft wrote:
> > >>> Ok, I have yet a third x86_64 machine that is blowing up with the
> > >>> latest 2.6.21-rc4-mm1+hotfixes+rsdl-0.32 but working with
> > >>> 2.6.21-rc4-mm1+hotfixes-RSDL.  I have results on various hotfix
> > >>> levels so I have just fired off a set of tests across the affected
> > >>> machines on that latest hotfix stack plus the RSDL backout and the
> > >>> results should be in in the next hour or two.
> > >>>
> > >>> I think there is a strong correlation between RSDL and these hangs.
> > >>> Any suggestions as to the next step.
> > >>
> > >> Found a nasty in requeue_task
> > >> +if (list_empty(old_array->queue + old_prio))
> > >> +__clear_bit(old_prio, p->array->prio_bitmap);
> > >>
> > >> see anything wrong there? I do :P
> > >>
> > >> I'll queue that up with the other changes pending and hopefully that
> > >> will fix your bug.
> > >
> > > Tests queued with your rdsl-0.33 patch (I am assuming its in there).
> > > Will let you know how it looks.
> >
> > Hmmm, this is good for the original machine (as was 0.32) but not for
> > either of the other two.  I am seeing panics as below on those two.
>
> This machine seems most sensitive to it (first column):
> elm3b6
> amd64
> newisys
> 4cpu
> config: amd64
>
> Can you throw this debugging patch at it please? The console output might
> be very helpful. On top of sched-rsdl-0.33 thanks!

Better yet this one which checks the expired array as well and after 
pull_task.

If anyone's getting a bug they think might be due to rsdl please try this (on 
rsdl 0.33).

---
 kernel/sched.c |   51 +++
 1 file changed, 51 insertions(+)

Index: linux-2.6.21-rc4-mm1/kernel/sched.c
===
--- linux-2.6.21-rc4-mm1.orig/kernel/sched.c2007-03-24 08:32:19.0 
+1100
+++ linux-2.6.21-rc4-mm1/kernel/sched.c 2007-03-24 10:22:59.0 +1100
@@ -659,6 +659,35 @@ static inline void set_task_entitlement(
p->time_slice = p->quota;
 }
 
+static int debug_rqbitmap(struct rq *rq)
+{
+   struct list_head *queue;
+   int idx = 0, error = 0;
+   struct prio_array *array;
+
+   for (idx = 0; idx < MAX_PRIO; idx++) {
+   array = rq->active;
+   queue = array->queue + idx;
+   if (!list_empty(queue)) {
+   if (!test_bit(idx, rq->dyn_bitmap)) {
+   __set_bit(idx, rq->dyn_bitmap);
+   error = 1;
+   printk(KERN_ERR "MISSING DYNAMIC BIT %d\n", 
idx);
+   }
+   }
+   array = rq->expired;
+   queue = array->queue + idx;
+   if (!list_empty(queue)) {
+   if (!test_bit(idx, rq->exp_bitmap)) {
+   __set_bit(idx, rq->exp_bitmap);
+   error = 1;
+   printk(KERN_ERR "MISSING EXPIRED BIT %d\n", 
idx);
+   }
+   }
+   }
+   return error;
+}
+
 /*
  * There is no specific hard accounting. The dynamic bits can have
  * false positives. rt_tasks can only be on the active queue.
@@ -679,6 +708,7 @@ static void dequeue_task(struct task_str
	list_del_init(&p->run_list);
if (list_empty(p->array->queue + p->prio))
__clear_bit(p->prio, p->array->prio_bitmap);
+   WARN_ON(debug_rqbitmap(rq));
 }
 
 /*
@@ -797,12 +827,14 @@ static void enqueue_task(struct task_str
 {
__enqueue_task(p, rq);
	list_add_tail(&p->run_list, p->array->queue + p->prio);
+   WARN_ON(debug_rqbitmap(rq));
 }
 
 static inline void enqueue_task_head(struct task_struct *p, struct rq *rq)
 {
__enqueue_task(p, rq);
	list_add(&p->run_list, p->array->queue + p->prio);
+   WARN_ON(debug_rqbitmap(rq));
 }
 
 /*
@@ -820,6 +852,7 @@ static void requeue_task(struct task_str
__clear_bit(old_prio, old_array->prio_bitmap);
set_dynamic_bit(p, rq);
}
+   WARN_ON(debug_rqbitmap(rq));
 }
 
 /*
@@ -906,6 +939,7 @@ static inline void __activate_task(struc
 {
enqueue_task(p, rq);
inc_nr_running(p, rq);
+   WARN_ON(debug_r

Re: 2.6.21-rc4-mm1

2007-03-23 Thread Con Kolivas
On Friday 23 March 2007 23:28, Andy Whitcroft wrote:
> Andy Whitcroft wrote:
> > Con Kolivas wrote:
> >> On Friday 23 March 2007 05:17, Andy Whitcroft wrote:
> >>> Ok, I have yet a third x86_64 machine that is blowing up with the latest
> >>> 2.6.21-rc4-mm1+hotfixes+rsdl-0.32 but working with
> >>> 2.6.21-rc4-mm1+hotfixes-RSDL.  I have results on various hotfix levels
> >>> so I have just fired off a set of tests across the affected machines on
> >>> that latest hotfix stack plus the RSDL backout and the results should
> >>> be in in the next hour or two.
> >>>
> >>> I think there is a strong correlation between RSDL and these hangs. 
> >>> Any suggestions as to the next step.
> >>
> >> Found a nasty in requeue_task
> >> +  if (list_empty(old_array->queue + old_prio))
> >> +  __clear_bit(old_prio, p->array->prio_bitmap);
> >>
> >> see anything wrong there? I do :P
> >>
> >> I'll queue that up with the other changes pending and hopefully that
> >> will fix your bug.
> >
> > Tests queued with your rdsl-0.33 patch (I am assuming its in there).
> > Will let you know how it looks.
>
> Hmmm, this is good for the original machine (as was 0.32) but not for
> either of the other two.  I am seeing panics as below on those two.

This machine seems most sensitive to it (first column):
elm3b6
amd64
newisys
4cpu
config: amd64

Can you throw this debugging patch at it please? The console output might be 
very helpful. On top of sched-rsdl-0.33 thanks!

---
 kernel/sched.c |   39 +++
 1 file changed, 39 insertions(+)

Index: linux-2.6.21-rc4-mm1/kernel/sched.c
===
--- linux-2.6.21-rc4-mm1.orig/kernel/sched.c2007-03-24 08:32:19.0 
+1100
+++ linux-2.6.21-rc4-mm1/kernel/sched.c 2007-03-24 08:42:04.0 +1100
@@ -659,6 +659,25 @@ static inline void set_task_entitlement(
p->time_slice = p->quota;
 }
 
+static int debug_rqbitmap(struct rq *rq)
+{
+   struct list_head *queue;
+   int idx = 0, error = 0;
+   struct prio_array *array = rq->active;
+
+   for (idx = 0; idx < MAX_PRIO; idx++) {
+   queue = array->queue + idx;
+   if (!list_empty(queue)) {
+   if (!test_bit(idx, rq->dyn_bitmap)) {
+   __set_bit(idx, rq->dyn_bitmap);
+   error = 1;
+   printk(KERN_ERR "MISSING DYNAMIC BIT %d\n", 
idx);
+   }
+   }
+   }
+   return error;
+}
+
 /*
  * There is no specific hard accounting. The dynamic bits can have
  * false positives. rt_tasks can only be on the active queue.
@@ -679,6 +698,7 @@ static void dequeue_task(struct task_str
	list_del_init(&p->run_list);
if (list_empty(p->array->queue + p->prio))
__clear_bit(p->prio, p->array->prio_bitmap);
+   WARN_ON(debug_rqbitmap(rq));
 }
 
 /*
@@ -797,12 +817,14 @@ static void enqueue_task(struct task_str
 {
__enqueue_task(p, rq);
	list_add_tail(&p->run_list, p->array->queue + p->prio);
+   WARN_ON(debug_rqbitmap(rq));
 }
 
 static inline void enqueue_task_head(struct task_struct *p, struct rq *rq)
 {
__enqueue_task(p, rq);
	list_add(&p->run_list, p->array->queue + p->prio);
+   WARN_ON(debug_rqbitmap(rq));
 }
 
 /*
@@ -820,6 +842,7 @@ static void requeue_task(struct task_str
__clear_bit(old_prio, old_array->prio_bitmap);
set_dynamic_bit(p, rq);
}
+   WARN_ON(debug_rqbitmap(rq));
 }
 
 /*
@@ -906,6 +929,7 @@ static inline void __activate_task(struc
 {
enqueue_task(p, rq);
inc_nr_running(p, rq);
+   WARN_ON(debug_rqbitmap(rq));
 }
 
 /*
@@ -1006,6 +1030,7 @@ static void deactivate_task(struct task_
 {
dec_nr_running(p, rq);
dequeue_task(p, rq);
+   WARN_ON(debug_rqbitmap(rq));
 }
 
 /*
@@ -1718,9 +1743,11 @@ void fastcall wake_up_new_task(struct ta
 * Parent and child are on different CPUs, now get the
 * parent runqueue to update the parent's ->flags:
 */
+   WARN_ON(debug_rqbitmap(rq));
 		task_rq_unlock(rq, &flags);
 		this_rq = task_rq_lock(current, &flags);
 	}
+	WARN_ON(debug_rqbitmap(this_rq));
 	task_rq_unlock(this_rq, &flags);
 }
 
@@ -3357,6 +3384,7 @@ static inline void major_prio_rotation(s
rq->dyn_bitmap = rq->active->prio_bitmap;
rq->best_static_prio = MAX_PRIO - 1;
rq->prio_rotation++;
+   WARN_ON(debug_rqbitmap(rq));
 }
 
 /*
@@ -3399,6 +3427,8 @@ static inline

RSDL cpu scheduler v 0.33

2007-03-23 Thread Con Kolivas
Latest version of RSDL cpu scheduler (v0.33) for various trees available here:

http://ck.kolivas.org/patches/staircase-deadline/

-- 
-ck
-


Re: 2.6.21-rc4-mm1

2007-03-23 Thread Con Kolivas
On Friday 23 March 2007 23:28, Andy Whitcroft wrote:
 Andy Whitcroft wrote:
  Con Kolivas wrote:
  On Friday 23 March 2007 05:17, Andy Whitcroft wrote:
  Ok, I have yet a third x86_64 machine is is blowing up with the latest
  2.6.21-rc4-mm1+hotfixes+rsdl-0.32 but working with
  2.6.21-rc4-mm1+hotfixes-RSDL.  I have results on various hotfix levels
  so I have just fired off a set of tests across the affected machines on
  that latest hotfix stack plus the RSDL backout and the results should
  be in in the next hour or two.
 
  I think there is a strong correlation between RSDL and these hangs. 
  Any suggestions as to the next step.
 
  Found a nasty in requeue_task
  +  if (list_empty(old_array-queue + old_prio))
  +  __clear_bit(old_prio, p-array-prio_bitmap);
 
  see anything wrong there? I do :P
 
  I'll queue that up with the other changes pending and hopefully that
  will fix your bug.
 
  Tests queued with your rdsl-0.33 patch (I am assuming its in there).
  Will let you know how it looks.

 Hmmm, this is good for the original machine (as was 0.32) but not for
 either of the other two.  I am seeing panics as below on those two.

This machine seems most sensitive to it (first column):
elm3b6
amd64
newisys
4cpu
config: amd64

Can you throw this debugging patch at it please? The console output might be 
very helpful. On top of sched-rsdl-0.33 thanks!

---
 kernel/sched.c |   39 +++
 1 file changed, 39 insertions(+)

Index: linux-2.6.21-rc4-mm1/kernel/sched.c
===
--- linux-2.6.21-rc4-mm1.orig/kernel/sched.c2007-03-24 08:32:19.0 
+1100
+++ linux-2.6.21-rc4-mm1/kernel/sched.c 2007-03-24 08:42:04.0 +1100
@@ -659,6 +659,25 @@ static inline void set_task_entitlement(
p-time_slice = p-quota;
 }
 
+static int debug_rqbitmap(struct rq *rq)
+{
+   struct list_head *queue;
+   int idx = 0, error = 0;
+   struct prio_array *array = rq-active;
+
+   for (idx = 0; idx  MAX_PRIO; idx++) {
+   queue = array-queue + idx;
+   if (!list_empty(queue)) {
+   if (!test_bit(idx, rq-dyn_bitmap)) {
+   __set_bit(idx, rq-dyn_bitmap);
+   error = 1;
+   printk(KERN_ERR MISSING DYNAMIC BIT %d\n, 
idx);
+   }
+   }
+   }
+   return error;
+}
+
 /*
  * There is no specific hard accounting. The dynamic bits can have
  * false positives. rt_tasks can only be on the active queue.
@@ -679,6 +698,7 @@ static void dequeue_task(struct task_str
list_del_init(p-run_list);
if (list_empty(p-array-queue + p-prio))
__clear_bit(p-prio, p-array-prio_bitmap);
+   WARN_ON(debug_rqbitmap(rq));
 }
 
 /*
@@ -797,12 +817,14 @@ static void enqueue_task(struct task_str
 {
__enqueue_task(p, rq);
list_add_tail(p-run_list, p-array-queue + p-prio);
+   WARN_ON(debug_rqbitmap(rq));
 }
 
 static inline void enqueue_task_head(struct task_struct *p, struct rq *rq)
 {
__enqueue_task(p, rq);
list_add(p-run_list, p-array-queue + p-prio);
+   WARN_ON(debug_rqbitmap(rq));
 }
 
 /*
@@ -820,6 +842,7 @@ static void requeue_task(struct task_str
__clear_bit(old_prio, old_array-prio_bitmap);
set_dynamic_bit(p, rq);
}
+   WARN_ON(debug_rqbitmap(rq));
 }
 
 /*
@@ -906,6 +929,7 @@ static inline void __activate_task(struc
 {
enqueue_task(p, rq);
inc_nr_running(p, rq);
+   WARN_ON(debug_rqbitmap(rq));
 }
 
 /*
@@ -1006,6 +1030,7 @@ static void deactivate_task(struct task_
 {
dec_nr_running(p, rq);
dequeue_task(p, rq);
+   WARN_ON(debug_rqbitmap(rq));
 }
 
 /*
@@ -1718,9 +1743,11 @@ void fastcall wake_up_new_task(struct ta
 * Parent and child are on different CPUs, now get the
 * parent runqueue to update the parent's -flags:
 */
+   WARN_ON(debug_rqbitmap(rq));
task_rq_unlock(rq, flags);
this_rq = task_rq_lock(current, flags);
}
+   WARN_ON(debug_rqbitmap(this_rq));
task_rq_unlock(this_rq, flags);
 }
 
@@ -3357,6 +3384,7 @@ static inline void major_prio_rotation(s
rq-dyn_bitmap = rq-active-prio_bitmap;
rq-best_static_prio = MAX_PRIO - 1;
rq-prio_rotation++;
+   WARN_ON(debug_rqbitmap(rq));
 }
 
 /*
@@ -3399,6 +3427,8 @@ static inline void rotate_runqueue_prior
}
memset(rq-prio_quota, 0, ARRAY_SIZE(rq-prio_quota));
major_prio_rotation(rq);
+   WARN_ON(debug_rqbitmap(rq));
+
} else {
/* Minor rotation */
new_prio_level = rq-prio_level + 1;
@@ -3409,6 +3439,7 @@ static inline void rotate_runqueue_prior
__set_bit(new_prio_level, rq-dyn_bitmap

debug rsdl 0.33

2007-03-23 Thread Con Kolivas
On Saturday 24 March 2007 08:45, Con Kolivas wrote:
 On Friday 23 March 2007 23:28, Andy Whitcroft wrote:
  Andy Whitcroft wrote:
   Con Kolivas wrote:
   On Friday 23 March 2007 05:17, Andy Whitcroft wrote:
   Ok, I have yet a third x86_64 machine is is blowing up with the
   latest 2.6.21-rc4-mm1+hotfixes+rsdl-0.32 but working with
   2.6.21-rc4-mm1+hotfixes-RSDL.  I have results on various hotfix
   levels so I have just fired off a set of tests across the affected
   machines on that latest hotfix stack plus the RSDL backout and the
   results should be in in the next hour or two.
  
   I think there is a strong correlation between RSDL and these hangs.
   Any suggestions as to the next step.
  
   Found a nasty in requeue_task
   +if (list_empty(old_array-queue + old_prio))
   +__clear_bit(old_prio, p-array-prio_bitmap);
  
   see anything wrong there? I do :P
  
   I'll queue that up with the other changes pending and hopefully that
   will fix your bug.
  
   Tests queued with your rdsl-0.33 patch (I am assuming its in there).
   Will let you know how it looks.
 
  Hmmm, this is good for the original machine (as was 0.32) but not for
  either of the other two.  I am seeing panics as below on those two.

 This machine seems most sensitive to it (first column):
 elm3b6
 amd64
 newisys
 4cpu
 config: amd64

 Can you throw this debugging patch at it please? The console output might
 be very helpful. On top of sched-rsdl-0.33 thanks!

Better yet this one which checks the expired array as well and after 
pull_task.

If anyone's getting a bug they think might be due to rsdl please try this (on 
rsdl 0.33).

---
 kernel/sched.c |   51 +++
 1 file changed, 51 insertions(+)

Index: linux-2.6.21-rc4-mm1/kernel/sched.c
===
--- linux-2.6.21-rc4-mm1.orig/kernel/sched.c2007-03-24 08:32:19.0 
+1100
+++ linux-2.6.21-rc4-mm1/kernel/sched.c 2007-03-24 10:22:59.0 +1100
@@ -659,6 +659,35 @@ static inline void set_task_entitlement(
p->time_slice = p->quota;
 }
 
+static int debug_rqbitmap(struct rq *rq)
+{
+   struct list_head *queue;
+   int idx = 0, error = 0;
+   struct prio_array *array;
+
+   for (idx = 0; idx < MAX_PRIO; idx++) {
+   array = rq->active;
+   queue = array->queue + idx;
+   if (!list_empty(queue)) {
+   if (!test_bit(idx, rq->dyn_bitmap)) {
+   __set_bit(idx, rq->dyn_bitmap);
+   error = 1;
+   printk(KERN_ERR "MISSING DYNAMIC BIT %d\n", idx);
+   }
+   }
+   array = rq->expired;
+   queue = array->queue + idx;
+   if (!list_empty(queue)) {
+   if (!test_bit(idx, rq->exp_bitmap)) {
+   __set_bit(idx, rq->exp_bitmap);
+   error = 1;
+   printk(KERN_ERR "MISSING EXPIRED BIT %d\n", idx);
+   }
+   }
+   }
+   return error;
+}
+
 /*
  * There is no specific hard accounting. The dynamic bits can have
  * false positives. rt_tasks can only be on the active queue.
@@ -679,6 +708,7 @@ static void dequeue_task(struct task_str
list_del_init(&p->run_list);
if (list_empty(p->array->queue + p->prio))
__clear_bit(p->prio, p->array->prio_bitmap);
+   WARN_ON(debug_rqbitmap(rq));
 }
 
 /*
@@ -797,12 +827,14 @@ static void enqueue_task(struct task_str
 {
__enqueue_task(p, rq);
list_add_tail(&p->run_list, p->array->queue + p->prio);
+   WARN_ON(debug_rqbitmap(rq));
 }
 
 static inline void enqueue_task_head(struct task_struct *p, struct rq *rq)
 {
__enqueue_task(p, rq);
list_add(&p->run_list, p->array->queue + p->prio);
+   WARN_ON(debug_rqbitmap(rq));
 }
 
 /*
@@ -820,6 +852,7 @@ static void requeue_task(struct task_str
__clear_bit(old_prio, old_array->prio_bitmap);
set_dynamic_bit(p, rq);
}
+   WARN_ON(debug_rqbitmap(rq));
 }
 
 /*
@@ -906,6 +939,7 @@ static inline void __activate_task(struc
 {
enqueue_task(p, rq);
inc_nr_running(p, rq);
+   WARN_ON(debug_rqbitmap(rq));
 }
 
 /*
@@ -1006,6 +1040,7 @@ static void deactivate_task(struct task_
 {
dec_nr_running(p, rq);
dequeue_task(p, rq);
+   WARN_ON(debug_rqbitmap(rq));
 }
 
 /*
@@ -1718,9 +1753,11 @@ void fastcall wake_up_new_task(struct ta
 * Parent and child are on different CPUs, now get the
 * parent runqueue to update the parent's ->flags:
 */
+   WARN_ON(debug_rqbitmap(rq));
task_rq_unlock(rq, &flags);
this_rq = task_rq_lock(current, &flags);
}
+   WARN_ON(debug_rqbitmap(this_rq));
task_rq_unlock

[PATCH] sched: rsdl yet more fixes

2007-03-22 Thread Con Kolivas
This one should hopefully fix Andy's bug.

To be queued on top of what's already in -mm please. Will make v.33 with these
changes for other trees soon.

---
The wrong bit could be unset on requeue_task which could cause an oops.
Fix that.

sched_yield semantics became almost a noop so change back to expiring tasks
when yield is called.

recalc_task_prio() performed during pull_task() on SMP may not reliably
be doing the right thing to tasks queued on the new runqueue. Add a
special variant of enqueue_task that does its own local recalculation of
priority and quota.

rq->best_static_prio should not be set by realtime or SCHED_BATCH tasks.
Correct that, and microoptimise the code around setting best_static_prio.

Signed-off-by: Con Kolivas <[EMAIL PROTECTED]>

---
 kernel/sched.c |  103 +++--
 1 file changed, 71 insertions(+), 32 deletions(-)

Index: linux-2.6.21-rc4-mm1/kernel/sched.c
===
--- linux-2.6.21-rc4-mm1.orig/kernel/sched.c2007-03-23 11:28:25.0 
+1100
+++ linux-2.6.21-rc4-mm1/kernel/sched.c 2007-03-23 17:28:19.0 +1100
@@ -714,17 +714,17 @@ static inline int entitled_slot(int stat
  */
 static inline int next_entitled_slot(struct task_struct *p, struct rq *rq)
 {
-   if (p->static_prio < rq->best_static_prio && p->policy != SCHED_BATCH)
-   return SCHED_PRIO(find_first_zero_bit(p->bitmap, PRIO_RANGE));
-   else {
-   DECLARE_BITMAP(tmp, PRIO_RANGE);
+   DECLARE_BITMAP(tmp, PRIO_RANGE);
+   int search_prio;
 
-   bitmap_or(tmp, p->bitmap,
- prio_matrix[USER_PRIO(p->static_prio)],
- PRIO_RANGE);
-   return SCHED_PRIO(find_next_zero_bit(tmp, PRIO_RANGE,
-   USER_PRIO(rq->prio_level)));
-   }
+   if (p->static_prio < rq->best_static_prio && p->policy != SCHED_BATCH)
+   search_prio = MAX_RT_PRIO;
+   else
+   search_prio = rq->prio_level;
+   bitmap_or(tmp, p->bitmap, prio_matrix[USER_PRIO(p->static_prio)],
+ PRIO_RANGE);
+   return SCHED_PRIO(find_next_zero_bit(tmp, PRIO_RANGE,
+   USER_PRIO(search_prio)));
 }
 
 static void queue_expired(struct task_struct *p, struct rq *rq)
@@ -817,7 +817,7 @@ static void requeue_task(struct task_str
list_move_tail(&p->run_list, p->array->queue + p->prio);
if (!rt_task(p)) {
if (list_empty(old_array->queue + old_prio))
-   __clear_bit(old_prio, p->array->prio_bitmap);
+   __clear_bit(old_prio, old_array->prio_bitmap);
set_dynamic_bit(p, rq);
}
 }
@@ -2074,25 +2074,54 @@ void sched_exec(void)
 }
 
 /*
+ * This is a unique version of enqueue_task for the SMP case where a task
+ * has just been moved across runqueues. It uses the information from the
+ * old runqueue to help it make a decision much like recalc_task_prio. As
+ * the new runqueue is almost certainly at a different prio_level than the
+ * src_rq it is cheapest just to pick the next entitled slot.
+ */
+static inline void enqueue_pulled_task(struct rq *src_rq, struct rq *rq,
+  struct task_struct *p)
+{
+   int queue_prio;
+
+   p->array = rq->active;
+   if (!rt_task(p)) {
+   if (p->rotation == src_rq->prio_rotation) {
+   if (p->array == src_rq->expired) {
+   queue_expired(p, rq);
+   goto out_queue;
+   }
+   } else
+   task_new_array(p, rq);
+   }
+   queue_prio = next_entitled_slot(p, rq);
+   if (queue_prio >= MAX_PRIO) {
+   queue_expired(p, rq);
+   goto out_queue;
+   }
+   rq_quota(rq, queue_prio) += p->quota;
+   p->prio = queue_prio;
+out_queue:
+   p->normal_prio = p->prio;
+   p->rotation = rq->prio_rotation;
+   sched_info_queued(p);
+   set_dynamic_bit(p, rq);
+   list_add_tail(&p->run_list, p->array->queue + p->prio);
+}
+
+/*
  * pull_task - move a task from a remote runqueue to the local runqueue.
  * Both runqueues must be locked.
  */
-static void pull_task(struct rq *src_rq, struct prio_array *src_array,
- struct task_struct *p, struct rq *this_rq,
- int this_cpu)
+static void pull_task(struct rq *src_rq, struct task_struct *p,
+ struct rq *this_rq, int this_cpu)
 {
dequeue_task(p, src_rq);
dec_nr_running(p, src_rq);
set_task_cpu(p, this_cpu);
inc_nr_running(p, this_rq);
-
-   /*
-* If this task has already been running on src_rq this priority
-* cycl

Re: 2.6.21-rc4-mm1

2007-03-22 Thread Con Kolivas
On Friday 23 March 2007 05:17, Andy Whitcroft wrote:
> Ok, I have yet a third x86_64 machine that is blowing up with the latest
> 2.6.21-rc4-mm1+hotfixes+rsdl-0.32 but working with
> 2.6.21-rc4-mm1+hotfixes-RSDL.  I have results on various hotfix levels
> so I have just fired off a set of tests across the affected machines on
> that latest hotfix stack plus the RSDL backout and the results should be
> in in the next hour or two.
>
> I think there is a strong correlation between RSDL and these hangs.  Any
> suggestions as to the next step.

Found a nasty in requeue_task
+   if (list_empty(old_array->queue + old_prio))
+   __clear_bit(old_prio, p->array->prio_bitmap);

see anything wrong there? I do :P

I'll queue that up with the other changes pending and hopefully that will fix 
your bug.

-- 
-ck
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: RSDL v0.31

2007-03-22 Thread Con Kolivas
On Friday 23 March 2007 15:39, Mike Galbraith wrote:
> On Fri, 2007-03-23 at 09:50 +1100, Con Kolivas wrote:
> > Now to figure out some meaningful cheap way of improving this accounting.
>
> The accounting is easy iff tick resolution is good enough, the deadline
> mechanism is harder.  I did the "quota follows task" thing, but nothing
> good happens.  That just ensured that the deadline mechanism kicks in
> constantly because tick theft is a fact of tick-based life.  A
> reasonable fudge factor would help, but...
>
> I see problems wrt with trying to implement the deadline mechanism.
>
> As implemented, it can't identify who is doing the stealing (which
> happens constantly, even if userland is 100% hog) because of tick
> resolution accounting.  If you can't identify the culprit, you can't
> enforce the quota, and quotas which are not enforced are, strictly
> speaking, not quotas.  At tick time, you can only close the barn door
> after the cow has been stolen, and the thief can theoretically visit
> your barn an infinite number of times while you aren't watching the
> door.  ("don't blink" scenarios, and tick is backward-assward blink)
>
> You can count nanoseconds in schedule, and store the actual usage, but
> then you still have the problem of inaccuracies in sched_clock() from
> cross-cpu wakeup and migration.  Cross-cpu wakeups happen quite a lot.
> If sched_clock() _were_ absolutely accurate, you wouldn't need the
> runqueue deadline mechanism, because at slice tick time you can see
> everything you will ever see without moving enforcement directly into
> the most critical of paths.
>
> IMHO, unless it can be demonstrated that timeslice theft is a problem
> with a real-life scenario, you'd be better off dropping the queue
> ticking.  Time slices are a deadline mechanism, and in practice the god
> of randomness ensures that even fast movers do get caught often enough
> to make ticking tasks sufficient.
>
> (that was a very long-winded reply to one sentence because I spent a lot
> of time looking into this very subject and came to the conclusion that
> you can't get there from here.  fwiw, ymmv and all that of course;)
>
> > Thanks again!
>
> You're welcome.

The deadline mechanism is easy to hit and works. Try printk'ing it. There is 
some leeway to take tick accounting into the equation and I don't believe 
nanosecond resolution is required at all for this (how much leeway would you 
give then ;)). Eventually there is nothing to stop us using highres timers 
(blessed if they work as planned everywhere eventually) to do the events and 
do away with scheduler_tick entirely. For now ticks work fine; a reasonable 
estimate for smp migration will suffice (patch forthcoming).
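
A minimal way to watch it fire, purely an illustrative sketch and not part of
the RSDL patches, is a ratelimited printk in the quota check quoted above from
task_running_tick():

	if (!rt_task(p) && --rq_quota(rq, rq->prio_level) < 0) {
		/* debug-only counter, racy on SMP but fine for eyeballing */
		static unsigned long rsdl_deadline_hits;

		rsdl_deadline_hits++;
		if (printk_ratelimit())
			printk(KERN_DEBUG "rsdl: deadline hit %lu at prio_level %d\n",
			       rsdl_deadline_hits, rq->prio_level);
		if (unlikely(p->first_time_slice))
			p->first_time_slice = 0;
		rotate_runqueue_priority(rq);
		set_tsk_need_resched(p);
	}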

-- 
-ck
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: RSDL 0.31 causes slowdown

2007-03-22 Thread Con Kolivas

On 23/03/07, Tim Chen <[EMAIL PROTECTED]> wrote:

Con,

I've tried running Volanomark and found a 80% regression
with RSDL 0.31 scheduler on 2.6.21-rc4 on a 2 socket Core 2 quad cpu
system (4 cpus per socket, 8 cpus for system).

The results are sensitive to rr_interval. Using Con's patch to increase
rr_interval to a large value of 100,
the regression reduced to 30% instead of 80%.

I ran Volanomark in loopback mode with 10 chatrooms
(20 clients per chatroom) configuration, with each client sending
out 1 messages.

http://www.volano.com/benchmarks.html

There are significant differences in the vmstat runqueue profile
between the 2.6.21-rc4 and the one with RSDL.

There are a lot less runnable jobs (see col 2) with RSDL 0.31  (rr_interval=15)
and higher idle time.


Thanks Tim.

Volanomark is a purely yield() semantic dependent workload (as
discussed many times previously). In the earlier form of RSDL I
softened the effect of sched_yield but other changes since then have
made that softness bordering on a noop. Obviously when sched_yield is
relied upon that will not be enough. Extending the rr interval simply
makes the yield slightly more effective and is not the proper
workaround. Since expiration of arrays is a regular frequent
occurrence in RSDL then changing yield semantics back to expiration
should cause a massive improvement in these values, without making the
yields as long as in mainline. It's impossible to know exactly what
the final result will be since java uses this timing sensitive yield
for locking but we can improve it drastically from this. I'll make a
patch soon to change yield again.
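
A sketch of that direction, assuming the simplest possible form and leaving out
the RT round-robin case and the careful unlock-before-schedule sequence the
real sys_sched_yield() uses, is to push the yielding task through the same
expiry path it would take when its quota runs out:

	asmlinkage long sys_sched_yield(void)
	{
		struct rq *rq = this_rq_lock();

		/* expire the caller as if its quota ran out, so it waits for
		 * a full rotation instead of running again almost at once */
		if (!rt_task(current))
			task_expired_entitlement(rq, current);
		spin_unlock_irq(&rq->lock);
		schedule();
		return 0;
	}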

--
-ck
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: RSDL v0.31

2007-03-22 Thread Con Kolivas
Thanks for taking the time to actually look at the code. All audits are most 
welcome!.

On Thursday 22 March 2007 18:07, Mike Galbraith wrote:
> This is a rather long message, and isn't directed at anyone in
> particular, it's for others who may be digging into their own problems
> with RSDL, and for others (if any other than Con exist) who understand
> RSDL well enough to tell me if I'm missing something.  Anyone who's not
> interested in RSDL's gizzard hit 'D' now.
>
> On Wed, 2007-03-21 at 17:02 +0100, Peter Zijlstra wrote:
> > On Wed, 2007-03-21 at 15:57 +0100, Mike Galbraith wrote:
> > > 'f' is a progglet which sleeps a bit and burns a bit, duration
> > > depending on argument given. 'sh' is a shell 100% hog.  In this
> > > scenario, the argument was set such that 'f' used right at 50% cpu. 
> > > All are started at the same time, and I froze top when the first 'f'
> > > reached 1:00.
> >
> > May one enquire how much CPU the mythical 'f' uses when ran alone? Just
> > to get a gauge for the numbers?
>
> Actually, the numbers are an interesting curiosity point, but not as
> interesting as the fact that the deadline mechanism isn't kicking in.
>
> >From task_running_tick():
>
>   /*
>* Accounting is performed by both the task and the runqueue. This
>* allows frequently sleeping tasks to get their proper quota of
>* cpu as the runqueue will have their quota still available at
>* the appropriate priority level. It also means frequently waking
>* tasks that might miss the scheduler_tick() will get forced down
>* priority regardless.
>*/
>   if (!--p->time_slice)
>   task_expired_entitlement(rq, p);
>   /*
>* We only employ the deadline mechanism if we run over the quota.
>* It allows aliasing problems around the scheduler_tick to be
>* less harmful.
>*/
>   if (!rt_task(p) && --rq_quota(rq, rq->prio_level) < 0) {
>   if (unlikely(p->first_time_slice))
>   p->first_time_slice = 0;
>   rotate_runqueue_priority(rq);
>   set_tsk_need_resched(p);
>   }
>
> The reason for ticking both runqueue and task is that you can't sample a
> say 100KHz information stream at 1KHz and reproduce that information
> accurately.  IOW, task time slices "blur" at high switch frequency, you
> can't always hit tasks, so you hit what you _can_ hit every sample, the
> runqueue, to minimize the theoretical effects of time slice theft.
> (I've instrumented this before, and caught fast movers stealing 10s of
> milliseconds in extreme cases.)  Generally speaking, statistics even
> things out very much, the fast mover eventually gets hit, and pays a
> full tick for his sub-tick dip in the pool, so in practice it's not a
> great big hairy deal.
>
> If you can accept that tasks can and do dodge the tick, an imbalance
> between runqueue quota and task quota must occur.  It isn't happening
> here, and the reason appears to be bean counting error, tasks migrate
> but their quotas don't follow.  The first time a task is queued at any
> priority, quota is allocated, task goes to sleep, quota on departed
> runqueue stays behind, task awakens on a different runqueue, allocate
> more quota, repeat.  For migration, there's a twist, if you pull an
> expired task, expired tasks don't have a quota yet, so they shouldn't
> screw up bean counting.

I had considered the quota not migrating to the new runqueue but basically it 
screws up the "set quota once and deadline only kicks in if absolutely 
necessary" policy. Migration means some extra quota is left behind on the 
runqueue it left from. It is never a huge extra quota and is reset on major 
rotation, which occurs very frequently on rsdl. If I were to carry the quota 
over I would need to deduct p->time_slice from the source runqueue's quota 
and add it to the target runqueue's quota. The problem there is that once the 
time_slice has been handed out to a task, it is my position that I no longer 
trust the task to keep its accounting right; it may well have exhausted all 
its quota from the source runqueue already and be pulling quota away from 
tasks that haven't used theirs yet.
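
For reference, the "quota follows task" variant being rejected here would
amount to something like the following in pull_task(). It is purely
hypothetical, and the objection above is exactly that p->time_slice cannot be
trusted at this point:

	if (!rt_task(p) && p->rotation == src_rq->prio_rotation) {
		/* hand back whatever the task has not used on the old cpu... */
		rq_quota(src_rq, p->prio) -= p->time_slice;
		if (rq_quota(src_rq, p->prio) < 0)
			rq_quota(src_rq, p->prio) = 0;
		/* ...and carry it over to the level it lands on here */
		rq_quota(this_rq, p->prio) += p->time_slice;
	}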

See below for more on updating prio rotation and adding quota to new runqueue.

>
> >From pull_task():
>
>   /*
>* If this task has already been running on src_rq this priority
>* cycle, make the new runqueue think it has been on its cycle
>*/
>   if (p->rotation == src_rq->prio_rotation)
>   p->rotation = this_rq->prio_rotation;
>
> The intent here is clearly that this task continue on the new cpu as if
> nothing has happened.  However, when the task was dequeued, p->array was
> left as it was, points to the last place it was queued.  Stale data.
>
> >From recalc_task_prio(), which is called by enqueue_task():
>
> static void recalc_task_prio(struct task_struct *p, struct rq *rq)
> {
>   struct prio_array *array = 

Re: 2.6.21-rc4-mm1

2007-03-22 Thread Con Kolivas
On Friday 23 March 2007 05:17, Andy Whitcroft wrote:
> Andy Whitcroft wrote:
> > Con Kolivas wrote:
> >> On Thursday 22 March 2007 20:48, Andy Whitcroft wrote:
> >>> Andy Whitcroft wrote:
> >>>> Andy Whitcroft wrote:
> >>>>> Andrew Morton wrote:
> >>>>>> Temporarily at
> >>>>>>
> >>>>>>   http://userweb.kernel.org/~akpm/2.6.21-rc4-mm1/
> >>>>>>
> >>>>>> Will appear later at
> >>>>>>
> >>>>>>
> >>>>>> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.21
> >>>>>>-rc 4/2.6.21-rc4-mm1/
> >>>>>
> >>>>> [All of the below is from the pre hot-fix runs.  The very few results
> >>>>> which are in for the hot-fix runs seem worse if anything.  :(  All
> >>>>> results should be out on TKO.]
> >>>>>
> >>>>>> - Restored the RSDL CPU scheduler (a new version thereof)
> >>>>>
> >>>>> Unsure if the above is the culprit but there seems to be a smattering
> >>>>> of BUG's in kernbench from the schedular on several systems, and
> >>>>> panics which do not fully dump out.
> >>>>>
> >>>>> elm3b239 is about 2/4 kernbench being the test in progress when we
> >>>>> blammo in both failed tests, elm3b234 doesn't boot at all.
> >>>>
> >>>> Well I have one result through for backing RSDL out on elm3b239 and
> >>>> that does indeed seem to give us a successful boot and test.  peterz
> >>>> has pointed me to an incremental patch from Con which I'll push
> >>>> through testing and see if that sorts it out.
> >>>
> >>> Ok, tested the patch below on top of 2.6.21-rc4-mm1 and this seems to
> >>> fix the problem:
> >>>
> >>> http://ck.kolivas.org/patches/staircase-deadline/2.6.21-rc4-mm1-rsdl-0.
> >>>32.p atch
> >>>
> >>> Hard to tell from that patch whether it will be fixed in the changes
> >>> already committed to the next -mm.
> >>>
> >>> Its possible that it may be fixed by the following patch:
> >>>
> >>> sched-rsdl-improvements.patch
> >>>
> >>> Which has the following slipped in at the end of the changelog:
> >>>
> >>> A tiny change checking for MAX_PRIO in normal_prio()
> >>> may prevent oopses on bootup on large SMP due to
> >>> forking off the idle task.
> >>>
> >>> Con, are all the changes in the 0.32 patch above with akpm?
> >>
> >> Yes he's queued everything in that patch you tested for the next -mm.
> >> Thanks very much for testing it.
> >
> > No worries.  I've just got through the results on the other machine in
> > the mix.  That machine seems to be fixed by backing out RSDL and not by
> > the fixup 0.32 patch ...
> >
> > This second machine seems to hang hard very soon after user space starts
> > executing but without a panic.  I can't say that the symptoms are very
> > definitive, but I do have a good result from that machine without RSDL
> > and not with rsdl-0.32.
> >
> > The machine is a dual-core x86_64 machine: Dual Core AMD Opteron(tm)
> > Processor 275.
> >
> > I'll let you know if I find out anything else.  Shout if you want any
> > information or have anything you want poked or tested.
>
> Ok, I have yet a third x86_64 machine that is blowing up with the latest
> 2.6.21-rc4-mm1+hotfixes+rsdl-0.32 but working with
> 2.6.21-rc4-mm1+hotfixes-RSDL.  I have results on various hotfix levels
> so I have just fired off a set of tests across the affected machines on
> that latest hotfix stack plus the RSDL backout and the results should be
> in in the next hour or two.
>
> I think there is a strong correlation between RSDL and these hangs.  Any
> suggestions as to the next step.

If it's hitting the bug_on that I put in sched.c which you say it is then it 
is most certainly my fault. It implies a task has been queued without a 
corresponding bit being anywhere in the priority bitmaps. Somehow you only 
seem to be hitting it on big(ger) smp, which is why I haven't seen it. It 
implies some complication occurring at sched or idle init/fork that leaves 
this accounting not working. If I could reproduce it on qemu I'd step through the 
kernel init checking where each task is being queued and see if the bitmaps 
are being set. This is obviously time consuming and laborious so I don't 
expect you to do it. 

The next best thing is if you can send me the config of one of the machines 
that's oopsing I can try that on qemu but qemu is only good at debugging 
i386. If any of the machines that were oopsing were i386 that would be very 
helpful, otherwise x86_64 is the next best. Then I need to make a creative 
debugging patch for you to try which checks every queued/dequeued task and 
dumps all that information. I don't have that patch just yet so I need to 
find enough accumulated short stints at the pc to do that (still hurts a lot 
and worsens my condition).

Thanks!

-- 
-ck
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: RSDL v0.31

2007-03-22 Thread Con Kolivas
All code reviews are most welcome indeed!

On Thursday 22 March 2007 20:18, Ingo Molnar wrote:
> * Mike Galbraith <[EMAIL PROTECTED]> wrote:
> > Actually, the numbers are an interesting curiosity point, but not as
> > interesting as the fact that the deadline mechanism isn't kicking in.
>
> it's not just the scheduling accounting being off, RSDL also seems to be

I'll look at that when I have time.

> accessing stale data here:
> > >From pull_task():
> >
> > /*
> >  * If this task has already been running on src_rq this priority
> >  * cycle, make the new runqueue think it has been on its cycle
> >  */
> > if (p->rotation == src_rq->prio_rotation)
> > p->rotation = this_rq->prio_rotation;
> >
> > The intent here is clearly that this task continue on the new cpu as
> > if nothing has happened.  However, when the task was dequeued,
> > p->array was left as it was, points to the last place it was queued.
> > Stale data.

I don't think this is a problem because immediately after this in pull_task it 
calls enqueue_task() which always updates p->array in recalc_task_prio(). 
Every enqueue_task always calls recalc_task_prio on non-rt tasks so the array 
should always be set no matter where the entry point to scheduling is from 
unless I have a logic error in setting the p->array in recalc_task_prio() or 
there is another path to schedule() that I've not accounted for by making 
sure recalc_task_prio is done.

> it might point to a hot-unplugged CPU's runqueue as well. Which might
> work accidentally, but we want this fixed nevertheless.

The hot unplugged cpu's prio_rotation will be examined, and then it sets the 
prio_rotation from this runqueue's value. That shouldn't lead to any more 
problems than setting the timestamp based on the hot unplug cpus timestamp 
lower down also in pull_task()

p->timestamp = (p->timestamp - src_rq->most_recent_timestamp) +  
this_rq->most_recent_timestamp;

Thanks for looking!

-- 
-ck
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.21-rc4-mm1

2007-03-22 Thread Con Kolivas
On Thursday 22 March 2007 20:48, Andy Whitcroft wrote:
> Andy Whitcroft wrote:
> > Andy Whitcroft wrote:
> >> Andrew Morton wrote:
> >>> Temporarily at
> >>>
> >>>   http://userweb.kernel.org/~akpm/2.6.21-rc4-mm1/
> >>>
> >>> Will appear later at
> >>>
> >>>  
> >>> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.21-rc
> >>>4/2.6.21-rc4-mm1/
> >>
> >> [All of the below is from the pre hot-fix runs.  The very few results
> >> which are in for the hot-fix runs seem worse if anything.  :(  All
> >> results should be out on TKO.]
> >>
> >>> - Restored the RSDL CPU scheduler (a new version thereof)
> >>
> >> Unsure if the above is the culprit but there seems to be a smattering of
> >> BUG's in kernbench from the schedular on several systems, and panics
> >> which do not fully dump out.
> >>
> >> elm3b239 is about 2/4 kernbench being the test in progress when we
> >> blammo in both failed tests, elm3b234 doesn't boot at all.
> >
> > Well I have one result through for backing RSDL out on elm3b239 and that
> > does indeed seem to give us a successful boot and test.  peterz has
> > pointed me to an incremental patch from Con which I'll push through
> > testing and see if that sorts it out.
>
> Ok, tested the patch below on top of 2.6.21-rc4-mm1 and this seems to
> fix the problem:
>
> http://ck.kolivas.org/patches/staircase-deadline/2.6.21-rc4-mm1-rsdl-0.32.p
>atch
>
> Hard to tell from that patch whether it will be fixed in the changes
> already committed to the next -mm.
>
> Its possible that it may be fixed by the following patch:
>
> sched-rsdl-improvements.patch
>
> Which has the following slipped in at the end of the changelog:
>
> A tiny change checking for MAX_PRIO in normal_prio()
> may prevent oopses on bootup on large SMP due to
> forking off the idle task.
>
> Con, are all the changes in the 0.32 patch above with akpm?

Yes he's queued everything in that patch you tested for the next -mm. Thanks 
very much for testing it.

-- 
-ck
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] sched: rsdl improvements

2007-03-21 Thread Con Kolivas
On Thursday 22 March 2007 10:36, Andrew Morton wrote:
> On Thu, 22 Mar 2007 04:29:44 +1100
>
> Con Kolivas <[EMAIL PROTECTED]> wrote:
> > Further improve the deterministic nature of the RSDL cpu scheduler and
> > make the rr_interval tunable.
>
> I might actually need to drop RSDL from next -mm, see if those sched oopses
> which several people have reported go away.

I did mention them in the changelog further down. While it may not be 
immediately apparent from the minimal emails I'm sending, I am trying hard to 
address every known regression in the time allotted. Without access to the 
hardware though I'm reliant on others testing it so I can't know for certain 
if I've fixed them.

-- 
-ck
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] sched: rsdl check for niced tasks lowering prio level

2007-03-21 Thread Con Kolivas
Here is the best fix for the bug pointed out. Thanks.

I'll try and find pc time to wrap these two patches together and make a v0.32
available.

---
Ensure niced tasks are not inappropriately limiting sleeping unniced tasks
by explicitly checking what the best static priority that has run this
major rotation was.

Reimplement SCHED_BATCH using this check.

Signed-off-by: Con Kolivas <[EMAIL PROTECTED]>

---
 kernel/sched.c |   33 -
 1 file changed, 24 insertions(+), 9 deletions(-)

Index: linux-2.6.21-rc4-mm1/kernel/sched.c
===
--- linux-2.6.21-rc4-mm1.orig/kernel/sched.c2007-03-22 12:44:05.0 
+1100
+++ linux-2.6.21-rc4-mm1/kernel/sched.c 2007-03-22 12:58:26.0 +1100
@@ -201,8 +201,11 @@ struct rq {
struct prio_array *active, *expired, arrays[2];
unsigned long *dyn_bitmap, *exp_bitmap;
 
-   int prio_level;
-   /* The current dynamic priority level this runqueue is at */
+   int prio_level, best_static_prio;
+   /*
+* The current dynamic priority level this runqueue is at, and the
+* best static priority queued this major rotation.
+*/
 
unsigned long prio_rotation;
/* How many times we have rotated the priority queue */
@@ -704,16 +707,24 @@ static inline int entitled_slot(int stat
 
 /*
  * Find the first unused slot by this task that is also in its prio_matrix
- * level.
+ * level. Ensure that the prio_level is not unnecessarily low by checking
+ * that best_static_prio this major rotation was not a niced task.
+ * SCHED_BATCH tasks do not perform this check so they do not induce
+ * latencies in tasks of any nice level.
  */
 static inline int next_entitled_slot(struct task_struct *p, struct rq *rq)
 {
-   DECLARE_BITMAP(tmp, PRIO_RANGE);
+   if (p->static_prio < rq->best_static_prio && p->policy != SCHED_BATCH)
+   return SCHED_PRIO(find_first_zero_bit(p->bitmap, PRIO_RANGE));
+   else {
+   DECLARE_BITMAP(tmp, PRIO_RANGE);
 
-   bitmap_or(tmp, p->bitmap, prio_matrix[USER_PRIO(p->static_prio)],
- PRIO_RANGE);
-   return SCHED_PRIO(find_next_zero_bit(tmp, PRIO_RANGE,
-   USER_PRIO(rq->prio_level)));
+   bitmap_or(tmp, p->bitmap,
+ prio_matrix[USER_PRIO(p->static_prio)],
+ PRIO_RANGE);
+   return SCHED_PRIO(find_next_zero_bit(tmp, PRIO_RANGE,
+   USER_PRIO(rq->prio_level)));
+   }
 }
 
 static void queue_expired(struct task_struct *p, struct rq *rq)
@@ -3315,6 +3326,7 @@ static inline void major_prio_rotation(s
rq->active = new_array;
rq->exp_bitmap = rq->expired->prio_bitmap;
rq->dyn_bitmap = rq->active->prio_bitmap;
+   rq->best_static_prio = MAX_PRIO;
rq->prio_rotation++;
 }
 
@@ -3640,10 +3652,12 @@ need_resched_nonpreemptible:
}
 switch_tasks:
if (next == rq->idle) {
+   rq->best_static_prio = MAX_PRIO;
rq->prio_level = MAX_RT_PRIO;
rq->prio_rotation++;
schedstat_inc(rq, sched_goidle);
-   }
+   } else if (next->static_prio < rq->best_static_prio)
+   rq->best_static_prio = next->static_prio;
prefetch(next);
prefetch_stack(next);
clear_tsk_need_resched(prev);
@@ -7093,6 +7107,7 @@ void __init sched_init(void)
lockdep_set_class(&rq->lock, &rq->rq_lock_key);
rq->nr_running = 0;
rq->prio_rotation = 0;
+   rq->best_static_prio = MAX_PRIO;
rq->prio_level = MAX_RT_PRIO;
rq->active = rq->arrays;
rq->expired = rq->arrays + 1;

-- 
-ck
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] sched: rsdl improvements

2007-03-21 Thread Con Kolivas
On Thursday 22 March 2007 11:24, Con Kolivas wrote:
> On Thursday 22 March 2007 10:48, Jeffrey Hundstad wrote:
> > Artur Skawina wrote:
> > > Con Kolivas wrote:
> > >> Note no interactive boost idea here.
> > >>
> > >> Patch is for 2.6.21-rc4-mm1. I have not spent the time trying to bring
> > >> other bases in sync.
> > >
> > > I've tried RSDLv.31+this on 2.6.20.3 as i'm not tracking -mm.
> > >
> > >> Further improve the deterministic nature of the RSDL cpu scheduler and
> > >> make the rr_interval tunable.
> > >>
> > >> By only giving out priority slots to tasks at the current runqueue's
> > >> prio_level or below we can make the cpu allocation not altered by
> > >> accounting issues across major_rotation periods. This makes the cpu
> > >> allocation and latencies more deterministic, and decreases maximum
> > >> latencies substantially. This change removes the possibility that
> > >> tasks can get bursts of cpu activity which can favour towards
> > >> interactive tasks but also favour towards cpu bound tasks which happen
> > >> to wait on other activity (such as I/O) and is a net gain.
> > >
> > > I'm not sure this is going in the right direction... I'm writing
> > > this while compiling a kernel w/ "nice -20 make -j2" and X is almost
> >
> > Did you mean "nice -20"?  If so, that should have slowed X quite a bit.
> > Try "nice 19" instead.
> >
> > nice(1):
> >Run  COMMAND  with an adjusted niceness, which affects process
> > scheduling.  With no COMMAND, print the current  niceness.   Nicenesses
> > range from -20 (most favorable scheduling) to 19 (least favorable).
>
> No he's right. Something scrambled my brain and I've completely left out
> the part where I offer the old bursts as a tunable option as well, which
> unintentionally killed off SCHED_BATCH as an entity. I'll have to put that
> as an additional patch sorry as this by itself is not always a win. Hang in
> there.

Actually, reworking the priority matrix to always have a slot at position 1 
should fix this without needing a tunable. That is a better approach so I'll 
do that.
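
Given that next_entitled_slot() ORs p->bitmap with
prio_matrix[USER_PRIO(p->static_prio)] and then searches for a zero bit,
"a slot at position 1" presumably means keeping the first bit of every matrix
row clear when the matrix is built at init time, roughly as below. This is a
guess at the shape of the change, not the actual patch:

	int i;

	/* guarantee every static priority one entitled slot at the very
	 * first position of each rotation, whatever else its row masks out */
	for (i = 0; i < PRIO_RANGE; i++)
		__clear_bit(0, prio_matrix[i]);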

-- 
-ck
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] sched: rsdl improvements

2007-03-21 Thread Con Kolivas
On Thursday 22 March 2007 10:48, Jeffrey Hundstad wrote:
> Artur Skawina wrote:
> > Con Kolivas wrote:
> >> Note no interactive boost idea here.
> >>
> >> Patch is for 2.6.21-rc4-mm1. I have not spent the time trying to bring
> >> other bases in sync.
> >
> > I've tried RSDLv.31+this on 2.6.20.3 as i'm not tracking -mm.
> >
> >> Further improve the deterministic nature of the RSDL cpu scheduler and
> >> make the rr_interval tunable.
> >>
> >> By only giving out priority slots to tasks at the current runqueue's
> >> prio_level or below we can make the cpu allocation not altered by
> >> accounting issues across major_rotation periods. This makes the cpu
> >> allocation and latencies more deterministic, and decreases maximum
> >> latencies substantially. This change removes the possibility that tasks
> >> can get bursts of cpu activity which can favour towards interactive
> >> tasks but also favour towards cpu bound tasks which happen to wait on
> >> other activity (such as I/O) and is a net gain.
> >
> > I'm not sure this is going in the right direction... I'm writing
> > this while compiling a kernel w/ "nice -20 make -j2" and X is almost
>
> Did you mean "nice -20"?  If so, that should have slowed X quite a bit.
> Try "nice 19" instead.
>
> nice(1):
>Run  COMMAND  with an adjusted niceness, which affects process
> scheduling.  With no COMMAND, print the current  niceness.   Nicenesses
> range from -20 (most favorable scheduling) to 19 (least favorable).

No he's right. Something scrambled my brain and I've completely left out the 
part where I offer the old bursts as a tunable option as well, which 
unintentionally killed off SCHED_BATCH as an entity. I'll have to put that as 
an additional patch, sorry, as this by itself is not always a win. Hang in 
there.

-- 
-ck
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] sched: rsdl improvements

2007-03-21 Thread Con Kolivas
Hi all

As my time at the PC is limited, I unfortunately cannot spend it responding to
the huge number of emails I got in response to RSDL. Instead, here's a patch.
I may still be offline for extended periods at a time, so please feel free to
poke at the code yourselves, and don't take it personally if I don't respond
to your emails.

Note no interactive boost idea here.

Patch is for 2.6.21-rc4-mm1. I have not spent the time trying to bring other
bases in sync.

---
Further improve the deterministic nature of the RSDL cpu scheduler and make
the rr_interval tunable.

By only giving out priority slots to tasks at the current runqueue's
prio_level or below we can make the cpu allocation not altered by accounting
issues across major_rotation periods. This makes the cpu allocation and
latencies more deterministic, and decreases maximum latencies substantially.
This change removes the possibility that tasks can get bursts of cpu activity
which can favour towards interactive tasks but also favour towards cpu bound
tasks which happen to wait on other activity (such as I/O) and is a net
gain.
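
For readers less familiar with the kernel's bitmap helpers, the slot search
described above boils down to something like the following standalone sketch.
It is an illustration only, not the scheduler code: a plain 40-bit mask stands
in for a prio_matrix row, set bits (rather than the kernel's clear bits) mark
entitled slots, and all the constants and numbers are made up.

#include <stdio.h>

#define PRIO_RANGE 40	/* illustrative: one slot per nice level */

/*
 * Toy version of the slot search: a task may only take a slot that its
 * static priority is entitled to and that it has not already used this
 * rotation, and the search starts at the runqueue's current prio_level
 * so no task runs ahead of the level the rotation has reached.
 */
static int next_entitled_slot(unsigned long long entitled,
			      unsigned long long used, int prio_level)
{
	unsigned long long free_slots = entitled & ~used;
	int slot;

	for (slot = prio_level; slot < PRIO_RANGE; slot++)
		if (free_slots & (1ULL << slot))
			return slot;
	return PRIO_RANGE;	/* nothing left: the task goes to the expired array */
}

int main(void)
{
	/* made-up masks: entitled to every second slot, slots 0 and 2 already used */
	unsigned long long entitled = 0x5555555555ULL;
	unsigned long long used = (1ULL << 0) | (1ULL << 2);

	printf("from level 0: slot %d\n", next_entitled_slot(entitled, used, 0));
	printf("from level 9: slot %d\n", next_entitled_slot(entitled, used, 9));
	return 0;
}

The patch below does the equivalent with bitmap_or() and find_next_zero_bit(),
where a zero bit means an entitled, unused slot.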

This change also makes negative nice values less harmful to latencies of more
niced tasks, and should lead to less preemption which might decrease the
context switch rate and subsequently improve throughput.

The rr_interval can be made a tunable such that if an environment exists that
is not as latency sensitive, it can be increased for maximum throughput.

A tiny change checking for MAX_PRIO in normal_prio() may prevent oopses on
bootup on large SMP due to forking off the idle task.

Other minor cleanups.
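
The rr_interval tunable described a couple of paragraphs up ends up as
/proc/sys/kernel/rr_interval (per the documentation hunk below). As a rough
sketch of poking at it from userspace -- the value 16 is arbitrary, and
writing requires root:

#include <stdio.h>

int main(void)
{
	const char *path = "/proc/sys/kernel/rr_interval";
	FILE *f = fopen(path, "r");
	int ticks = 0;

	if (!f) {
		perror(path);
		return 1;
	}
	if (fscanf(f, "%d", &ticks) == 1)
		printf("current rr_interval: %d ticks\n", ticks);
	fclose(f);

	f = fopen(path, "w");		/* needs root */
	if (f) {
		fprintf(f, "%d\n", 16);	/* arbitrary example value within 1-100 */
		fclose(f);
	}
	return 0;
}

Echoing a value into the file from a shell does the same thing, of course.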

Signed-off-by: Con Kolivas <[EMAIL PROTECTED]>

---
 Documentation/sysctl/kernel.txt |   12 +
 kernel/sched.c  |   94 ++--
 kernel/sysctl.c |   25 --
 3 files changed, 83 insertions(+), 48 deletions(-)

Index: linux-2.6.21-rc4-mm1/Documentation/sysctl/kernel.txt
===
--- linux-2.6.21-rc4-mm1.orig/Documentation/sysctl/kernel.txt   2007-03-21 
20:53:50.0 +1100
+++ linux-2.6.21-rc4-mm1/Documentation/sysctl/kernel.txt2007-03-21 
20:54:19.0 +1100
@@ -43,6 +43,7 @@ show up in /proc/sys/kernel:
 - printk
 - real-root-dev   ==> Documentation/initrd.txt
 - reboot-cmd  [ SPARC only ]
+- rr_interval
 - rtsig-max
 - rtsig-nr
 - sem
@@ -288,6 +289,17 @@ rebooting. ???
 
 ==
 
+rr_interval:
+
+This is the smallest duration that any cpu process scheduling unit
+will run for. Increasing this value can increase throughput of cpu
+bound tasks substantially but at the expense of increased latencies
+overall. This value is in _ticks_ and the default value chosen depends
+on the number of cpus available at scheduler initialisation. Valid
+values are from 1-100.
+
+==
+
 rtsig-max & rtsig-nr:
 
 The file rtsig-max can be used to tune the maximum number
Index: linux-2.6.21-rc4-mm1/kernel/sched.c
===
--- linux-2.6.21-rc4-mm1.orig/kernel/sched.c2007-03-21 20:53:50.0 
+1100
+++ linux-2.6.21-rc4-mm1/kernel/sched.c 2007-03-22 03:58:42.0 +1100
@@ -93,8 +93,10 @@ unsigned long long __attribute__((weak))
 /*
  * This is the time all tasks within the same priority round robin.
  * Set to a minimum of 8ms. Scales with number of cpus and rounds with HZ.
+ * Tunable via /proc interface.
  */
-static unsigned int rr_interval __read_mostly;
+int rr_interval __read_mostly;
+
 #define RR_INTERVAL 8
 #define DEF_TIMESLICE  (rr_interval * 20)
 
@@ -686,19 +688,32 @@ static inline void task_new_array(struct
p->rotation = rq->prio_rotation;
 }
 
+/* Find the first slot from the relevant prio_matrix entry */
 static inline int first_prio_slot(struct task_struct *p)
 {
return SCHED_PRIO(find_first_zero_bit(
prio_matrix[USER_PRIO(p->static_prio)], PRIO_RANGE));
 }
 
-static inline int next_prio_slot(struct task_struct *p, int prio)
+/* Is a dynamic_prio part of the allocated slots for this static_prio */
+static inline int entitled_slot(int static_prio, int dynamic_prio)
+{
+   return !test_bit(USER_PRIO(dynamic_prio),
+   prio_matrix[USER_PRIO(static_prio)]);
+}
+
+/*
+ * Find the first unused slot by this task that is also in its prio_matrix
+ * level.
+ */
+static inline int next_entitled_slot(struct task_struct *p, struct rq *rq)
 {
DECLARE_BITMAP(tmp, PRIO_RANGE);
+
bitmap_or(tmp, p->bitmap, prio_matrix[USER_PRIO(p->static_prio)],
  PRIO_RANGE);
return SCHED_PRIO(find_next_zero_bit(tmp, PRIO_RANGE,
-   USER_PRIO(prio)));
+   USER_PRIO(rq->prio_level)));
 }
 
 static void queue_expired(struct task_struct *p, struct rq *rq)
@@ -725,23 +740,12 @@ static void


[PATCH] sched: rsdl check for niced tasks lowering prio level

2007-03-21 Thread Con Kolivas
Here is the best fix for the bug pointed out. Thanks.

I'll try and find pc time to wrap these two patches together and make a v0.32
available.

---
Ensure niced tasks do not inappropriately limit sleeping unniced tasks by
explicitly checking the best static priority that has run this major
rotation.

Reimplement SCHED_BATCH using this check.

Signed-off-by: Con Kolivas [EMAIL PROTECTED]

---
 kernel/sched.c |   33 -
 1 file changed, 24 insertions(+), 9 deletions(-)

Index: linux-2.6.21-rc4-mm1/kernel/sched.c
===
--- linux-2.6.21-rc4-mm1.orig/kernel/sched.c2007-03-22 12:44:05.0 
+1100
+++ linux-2.6.21-rc4-mm1/kernel/sched.c 2007-03-22 12:58:26.0 +1100
@@ -201,8 +201,11 @@ struct rq {
struct prio_array *active, *expired, arrays[2];
unsigned long *dyn_bitmap, *exp_bitmap;
 
-   int prio_level;
-   /* The current dynamic priority level this runqueue is at */
+   int prio_level, best_static_prio;
+   /*
+* The current dynamic priority level this runqueue is at, and the
+* best static priority queued this major rotation.
+*/
 
unsigned long prio_rotation;
/* How many times we have rotated the priority queue */
@@ -704,16 +707,24 @@ static inline int entitled_slot(int stat
 
 /*
  * Find the first unused slot by this task that is also in its prio_matrix
- * level.
+ * level. Ensure that the prio_level is not unnecessarily low by checking
+ * that best_static_prio this major rotation was not a niced task.
+ * SCHED_BATCH tasks do not perform this check so they do not induce
+ * latencies in tasks of any nice level.
  */
 static inline int next_entitled_slot(struct task_struct *p, struct rq *rq)
 {
-   DECLARE_BITMAP(tmp, PRIO_RANGE);
+   if (p->static_prio < rq->best_static_prio && p->policy != SCHED_BATCH)
+   return SCHED_PRIO(find_first_zero_bit(p->bitmap, PRIO_RANGE));
+   else {
+   DECLARE_BITMAP(tmp, PRIO_RANGE);
 
-   bitmap_or(tmp, p->bitmap, prio_matrix[USER_PRIO(p->static_prio)],
- PRIO_RANGE);
-   return SCHED_PRIO(find_next_zero_bit(tmp, PRIO_RANGE,
-   USER_PRIO(rq->prio_level)));
+   bitmap_or(tmp, p->bitmap,
+ prio_matrix[USER_PRIO(p->static_prio)],
+ PRIO_RANGE);
+   return SCHED_PRIO(find_next_zero_bit(tmp, PRIO_RANGE,
+   USER_PRIO(rq->prio_level)));
+   }
 }
 
 static void queue_expired(struct task_struct *p, struct rq *rq)
@@ -3315,6 +3326,7 @@ static inline void major_prio_rotation(s
rq->active = new_array;
rq->exp_bitmap = rq->expired->prio_bitmap;
rq->dyn_bitmap = rq->active->prio_bitmap;
+   rq->best_static_prio = MAX_PRIO;
rq->prio_rotation++;
 }
 
@@ -3640,10 +3652,12 @@ need_resched_nonpreemptible:
}
 switch_tasks:
if (next == rq->idle) {
+   rq->best_static_prio = MAX_PRIO;
rq->prio_level = MAX_RT_PRIO;
rq->prio_rotation++;
schedstat_inc(rq, sched_goidle);
-   }
+   } else if (next->static_prio < rq->best_static_prio)
+   rq->best_static_prio = next->static_prio;
prefetch(next);
prefetch_stack(next);
clear_tsk_need_resched(prev);
@@ -7093,6 +7107,7 @@ void __init sched_init(void)
lockdep_set_class(&rq->lock, &rq->rq_lock_key);
rq->nr_running = 0;
rq->prio_rotation = 0;
+   rq->best_static_prio = MAX_PRIO;
rq->prio_level = MAX_RT_PRIO;
rq->active = rq->arrays;
rq->expired = rq->arrays + 1;

-- 
-ck
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] sched: rsdl improvements

2007-03-21 Thread Con Kolivas
On Thursday 22 March 2007 10:36, Andrew Morton wrote:
> On Thu, 22 Mar 2007 04:29:44 +1100
>
> Con Kolivas <[EMAIL PROTECTED]> wrote:
> > Further improve the deterministic nature of the RSDL cpu scheduler and
> > make the rr_interval tunable.
>
> I might actually need to drop RSDL from next -mm, see if those sched oopses
> which several people have reported go away.

I did mention them in the changelog further down. While it may not be 
immediately apparent from the minimal emails I'm sending, I am trying hard to 
address every known regression in the time allotted. Without access to the 
hardware, though, I'm reliant on others testing it, so I can't know for 
certain whether I've fixed them.

-- 
-ck
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: is RSDL an "unfair" scheduler too?

2007-03-17 Thread Con Kolivas
On Saturday 17 March 2007 23:28, Ingo Molnar wrote:
> * Con Kolivas <[EMAIL PROTECTED]> wrote:
> > We're obviously disagreeing on what heuristics are [...]
>
> that could very well be so - it would be helpful if you could provide
> your own rough definition for the term, so that we can agree on how to
> call things?
>
> [ in any case, there's no rush here, please reply at your own pace, as
>   your condition allows. I wish you a speedy recovery! ]
>
> > You're simply cashing in on the deep pipes that do kernel work for
> > other tasks. You know very well that I dropped the TASK_NONINTERACTIVE
> > flag from rsdl which checks that tasks are waiting on pipes and you're
> > exploiting it.
>
> Con, i am not 'cashing in' on anything and i'm not 'exploiting'
> anything. The TASK_NONINTERACTIVE flag is totally irrelevant to my
> argument because i was not testing the vanilla scheduler, i was testing
> RSDL. I could have written this test using plain sockets, because i was
> testing RSDL's claim of not having heuristics, i was not testing the
> vanilla scheduler.
>
> I have simply replied to this claim of yours:
> > > Despite the claims to the contrary, RSDL does not have _less_
> > > heuristics, it does not have _any_. [...]
>
> and i showed you a workload under _RSDL_ that clearly shows that RSDL is
> an unfair scheduler too.
>
> my whole point was to counter the myth of 'RSDL has no heuristics'. Of
> course it has heuristics, which results in unfairness. (If it didnt have
> any heuristics that tilt the balance of scheduling towards sleep-intense
> tasks then a default Linux desktop would not be usable at all.)
>
> so the decision is _not_ a puristic "do we want to have heuristics or
> not", the question is a more practical "which heuristics are simpler,
> which heuristics are more flexible, which heuristics result in better
> behavior".
>
>   Ingo

Ok but please look at how it appears from my end (illness aside).

I spent 3 years just diddling with scheduler code, trying my hardest to find a 
design that fixes a whole swag of problems we still have, and a swag of 
problems we might get with other fixes.

You initially said you were pleased with this design.

...lots of code, testing, bugfixes and good feedback.

Then Mike has one testcase that most other users disagree is worthy of being 
considered a regression. You latched onto that and basically called it a 
showstopper in spite of who knows how many other positive things.

Then you quickly produce a counter patch designed to kill off RSDL with a 
config option for mainline.

Then you boldly announce on LKML "is RSDL an "unfair" scheduler too?" with 
some test case you whipped up to try and find fault with the design.

What am I supposed to think? Considering just how many problems I have 
addressed and tried to correct with RSDL successfully, I'm surprised that, 
despite your initial enthusiasm for it, you have spent the rest of the time 
trying to block it.

Please, either help me (and I'm in no shape to code at the moment despite what 
I have done so far), or say you have no intention of including it. I'm 
risking paralysis just by sitting at the computer right now so I'm dropping 
the code as is at the moment and will leave it up to your better judgement as 
to what to do with it.

-- 
-ck
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: is RSDL an "unfair" scheduler too?

2007-03-17 Thread Con Kolivas
On Saturday 17 March 2007 22:49, Ingo Molnar wrote:
> * Con Kolivas <[EMAIL PROTECTED]> wrote:
> > Despite the claims to the contrary, RSDL does not have _less_
> > heuristics, it does not have _any_. It's purely entitlement based.
>
> RSDL still has heuristics very much, but this time it's hardcoded into
> the design! Let me demonstrate this via a simple experiment.
>
> in the vanilla scheduler, the heuristics are ontop of a fairly basic
> (and fast) scheduler, they are plain visible and thus 'optional'. In
> RSDL, the heuristics are still present but more hidden and more
> engrained into the design.
>
> But it's easy to demonstrate this under RSDL: consider the following two
> scenarios, which implement precisely the same fundamental computing
> workload (everything running on the same, default nice 0 level):
>
> 1) a single task runs almost all the time and sleeps about 1 msec every
>100 msecs.
>
>[ run "while N=1; do N=1; done &" under bash to create such a
>  workload. ]
>
> 2) tasks are in a 'ring' where each runs for 100 msec, sleeps for 1
>msec and passes the 'token' around to the next task in the ring. (in
>essence every task will sleep 9900 msecs before getting another run)
>
>[ run http://redhat.com/~mingo/scheduler-patches/ring-test.c to
>  create this workload. If the 100 tasks default is too much for you
>  then you can run "./ring-test 10" - that will show similar effects.
>]
>
> Workload #1 uses 100% of CPU time. Workload #2 uses 99% of CPU time.
> They both do in essence the same thing.
>
> if RSDL had no heuristics at all then if i mixed #1 with #2, both
> workloads would get roughly 50%/50% of the CPU, right? (as happens if i
> mix #1 with #1 - both CPU-intense workloads get half of the CPU)
>
> in reality, in the 'ring workload' case, RSDL will only give about _5%_
> of CPU time to the #1 CPU-intense task, and will give 95% of CPU time to
> the #2 'ring' of tasks. So the distribution of timeslices is
> significantly unfair!
>
> Why? Because RSDL still has heuristics, just elsewhere and more hidden:
> in the "straightforward CPU intense task" case RSDL will 'penalize' the
> task by depleting its quota for running nearly all the time, in the
> "ring of tasks" case the 100 tasks will each run near their priority
> maximum, fed by 'major epoch' events of RSDL, thus they get 'rewarded'
> for seemingly sleeping alot and spreading things out. So RSDL has
> fundamental unfairness built in as well - it's just different from the
> vanilla scheduler.
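
(ring-test.c itself is not reproduced in this archive; a rough, hypothetical
re-creation of the workload described above, with NTASKS, BURN_MS and SLEEP_US
as illustrative constants, might look like this:)

/*
 * Hypothetical re-creation of the "ring" workload described above (it is
 * not Ingo's ring-test.c): NTASKS processes pass a token through pipes;
 * whoever holds the token burns CPU for ~100 ms, sleeps 1 ms, passes it on.
 */
#include <stdio.h>
#include <unistd.h>
#include <time.h>

#define NTASKS   10	/* like "./ring-test 10"; purely illustrative */
#define BURN_MS  100
#define SLEEP_US 1000

static void burn_cpu(int ms)
{
	struct timespec start, now;

	clock_gettime(CLOCK_MONOTONIC, &start);
	do {
		clock_gettime(CLOCK_MONOTONIC, &now);
	} while ((now.tv_sec - start.tv_sec) * 1000 +
		 (now.tv_nsec - start.tv_nsec) / 1000000 < ms);
}

int main(void)
{
	int pipes[NTASKS][2];
	char token = 't';
	int i, me = 0;

	for (i = 0; i < NTASKS; i++)
		if (pipe(pipes[i]) < 0) {
			perror("pipe");
			return 1;
		}

	for (i = 1; i < NTASKS; i++)
		if (fork() == 0) {
			me = i;			/* this child is ring member i */
			break;
		}

	if (me == 0)				/* the parent injects the token */
		write(pipes[1 % NTASKS][1], &token, 1);

	for (;;) {
		read(pipes[me][0], &token, 1);	/* block until the token arrives */
		burn_cpu(BURN_MS);		/* ~100 ms of pure CPU */
		usleep(SLEEP_US);		/* the 1 ms sleep */
		write(pipes[(me + 1) % NTASKS][1], &token, 1);
	}
}

Built with something like cc ring.c -o ring (older glibc may need -lrt) and
run next to a plain busy loop, it lets one compare the CPU shares the two
workloads end up with.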

We're obviously disagreeing on what heuristics are, so call it what you like.

You're simply cashing in on the deep pipes that do kernel work for other 
tasks. You know very well that I dropped the TASK_NONINTERACTIVE flag from 
RSDL, which checks whether tasks are waiting on pipes, and you're exploiting it. 
That's not the RSDL heuristics at work at all, but you're trying to make it 
look like it is the intrinsic RSDL system at work. Putting that flag back in 
is simple enough when I'm not drugged. You could have simply pointed that out 
instead of trying to make my code look responsible. 

For the moment I'll assume you're not simply trying to make my code look bad 
and that you thought there really was an intrinsic design problem, otherwise 
I'd really be unhappy with what was happening to me.

-- 
-ck
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


<    1   2   3   4   5   6   7   8   9   10   >