this is what you need

2018-02-13 Thread Peter Williams

Hi,

I wanted to check in with you, did you receive my email from last week?

I want to share a proven system with you.

This system allows you to try the whole thing for free for 30 days.

You can finally change your future without giving up any sensitive
information in advance.

I signed up myself just a while ago and I'm already making more than in
my regular nine-to-five job, which I plan on quitting any day now.

Despite this, it is probably the best thing that will ever happen to you
if you take action now.

Please reply if interested.

Thanks,
Peter



Re: [PATCH] sched: Rationalize sys_sched_rr_get_interval()

2007-10-16 Thread Peter Williams

Jarek Poplawski wrote:

On 16-10-2007 03:16, Peter Williams wrote:
...
I'd suggest that we modify sched_rr_get_interval() to return -EINVAL 
(with *interval set to zero) if the target task is not SCHED_RR.  That 
way we can save a lot of unnecessary code.  I'll work on a patch.

...

I like this idea! But, since this is a system call, maybe at least
something like an RFC would be nicer...


We would be just modifying the code to meet that specification so a 
patch would be OK.  Anyone who wants to comment will do so anyway :-).





Sorry for too harsh words.

I didn't consider them harsh.


So, I can't be mistaken for a rapper yet? I'll work on it...

Cheers,
Jarek P.

PS: Peter, for some unknown reason I don't receive your messages.
If you get back some errors from my side I'd be interested to see
it (alternative: jarkao2 at gmail.com).


I haven't seen any bounce notifications.  I've added the gmail address 
as a CC.


Peter
--
Peter Williams   [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] sched: Rationalize sys_sched_rr_get_interval()

2007-10-15 Thread Peter Williams

Jarek Poplawski wrote:

On 13-10-2007 03:29, Peter Williams wrote:

Jarek Poplawski wrote:

On 12-10-2007 00:23, Peter Williams wrote:
...
The reason I was going that route was for modularity (which helps 
when adding plugsched patches).  I'll submit a revised patch for 
consideration.

...

IMHO, it looks like modularity could suck here:


+static unsigned int default_timeslice_fair(struct task_struct *p)
+{
+	return NS_TO_JIFFIES(sysctl_sched_min_granularity);
+}

If it's needed for outside and sched_fair will use something else
(to avoid double conversion) this could be misleading. Shouldn't
this be kind of private and return something usable for the class
mainly?

This is supplying data for a system call, not something for internal use 
by the class.  As far as the sched_fair class is concerned this is just 
a (necessary - because it's needed by a system call) diversion.


So, now all is clear: this is the misleading case!


Why anything else than sched_fair should care about this?

sched_fair doesn't care, so if nothing else does, why do we even have 
sys_sched_rr_get_interval()?  Is this whole function an anachronism that 
can be expunged?  I'm assuming that the reason it exists is that there 
are user space programs that use this system call.  Am I correct in this 
assumption?  Personally, I can't think of anything it would be useful 
for other than satisfying curiosity.


Since this is for some special aim (not default for most classes, at
least not for sched_fair) I'd suggest to change names:
default_timeslice_fair() and .default_timeslice to something like eg.:
rr_timeslice_fair() and .rr_timeslice or rr_interval_fair() and
.rr_interval (maybe with this "default" before "rr_" if necessary).

On the other hand man (2) sched_rr_get_interval mentions that:
"The identified process should be running under the SCHED_RR
scheduling policy".

Also this place seems to describe something simpler:
http://www.gnu.org/software/libc/manual/html_node/Basic-Scheduling-Functions.html

So, I still doubt sched_fair's "notion" of timeslices should be
necessary here.


As do I.  Even more so now that you've shown me the man page for 
sched_rr_get_interval().


I'd suggest that we modify sched_rr_get_interval() to return -EINVAL 
(with *interval set to zero) if the target task is not SCHED_RR.  That 
way we can save a lot of unnecessary code.  I'll work on a patch. 
Unless you want to do it?
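
A rough sketch of what that could look like inside sys_sched_rr_get_interval()
(illustration only, borrowing names already visible in the patch above; not an
actual patch):

	/* Sketch: a task without the SCHED_RR policy has no RR quantum. */
	if (p->policy != SCHED_RR) {
		struct timespec t = { 0, 0 };

		read_unlock(&tasklist_lock);
		retval = copy_to_user(interval, &t, sizeof(t)) ? -EFAULT : -EINVAL;
		goto out_nounlock;
	}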




Sorry for too harsh words.


I didn't consider them harsh.

Peter
--
Peter Williams   [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] sched: Rationalize sys_sched_rr_get_interval()

2007-10-12 Thread Peter Williams

Jarek Poplawski wrote:

On 12-10-2007 00:23, Peter Williams wrote:
...
The reason I was going that route was for modularity (which helps when 
adding plugsched patches).  I'll submit a revised patch for consideration.

...

IMHO, it looks like modularity could suck here:


+static unsigned int default_timeslice_fair(struct task_struct *p)
+{
+   return NS_TO_JIFFIES(sysctl_sched_min_granularity);
+}


If it's needed for outside and sched_fair will use something else
(to avoid double conversion) this could be misleading. Shouldn't
this be kind of private and return something usable for the class
mainly?


This is supplying data for a system call, not something for internal use 
by the class.  As far as the sched_fair class is concerned this is just 
a (necessary - because it's needed by a system call) diversion.



Why anything else than sched_fair should care about this?


sched_fair doesn't care, so if nothing else does, why do we even have 
sys_sched_rr_get_interval()?  Is this whole function an anachronism that 
can be expunged?  I'm assuming that the reason it exists is that there 
are user space programs that use this system call.  Am I correct in this 
assumption?  Personally, I can't think of anything it would be useful 
for other than satisfying curiosity.
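
For what it's worth, a user space caller would look something like the
following (a minimal sketch; the wrapper is declared in <sched.h> and a pid
of 0 means the calling process):

#include <sched.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
	struct timespec ts;

	/* Ask the kernel for the caller's round-robin quantum. */
	if (sched_rr_get_interval(0, &ts) == -1) {
		perror("sched_rr_get_interval");
		return 1;
	}
	printf("RR quantum: %ld.%09ld s\n", (long)ts.tv_sec, ts.tv_nsec);
	return 0;
}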


Peter
--
Peter Williams   [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] sched: Rationalize sys_sched_rr_get_interval()

2007-10-11 Thread Peter Williams

Dmitry Adamushko wrote:

On 11/10/2007, Ingo Molnar <[EMAIL PROTECTED]> wrote:

* Peter Williams <[EMAIL PROTECTED]> wrote:


-#define MIN_TIMESLICE		max(5 * HZ / 1000, 1)
-#define DEF_TIMESLICE		(100 * HZ / 1000)

hm, this got removed by Dmitry quite some time ago. Could you please do
this patch against the sched-devel git tree:


here is the commit:
http://git.kernel.org/?p=linux/kernel/git/mingo/linux-2.6-sched-devel.git;a=commit;h=dd3fec36addd1bf76b05225b7e483378b80c3f9e

I had also considered introducing smth like
sched_class::task_timeslice() but decided it was not worth it.


The reason I was going that route was for modularity (which helps when 
adding plugsched patches).  I'll submit a revised patch for consideration.


Peter
--
Peter Williams   [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] sched: Rationalize sys_sched_rr_get_interval()

2007-10-11 Thread Peter Williams
At the moment, static_prio_timeslice() is only used in 
sys_sched_rr_get_interval() and only gives the correct result for 
SCHED_FIFO and SCHED_RR tasks as the time slice for normal tasks is 
unrelated to the values returned by static_prio_timeslice().


This patch addresses this problem and in the process moves all the code 
associated with static_prio_timeslice() to sched_rt.c which is the only 
place where it now has relevance.


Signed-off-by: Peter Williams <[EMAIL PROTECTED]>

Peter
--
Peter Williams   [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce




diff -r 3df82b0661ca include/linux/sched.h
--- a/include/linux/sched.h	Mon Sep 03 12:06:59 2007 +1000
+++ b/include/linux/sched.h	Mon Sep 03 12:06:59 2007 +1000
@@ -878,6 +878,7 @@ struct sched_class {
 	void (*set_curr_task) (struct rq *rq);
 	void (*task_tick) (struct rq *rq, struct task_struct *p);
 	void (*task_new) (struct rq *rq, struct task_struct *p);
+	unsigned int (*default_timeslice) (struct task_struct *p);
 };
 
 struct load_weight {
diff -r 3df82b0661ca kernel/sched.c
--- a/kernel/sched.c	Mon Sep 03 12:06:59 2007 +1000
+++ b/kernel/sched.c	Mon Sep 03 12:06:59 2007 +1000
@@ -101,16 +101,6 @@ unsigned long long __attribute__((weak))
 #define NICE_0_LOAD		SCHED_LOAD_SCALE
 #define NICE_0_SHIFT		SCHED_LOAD_SHIFT
 
-/*
- * These are the 'tuning knobs' of the scheduler:
- *
- * Minimum timeslice is 5 msecs (or 1 jiffy, whichever is larger),
- * default timeslice is 100 msecs, maximum timeslice is 800 msecs.
- * Timeslices get refilled after they expire.
- */
-#define MIN_TIMESLICE		max(5 * HZ / 1000, 1)
-#define DEF_TIMESLICE		(100 * HZ / 1000)
-
 #ifdef CONFIG_SMP
 /*
  * Divide a load by a sched group cpu_power : (load / sg->__cpu_power)
@@ -131,24 +121,6 @@ static inline void sg_inc_cpu_power(stru
 	sg->reciprocal_cpu_power = reciprocal_value(sg->__cpu_power);
 }
 #endif
-
-#define SCALE_PRIO(x, prio) \
-	max(x * (MAX_PRIO - prio) / (MAX_USER_PRIO / 2), MIN_TIMESLICE)
-
-/*
- * static_prio_timeslice() scales user-nice values [ -20 ... 0 ... 19 ]
- * to time slice values: [800ms ... 100ms ... 5ms]
- */
-static unsigned int static_prio_timeslice(int static_prio)
-{
-	if (static_prio == NICE_TO_PRIO(19))
-		return 1;
-
-	if (static_prio < NICE_TO_PRIO(0))
-		return SCALE_PRIO(DEF_TIMESLICE * 4, static_prio);
-	else
-		return SCALE_PRIO(DEF_TIMESLICE, static_prio);
-}
 
 static inline int rt_policy(int policy)
 {
@@ -4784,8 +4756,7 @@ long sys_sched_rr_get_interval(pid_t pid
 	if (retval)
 		goto out_unlock;
 
-	jiffies_to_timespec(p->policy == SCHED_FIFO ?
-				0 : static_prio_timeslice(p->static_prio), &t);
+	jiffies_to_timespec(p->sched_class->default_timeslice(p), &t);
 	read_unlock(&tasklist_lock);
 	retval = copy_to_user(interval, &t, sizeof(t)) ? -EFAULT : 0;
 out_nounlock:
diff -r 3df82b0661ca kernel/sched_fair.c
--- a/kernel/sched_fair.c	Mon Sep 03 12:06:59 2007 +1000
+++ b/kernel/sched_fair.c	Mon Sep 03 12:06:59 2007 +1000
@@ -1159,6 +1159,11 @@ static void set_curr_task_fair(struct rq
 }
 #endif
 
+static unsigned int default_timeslice_fair(struct task_struct *p)
+{
+	return NS_TO_JIFFIES(sysctl_sched_min_granularity);
+}
+
 /*
  * All the scheduling class methods:
  */
@@ -1180,6 +1185,7 @@ struct sched_class fair_sched_class __re
 	.set_curr_task  = set_curr_task_fair,
 	.task_tick		= task_tick_fair,
 	.task_new		= task_new_fair,
+	.default_timeslice	= default_timeslice_fair,
 };
 
 #ifdef CONFIG_SCHED_DEBUG
diff -r 3df82b0661ca kernel/sched_idletask.c
--- a/kernel/sched_idletask.c	Mon Sep 03 12:06:59 2007 +1000
+++ b/kernel/sched_idletask.c	Mon Sep 03 12:06:59 2007 +1000
@@ -59,6 +59,11 @@ static void task_tick_idle(struct rq *rq
 {
 }
 
+static unsigned int default_timeslice_idle(struct task_struct *p)
+{
+	return 0;
+}
+
 /*
  * Simple, special scheduling class for the per-CPU idle tasks:
  */
@@ -80,4 +85,5 @@ static struct sched_class idle_sched_cla
 
 	.task_tick		= task_tick_idle,
 	/* no .task_new for idle tasks */
+	.default_timeslice	= default_timeslice_idle,
 };
diff -r 3df82b0661ca kernel/sched_rt.c
--- a/kernel/sched_rt.c	Mon Sep 03 12:06:59 2007 +1000
+++ b/kernel/sched_rt.c	Mon Sep 03 12:06:59 2007 +1000
@@ -205,6 +205,34 @@ move_one_task_rt(struct rq *this_rq, int
 }
 #endif
 
+/*
+ * These are the 'tuning knobs' of the scheduler:
+ *
+ * Minimum timeslice is 5 msecs (or 1 jiffy, whichever is larger),
+ * default timeslice is 100 msecs, maximum timeslice is 800 msecs.
+ * Timeslices get refilled after they expire.
+ */
+#define MIN_TIMESLICE		max(5 * HZ / 1000, 1)
+#define DEF_TIMESLICE		(100 * HZ / 1000)
+
+#define SCALE_PRIO(x, prio) \
+	max(x * (MAX_PRIO - prio) / (MAX_USER_PRIO / 2), MIN_TIMESLICE)
+
+/*
+ * static_prio_timeslice() scales user-nice values [ -20 ... 0 ... 19 ]
+ * to time slice values: [800ms ... 100ms ... 5ms]
+ */
+static unsigned int static_prio_timeslice(int static_prio)
+{
+	if (static_prio == NICE_TO_PRIO

[PATCH] sched: Exclude SMP code from non SMP builds

2007-10-11 Thread Peter Williams
At the moment, a lot of load balancing code that is irrelevant to non 
SMP systems gets included during non SMP builds.


This patch addresses this issue and should reduce the binary size on non 
SMP systems.


This patch assumes that the "sched: Reduce overhead in balance_tasks()" 
(non urgent) patch that I sent on the 15th of August has been applied.


Signed-off-by: Peter Williams <[EMAIL PROTECTED]>

Peter
--
Peter Williams   [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce



diff -r df69cb019596 include/linux/sched.h
--- a/include/linux/sched.h	Thu Aug 16 12:12:18 2007 +1000
+++ b/include/linux/sched.h	Fri Aug 17 13:54:28 2007 +1000
@@ -864,6 +864,7 @@ struct sched_class {
 	struct task_struct * (*pick_next_task) (struct rq *rq);
 	void (*put_prev_task) (struct rq *rq, struct task_struct *p);
 
+#ifdef CONFIG_SMP
 	unsigned long (*load_balance) (struct rq *this_rq, int this_cpu,
 			struct rq *busiest, unsigned long max_load_move,
 			struct sched_domain *sd, enum cpu_idle_type idle,
@@ -872,6 +873,7 @@ struct sched_class {
 	int (*move_one_task) (struct rq *this_rq, int this_cpu,
 			  struct rq *busiest, struct sched_domain *sd,
 			  enum cpu_idle_type idle);
+#endif
 
 	void (*set_curr_task) (struct rq *rq);
 	void (*task_tick) (struct rq *rq, struct task_struct *p);
diff -r df69cb019596 kernel/sched.c
--- a/kernel/sched.c	Thu Aug 16 12:12:18 2007 +1000
+++ b/kernel/sched.c	Fri Aug 17 16:03:11 2007 +1000
@@ -764,23 +764,6 @@ iter_move_one_task(struct rq *this_rq, i
 iter_move_one_task(struct rq *this_rq, int this_cpu, struct rq *busiest,
 		   struct sched_domain *sd, enum cpu_idle_type idle,
 		   struct rq_iterator *iterator);
-#else
-static inline unsigned long
-balance_tasks(struct rq *this_rq, int this_cpu, struct rq *busiest,
-	  unsigned long max_load_move, struct sched_domain *sd,
-	  enum cpu_idle_type idle, int *all_pinned,
-	  int *this_best_prio, struct rq_iterator *iterator)
-{
-	return 0;
-}
-
-static inline int
-iter_move_one_task(struct rq *this_rq, int this_cpu, struct rq *busiest,
-		   struct sched_domain *sd, enum cpu_idle_type idle,
-		   struct rq_iterator *iterator)
-{
-	return 0;
-}
 #endif
 
 #include "sched_stats.h"
diff -r df69cb019596 kernel/sched_fair.c
--- a/kernel/sched_fair.c	Thu Aug 16 12:12:18 2007 +1000
+++ b/kernel/sched_fair.c	Fri Aug 17 16:00:21 2007 +1000
@@ -887,6 +887,7 @@ static void put_prev_task_fair(struct rq
 	}
 }
 
+#ifdef CONFIG_SMP
 /**
  * Fair scheduling class load-balancing methods:
  */
@@ -1004,6 +1005,7 @@ move_one_task_fair(struct rq *this_rq, i
 
 	return 0;
 }
+#endif
 
 /*
  * scheduler tick hitting a task of our scheduling class:
@@ -1090,8 +1092,10 @@ struct sched_class fair_sched_class __re
 	.pick_next_task		= pick_next_task_fair,
 	.put_prev_task		= put_prev_task_fair,
 
+#ifdef CONFIG_SMP
 	.load_balance		= load_balance_fair,
 	.move_one_task		= move_one_task_fair,
+#endif
 
 	.set_curr_task  = set_curr_task_fair,
 	.task_tick		= task_tick_fair,
diff -r df69cb019596 kernel/sched_idletask.c
--- a/kernel/sched_idletask.c	Thu Aug 16 12:12:18 2007 +1000
+++ b/kernel/sched_idletask.c	Fri Aug 17 15:58:59 2007 +1000
@@ -37,6 +37,7 @@ static void put_prev_task_idle(struct rq
 {
 }
 
+#ifdef CONFIG_SMP
 static unsigned long
 load_balance_idle(struct rq *this_rq, int this_cpu, struct rq *busiest,
 		  unsigned long max_load_move,
@@ -52,6 +53,7 @@ move_one_task_idle(struct rq *this_rq, i
 {
 	return 0;
 }
+#endif
 
 static void task_tick_idle(struct rq *rq, struct task_struct *curr)
 {
@@ -71,8 +73,10 @@ static struct sched_class idle_sched_cla
 	.pick_next_task		= pick_next_task_idle,
 	.put_prev_task		= put_prev_task_idle,
 
+#ifdef CONFIG_SMP
 	.load_balance		= load_balance_idle,
 	.move_one_task		= move_one_task_idle,
+#endif
 
 	.task_tick		= task_tick_idle,
 	/* no .task_new for idle tasks */
diff -r df69cb019596 kernel/sched_rt.c
--- a/kernel/sched_rt.c	Thu Aug 16 12:12:18 2007 +1000
+++ b/kernel/sched_rt.c	Fri Aug 17 15:53:20 2007 +1000
@@ -98,6 +98,7 @@ static void put_prev_task_rt(struct rq *
 	p->se.exec_start = 0;
 }
 
+#ifdef CONFIG_SMP
 /*
  * Load-balancing iterator. Note: while the runqueue stays locked
  * during the whole iteration, the current task might be
@@ -202,6 +203,7 @@ move_one_task_rt(struct rq *this_rq, int
 	return iter_move_one_task(this_rq, this_cpu, busiest, sd, idle,
   &rt_rq_iterator);
 }
+#endif
 
 static void task_tick_rt(struct rq *rq, struct task_struct *p)
 {
@@ -232,8 +234,10 @@ static struct sched_class rt_sched_class
 	.pick_next_task		= pick_next_task_rt,
 	.put_prev_task		= put_prev_task_rt,
 
+#ifdef CONFIG_SMP
 	.load_balance		= load_balance_rt,
 	.move_one_task		= move_one_task_rt,
+#endif
 
 	.task_tick		= task_tick_rt,
 };


Re: [PATCH] sched: Reduce overhead in balance_tasks()

2007-08-24 Thread Peter Williams

Ingo Molnar wrote:

* Peter Williams <[EMAIL PROTECTED]> wrote:

At the moment, balance_tasks() provides low level functionality for 
both
 move_tasks() and move_one_task() (indirectly) via the load_balance() 
function (in the sched_class interface) which also provides dual 
functionality.  This dual functionality complicates the interfaces and 
internal mechanisms and increases the run time overhead of operations that 
are called with two run queue locks held.


This patch addresses this issue and reduces the overhead of these 
operations.


hm, i like it, and added it to my queue (probably .24 material though), 
but note that it increases .text and .data overhead:


   text    data     bss     dec     hex filename
  41028   37794    2168   80990   13c5e sched.o.before
  41349   37826    2168   81343   13dbf sched.o.after

is that expected?


Yes, sort of.  It's a trade-off of space for time and I expected an 
increase (although I didn't think that it would be quite that much). 
But it's still less than 1% and, since the time saved is time when two 
run queue locks are held, I figure that it's a trade worth making.  Also, 
this separation lays the groundwork for a clean-up of the active load 
balancing code, which should gain some space, including making it possible 
to exclude active load balancing on systems that don't need it (i.e. 
those that don't have multiple multi-core/hyperthreading packages).
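
For concreteness, the numbers above work out to:

  .text: 41349 - 41028 = 321 bytes
  .data: 37826 - 37794 =  32 bytes
  total: 81343 - 80990 = 353 bytes, i.e. roughly 0.44% of 80990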


I've got a patch available that reduces the .text and .data for non SMP 
systems by excluding the load balancing stuff (that has crept into those 
systems) so that should help on embedded systems where memory is tight.


Peter
--
Peter Williams   [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] sched: Reduce overhead in balance_tasks()

2007-08-15 Thread Peter Williams
At the moment, balance_tasks() provides low level functionality for both 
 move_tasks() and move_one_task() (indirectly) via the load_balance() 
function (in the sched_class interface) which also provides dual 
functionality.  This dual functionality complicates the interfaces and 
internal mechanisms and increases the run time overhead of operations that 
are called with two run queue locks held.


This patch addresses this issue and reduces the overhead of these 
operations.


This patch is not urgent and can be held back until the next merge 
window without compromising the safety of the kernel.


Signed-off-by: Peter Williams <[EMAIL PROTECTED]>

Peter
--
Peter Williams   [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
   -- Ambrose Bierce


diff -r 90691a597f06 include/linux/sched.h
--- a/include/linux/sched.h	Mon Aug 13 15:06:35 2007 +
+++ b/include/linux/sched.h	Tue Aug 14 11:11:47 2007 +1000
@@ -865,10 +865,13 @@ struct sched_class {
 	void (*put_prev_task) (struct rq *rq, struct task_struct *p);
 
 	unsigned long (*load_balance) (struct rq *this_rq, int this_cpu,
-			struct rq *busiest,
-			unsigned long max_nr_move, unsigned long max_load_move,
+			struct rq *busiest, unsigned long max_load_move,
 			struct sched_domain *sd, enum cpu_idle_type idle,
 			int *all_pinned, int *this_best_prio);
+
+	int (*move_one_task) (struct rq *this_rq, int this_cpu,
+			  struct rq *busiest, struct sched_domain *sd,
+			  enum cpu_idle_type idle);
 
 	void (*set_curr_task) (struct rq *rq);
 	void (*task_tick) (struct rq *rq, struct task_struct *p);
diff -r 90691a597f06 kernel/sched.c
--- a/kernel/sched.c	Mon Aug 13 15:06:35 2007 +
+++ b/kernel/sched.c	Tue Aug 14 16:26:24 2007 +1000
@@ -753,11 +753,35 @@ struct rq_iterator {
 	struct task_struct *(*next)(void *);
 };
 
-static int balance_tasks(struct rq *this_rq, int this_cpu, struct rq *busiest,
-		  unsigned long max_nr_move, unsigned long max_load_move,
-		  struct sched_domain *sd, enum cpu_idle_type idle,
-		  int *all_pinned, unsigned long *load_moved,
-		  int *this_best_prio, struct rq_iterator *iterator);
+#ifdef CONFIG_SMP
+static unsigned long
+balance_tasks(struct rq *this_rq, int this_cpu, struct rq *busiest,
+	  unsigned long max_load_move, struct sched_domain *sd,
+	  enum cpu_idle_type idle, int *all_pinned,
+	  int *this_best_prio, struct rq_iterator *iterator);
+
+static int
+iter_move_one_task(struct rq *this_rq, int this_cpu, struct rq *busiest,
+		   struct sched_domain *sd, enum cpu_idle_type idle,
+		   struct rq_iterator *iterator);
+#else
+static inline unsigned long
+balance_tasks(struct rq *this_rq, int this_cpu, struct rq *busiest,
+	  unsigned long max_load_move, struct sched_domain *sd,
+	  enum cpu_idle_type idle, int *all_pinned,
+	  int *this_best_prio, struct rq_iterator *iterator)
+{
+	return 0;
+}
+
+static inline int
+iter_move_one_task(struct rq *this_rq, int this_cpu, struct rq *busiest,
+		   struct sched_domain *sd, enum cpu_idle_type idle,
+		   struct rq_iterator *iterator)
+{
+	return 0;
+}
+#endif
 
 #include "sched_stats.h"
 #include "sched_rt.c"
@@ -2166,17 +2190,17 @@ int can_migrate_task(struct task_struct 
 	return 1;
 }
 
-static int balance_tasks(struct rq *this_rq, int this_cpu, struct rq *busiest,
-		  unsigned long max_nr_move, unsigned long max_load_move,
-		  struct sched_domain *sd, enum cpu_idle_type idle,
-		  int *all_pinned, unsigned long *load_moved,
-		  int *this_best_prio, struct rq_iterator *iterator)
+static unsigned long
+balance_tasks(struct rq *this_rq, int this_cpu, struct rq *busiest,
+	  unsigned long max_load_move, struct sched_domain *sd,
+	  enum cpu_idle_type idle, int *all_pinned,
+	  int *this_best_prio, struct rq_iterator *iterator)
 {
 	int pulled = 0, pinned = 0, skip_for_load;
 	struct task_struct *p;
 	long rem_load_move = max_load_move;
 
-	if (max_nr_move == 0 || max_load_move == 0)
+	if (max_load_move == 0)
 		goto out;
 
 	pinned = 1;
@@ -2209,7 +2233,7 @@ next:
 	 * We only want to steal up to the prescribed number of tasks
 	 * and the prescribed amount of weighted load.
 	 */
-	if (pulled < max_nr_move && rem_load_move > 0) {
+	if (rem_load_move > 0) {
 		if (p->prio < *this_best_prio)
 			*this_best_prio = p->prio;
 		p = iterator->next(iterator->arg);
@@ -2217,7 +2241,7 @@ next:
 	}
 out:
 	/*
-	 * Right now, this is the only place pull_task() is called,
+	 * Right now, this is one of only two places pull_task() is called,
 	 * so we can safely collect pull_task() stats here rather than
 	 * inside pull_task().
 	 */
@@ -2225,8 +2249,8 @@ out:
 
 	if (all_pinned)
 		*all_pinned = pinned;
-	*load_moved = max_load_move - rem_load_move;
-	return pulled;
+
+	return max_load_move - rem_load_move;
 }
 
 /*
@@ -2248,7 +2272,7 @@ static int move_tasks(struct rq *this_rq
 	do {
 		total_load_moved +=
 			class->load_balance(this_rq, this_cpu, busiest

[PATCH] sched: Fix bug in balance_tasks()

2007-08-06 Thread Peter Williams

There are two problems with balance_tasks() and how it is used:

1. The variables best_prio and best_prio_seen (inherited from the old 
move_tasks()) were only required to handle problems caused by the 
active/expired arrays, the order in which they were processed and the 
possibility that the task with the highest priority could be on either. 
 These issues are no longer present and the extra overhead associated 
with their use is unnecessary (and possibly wrong).


2. In the absence of CONFIG_FAIR_GROUP_SCHED being set, the same 
this_best_prio variable needs to be used by all scheduling classes or 
there is a risk of moving too much load.  E.g. if the highest priority 
task on this_rq at the beginning is a fairly low priority task and the rt 
class migrates a task (during its turn) then that moved task becomes the 
new highest priority task on this_rq but when the sched_fair class 
initializes its copy of this_best_prio it will get the priority of the 
original highest priority task as, due to the run queue locks being 
held, the reschedule triggered by pull_task() will not have taken place. 
 This could result in inappropriate overriding of skip_for_load and 
excessive load being moved.


The attached patch addresses these problems by deleting all references to 
best_prio and best_prio_seen and making this_best_prio a reference 
parameter to the various functions involved.


load_balance_fair() has also been modified so that this_best_prio is 
only reset (in the loop) if CONFIG_FAIR_GROUP_SCHED is set.  This should 
preserve the effect of helping spread groups' higher priority tasks 
around the available CPUs while improving system performance when 
CONFIG_FAIR_GROUP_SCHED isn't set.


Signed-off-by: Peter Williams <[EMAIL PROTECTED]>

Peter
--
Peter Williams   [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

diff -r c39ddf75cd08 include/linux/sched.h
--- a/include/linux/sched.h	Mon Aug 06 16:08:52 2007 +1000
+++ b/include/linux/sched.h	Mon Aug 06 16:13:20 2007 +1000
@@ -870,7 +870,7 @@ struct sched_class {
 			struct rq *busiest,
 			unsigned long max_nr_move, unsigned long max_load_move,
 			struct sched_domain *sd, enum cpu_idle_type idle,
-			int *all_pinned);
+			int *all_pinned, int *this_best_prio);
 
 	void (*set_curr_task) (struct rq *rq);
 	void (*task_tick) (struct rq *rq, struct task_struct *p);
diff -r c39ddf75cd08 kernel/sched.c
--- a/kernel/sched.c	Mon Aug 06 16:08:52 2007 +1000
+++ b/kernel/sched.c	Mon Aug 06 16:52:59 2007 +1000
@@ -745,8 +745,7 @@ static int balance_tasks(struct rq *this
 		  unsigned long max_nr_move, unsigned long max_load_move,
 		  struct sched_domain *sd, enum cpu_idle_type idle,
 		  int *all_pinned, unsigned long *load_moved,
-		  int this_best_prio, int best_prio, int best_prio_seen,
-		  struct rq_iterator *iterator);
+		  int *this_best_prio, struct rq_iterator *iterator);
 
 #include "sched_stats.h"
 #include "sched_rt.c"
@@ -2166,8 +2165,7 @@ static int balance_tasks(struct rq *this
 		  unsigned long max_nr_move, unsigned long max_load_move,
 		  struct sched_domain *sd, enum cpu_idle_type idle,
 		  int *all_pinned, unsigned long *load_moved,
-		  int this_best_prio, int best_prio, int best_prio_seen,
-		  struct rq_iterator *iterator)
+		  int *this_best_prio, struct rq_iterator *iterator)
 {
 	int pulled = 0, pinned = 0, skip_for_load;
 	struct task_struct *p;
@@ -2192,12 +2190,8 @@ next:
 	 */
 	skip_for_load = (p->se.load.weight >> 1) > rem_load_move +
 			 SCHED_LOAD_SCALE_FUZZ;
-	if (skip_for_load && p->prio < this_best_prio)
-		skip_for_load = !best_prio_seen && p->prio == best_prio;
-	if (skip_for_load ||
+	if ((skip_for_load && p->prio >= *this_best_prio) ||
 	!can_migrate_task(p, busiest, this_cpu, sd, idle, &pinned)) {
-
-		best_prio_seen |= p->prio == best_prio;
 		p = iterator->next(iterator->arg);
 		goto next;
 	}
@@ -2211,8 +2205,8 @@ next:
 	 * and the prescribed amount of weighted load.
 	 */
 	if (pulled < max_nr_move && rem_load_move > 0) {
-		if (p->prio < this_best_prio)
-			this_best_prio = p->prio;
+		if (p->prio < *this_best_prio)
+			*this_best_prio = p->prio;
 		p = iterator->next(iterator->arg);
 		goto next;
 	}
@@ -2244,12 +2238,13 @@ static int move_tasks(struct rq *this_rq
 {
 	struct sched_class *class = sched_class_highest;
 	unsigned long total_load_moved = 0;
+	int this_best_prio = this_rq->curr->prio;
 
 	do {
 		total_load_moved +=
 			class->load_balance(this_rq, this_cpu, busiest,
 ULONG_MAX, max_load_move - total_load_moved,
-sd, idle, all_pinned);
+sd, idle, all_pinned, &this_best_prio);
 		class = class->next;
 	} while (class && max_load_move > total_load_moved);
 
@@ -2267,10 +2262,12 @@ static int move_one_task(struct rq *this
 			 struct sched_domain *sd, enum cpu_idle_type idle)
 {
 	struct sched_class *class;
+	int this_best_prio = MAX_PRIO;
 
 	for (class = sched_class_highest; class; class = class->next

Possible error in 2.6.23-rc2-rt1 series

2007-08-05 Thread Peter Williams
I've just been reviewing these patches and have spotted a possible
error in the file arch/ia64/kernel/time.c in that the scope of the
#ifdef on CONFIG_TIME_INTERPOLATION seems to have grown quite a lot
since 2.6.23-rc1-rt7.  It used to chop out one if statement and now it
chops out half the file.

Is it correct?
Peter
-- 
Peter Williams   [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] sched: Simplify move_tasks()

2007-08-03 Thread Peter Williams
The move_tasks() function is currently multiplexed with two distinct 
capabilities:


1. attempt to move a specified amount of weighted load from one run 
queue to another; and
2. attempt to move a specified number of tasks from one run queue to 
another.


The first of these capabilities is used in two places, load_balance() 
and load_balance_idle(), and in both of these cases the return value of 
move_tasks() is used purely to decide if tasks/load were moved and no 
notice of the actual number of tasks moved is taken.


The second capability is used in exactly one place, 
active_load_balance(), to attempt to move exactly one task and, as 
before, the return value is only used as an indicator of success or failure.


This multiplexing of move_tasks() was introduced, by me, as part of the 
smpnice patches and was motivated by the fact that the alternative, one 
function to move specified load and one to move a single task, would 
have led to two functions of roughly the same complexity as the old 
move_tasks() (or the new balance_tasks()).  However, the new modular 
design of the new CFS scheduler allows a simpler solution to be adopted 
and this patch addresses that solution by:


1. adding a new function, move_one_task(), to be used by 
active_load_balance(); and
2. making move_tasks() a single purpose function that tries to move a 
specified weighted load and returns 1 for success and 0 for failure.


One of the consequences of these changes is that neither move_one_task()
nor the new move_tasks() cares how many tasks sched_class.load_balance() 
moves and this enables its interface to be simplified by returning the 
amount of load moved as its result and removing the load_moved pointer 
from the argument list.  This helps simplify the new move_tasks() and 
slightly reduces the amount of work done in each of 
sched_class.load_balance()'s implementations.


Further simplifications, e.g. changes to balance_tasks(), are possible 
but (slightly) complicated by the special needs of load_balance_fair() 
so I've left them to a later patch (if this one gets accepted).


NB Since move_tasks() gets called with two run queue locks held even 
small reductions in overhead are worthwhile.


Signed-off-by: Peter Williams <[EMAIL PROTECTED]>

--
Peter Williams   [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce
diff -r b97e7dab8f7b include/linux/sched.h
--- a/include/linux/sched.h	Thu Aug 02 14:08:53 2007 -0700
+++ b/include/linux/sched.h	Fri Aug 03 15:56:41 2007 +1000
@@ -866,11 +866,11 @@ struct sched_class {
 	struct task_struct * (*pick_next_task) (struct rq *rq, u64 now);
 	void (*put_prev_task) (struct rq *rq, struct task_struct *p, u64 now);
 
-	int (*load_balance) (struct rq *this_rq, int this_cpu,
+	unsigned long (*load_balance) (struct rq *this_rq, int this_cpu,
 			struct rq *busiest,
 			unsigned long max_nr_move, unsigned long max_load_move,
 			struct sched_domain *sd, enum cpu_idle_type idle,
-			int *all_pinned, unsigned long *total_load_moved);
+			int *all_pinned);
 
 	void (*set_curr_task) (struct rq *rq);
 	void (*task_tick) (struct rq *rq, struct task_struct *p);
diff -r b97e7dab8f7b kernel/sched.c
--- a/kernel/sched.c	Thu Aug 02 14:08:53 2007 -0700
+++ b/kernel/sched.c	Sat Aug 04 10:06:42 2007 +1000
@@ -2231,32 +2231,49 @@ out:
 }
 
 /*
- * move_tasks tries to move up to max_nr_move tasks and max_load_move weighted
- * load from busiest to this_rq, as part of a balancing operation within
- * "domain". Returns the number of tasks moved.
+ * move_tasks tries to move up to max_load_move weighted load from busiest to
+ * this_rq, as part of a balancing operation within domain "sd".
+ * Returns 1 if successful and 0 otherwise.
  *
  * Called with both runqueues locked.
  */
 static int move_tasks(struct rq *this_rq, int this_cpu, struct rq *busiest,
-		  unsigned long max_nr_move, unsigned long max_load_move,
+		  unsigned long max_load_move,
 		  struct sched_domain *sd, enum cpu_idle_type idle,
 		  int *all_pinned)
 {
 	struct sched_class *class = sched_class_highest;
-	unsigned long load_moved, total_nr_moved = 0, nr_moved;
-	long rem_load_move = max_load_move;
+	unsigned long total_load_moved = 0;
 
 	do {
-		nr_moved = class->load_balance(this_rq, this_cpu, busiest,
-max_nr_move, (unsigned long)rem_load_move,
-sd, idle, all_pinned, &load_moved);
-		total_nr_moved += nr_moved;
-		max_nr_move -= nr_moved;
-		rem_load_move -= load_moved;
+		total_load_moved +=
+			class->load_balance(this_rq, this_cpu, busiest,
+ULONG_MAX, max_load_move - total_load_moved,
+sd, idle, all_pinned);
 		class = class->next;
-	} while (class && max_nr_move && rem_load_move > 0);
-
-	return total_nr_moved;
+	} while (class && max_load_move > total_load_moved);
+
+	return total_load_moved > 0;
+}
+
+/*
+ * move_one_task tries to move exactly one task from busiest to this_rq, as
+ * part of active balancing operations within domain
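
(The hunk above is cut short in the archive.  For anyone following the
description rather than the diff, here is a rough sketch of the shape
move_one_task() ends up with -- reconstructed from the text above, not
copied from the patch, so treat the details as approximate:)

/*
 * Sketch only: walk the classes from highest priority downwards and stop
 * as soon as one of them manages to migrate a single task.  The 1 and
 * ULONG_MAX limits mirror the max_nr_move/max_load_move interface shown
 * in the hunks above; all_pinned is not needed here so NULL is passed.
 */
static int move_one_task(struct rq *this_rq, int this_cpu, struct rq *busiest,
			 struct sched_domain *sd, enum cpu_idle_type idle)
{
	struct sched_class *class;

	for (class = sched_class_highest; class; class = class->next)
		if (class->load_balance(this_rq, this_cpu, busiest,
					1, ULONG_MAX, sd, idle, NULL))
			return 1;

	return 0;
}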

[PATCH] Tidy up left over smpnice code after changes introduced with CFS

2007-08-02 Thread Peter Williams
1. The only place that RTPRIO_TO_LOAD_WEIGHT() is used is in the call to 
move_tasks() in the function active_load_balance() and its purpose here 
is just to make sure that the load to be moved is big enough to ensure 
that exactly one task is moved (if there's one available).  Since the 
single-task limit is actually enforced by the max_nr_move argument of 1, 
the load argument only needs to be large enough not to get in the way, 
so ULONG_MAX can be used instead and this allows 
RTPRIO_TO_LOAD_WEIGHT() to be deleted.


2. This, in turn, allows PRIO_TO_LOAD_WEIGHT() to be deleted.

3. This allows load_weight() to be deleted which allows 
TIME_SLICE_NICE_ZERO to be deleted along with the comment above it.


Signed-off-by: Peter Williams <[EMAIL PROTECTED]>

--
Peter Williams   [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce
diff -r 622a128d084b kernel/sched.c
--- a/kernel/sched.c	Mon Jul 30 21:54:37 2007 -0700
+++ b/kernel/sched.c	Thu Aug 02 16:21:19 2007 +1000
@@ -727,19 +727,6 @@ static void update_curr_load(struct rq *
  * slice expiry etc.
  */
 
-/*
- * Assume: static_prio_timeslice(NICE_TO_PRIO(0)) == DEF_TIMESLICE
- * If static_prio_timeslice() is ever changed to break this assumption then
- * this code will need modification
- */
-#define TIME_SLICE_NICE_ZERO DEF_TIMESLICE
-#define load_weight(lp) \
-	(((lp) * SCHED_LOAD_SCALE) / TIME_SLICE_NICE_ZERO)
-#define PRIO_TO_LOAD_WEIGHT(prio) \
-	load_weight(static_prio_timeslice(prio))
-#define RTPRIO_TO_LOAD_WEIGHT(rp) \
-	(PRIO_TO_LOAD_WEIGHT(MAX_RT_PRIO) + load_weight(rp))
-
 #define WEIGHT_IDLEPRIO		2
 #define WMULT_IDLEPRIO		(1 << 31)
 
@@ -2906,8 +2893,7 @@ static void active_load_balance(struct r
 		schedstat_inc(sd, alb_cnt);
 
 		if (move_tasks(target_rq, target_cpu, busiest_rq, 1,
-			   RTPRIO_TO_LOAD_WEIGHT(100), sd, CPU_IDLE,
-			   NULL))
+			   ULONG_MAX, sd, CPU_IDLE, NULL))
 			schedstat_inc(sd, alb_pushed);
 		else
 			schedstat_inc(sd, alb_failed);


Minor errors in 2.6.23-rc1-rt2 series

2007-07-25 Thread Peter Williams
I've just been reviewing these patches and have spotted a couple of
errors that look like they were caused by fuzz during the patch process.

A patch that corrects the errors is attached.

Cheers
Peter
-- 
Peter Williams   [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce

diff -r e02fd64426b9 arch/i386/boot/compressed/Makefile
--- a/arch/i386/boot/compressed/Makefile	Thu Jul 26 10:33:58 2007 +1000
+++ b/arch/i386/boot/compressed/Makefile	Thu Jul 26 11:17:35 2007 +1000
@@ -9,10 +9,9 @@ EXTRA_AFLAGS   := -traditional
 EXTRA_AFLAGS   := -traditional
 
 LDFLAGS_vmlinux := -T
-CFLAGS := -m32 -D__KERNEL__ -Iinclude -O2  -fno-strict-aliasing
 hostprogs-y:= relocs
 
-CFLAGS  := -m32 -D__KERNEL__ $(LINUX_INCLUDE) -O2 \
+CFLAGS  := -m32 -D__KERNEL__ $(LINUX_INCLUDE) -Iinclude -O2 \
   -fno-strict-aliasing -fPIC \
   $(call cc-option,-ffreestanding) \
   $(call cc-option,-fno-stack-protector)
diff -r e02fd64426b9 arch/i386/kernel/smp.c
--- a/arch/i386/kernel/smp.c	Thu Jul 26 10:33:58 2007 +1000
+++ b/arch/i386/kernel/smp.c	Thu Jul 26 11:17:35 2007 +1000
@@ -651,7 +651,6 @@ fastcall notrace void smp_reschedule_int
 fastcall notrace void smp_reschedule_interrupt(struct pt_regs *regs)
 {
trace_special(regs->eip, 0, 0);
-   trace_special(regs->eip, 0, 0);
ack_APIC_irq();
set_tsk_need_resched(current);
 }
diff -r e02fd64426b9 include/asm-mips/mipsregs.h
--- a/include/asm-mips/mipsregs.h   Thu Jul 26 10:33:58 2007 +1000
+++ b/include/asm-mips/mipsregs.h   Thu Jul 26 11:17:35 2007 +1000
@@ -710,7 +710,7 @@ do {
\
unsigned long long __val;   \
unsigned long __flags;  \
\
-   local_irq_save(flags);  \
+   local_irq_save(__flags);\
if (sel == 0)   \
__asm__ __volatile__(   \
".set\tmips64\n\t"  \



Re: [ANNOUNCE][RFC] PlugSched-6.5.1 for 2.6.22

2007-07-16 Thread Peter Williams
Ingo Molnar wrote:
> * Peter Williams <[EMAIL PROTECTED]> wrote:
> 
>> Probably the last one now that CFS is in the main line :-(.
> 
> hm, why is CFS in mainline a problem?

It means a major rewrite of the plugsched interface and I'm not sure
that it's worth it (if CFS works well).  However, note that I did say
probably not definitely :-).  I'll play with it and see what happens.

> The CFS merge should make the life 
> of development/test patches like plugsched conceptually easier. (it will 
> certainly cause a lot of churn, but that's for the better i think.)

I don't think that is necessarily the case.

> 
> Most of the schedulers in plugsched should be readily adaptable to the 
> modular scheduling-policy scheme of the upstream scheduler.

I don't think that this necessarily true.  Ingosched and ingo_ll are
definitely out and I don't feel like converting staircase and nicksched
as I have no real interest in them.  Perhaps I'll just create the
interface and some schedulers based on my own ideas and let others such
as Con and Nick add schedulers if they're still that way inclined.

> I'm sure 
> there will be some minor issues as isolation of the modules is not 
> enforced right now - and i'd be happy to review (and potentially apply) 
> common-sense patches that improve the framework.

Thanks for the offer of support (it may sway my decision),
Peter
-- 
Peter Williams   [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: Forward port of latest RT patch (2.6.21.5-rt20) to 2.6.22 available

2007-07-13 Thread Peter Williams
Gene Heskett wrote:
> On Friday 13 July 2007, Peter Williams wrote:
>> Ingo Molnar wrote:
>>> * Gregory Haskins <[EMAIL PROTECTED]> wrote:
>>>> On Thu, 2007-07-12 at 14:07 +0200, Ingo Molnar wrote:
>>>>> * Gregory Haskins <[EMAIL PROTECTED]> wrote:
>>>>>> Hi Ingo, Thomas, and the greater linux-rt community,
>>>>>>
>>>>>>  I just wanted to let you guys know that our team has a port of
>>>>>> the 21.5-rt20 patch for the 2.6.22 kernel available. [...]
>>>>> great! We had the upstream -rt port to .22 in the works too, it was just
>>>>> held up by the hpet breakage - which Thomas managed to fix earlier
>>>>> today. I've released the 2.6.22.1-rt1 patch to the usual place:
>>>>>
>>>>> http://redhat.com/~mingo/realtime-preempt/
>>>> Thats awesome, Ingo!  Thanks!  Could you publish a broken out version
>>>> as well?  We found it extremely valuable to be able to bisect this
>>>> beast while working on the 21-22 port.
>>> we are working on something in this area :) Stay tuned ...
>> I've just been reviewing these patches and have spotted an error in the
>> file mm/slob.c at lines 500-501 whereby a non existent variable "c" is
>> referenced.  The attached patch is a proposed fix to the problem.
> 
> Could this explain why 2.6.22.1-rt1 seems to use a lot of swap?  I've been as 
> high as 570 megs into swap, currently at 286megs after doing a 
> swapoff --a;swapon -a about 8 hours ago.

No.  This problem would have caused the build to fail if slob was
configured.

Peter
-- 
Peter Williams   [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Forward port of latest RT patch (2.6.21.5-rt20) to 2.6.22 available

2007-07-13 Thread Peter Williams
Ingo Molnar wrote:
> * Gregory Haskins <[EMAIL PROTECTED]> wrote:
> 
>> On Thu, 2007-07-12 at 14:07 +0200, Ingo Molnar wrote:
>>> * Gregory Haskins <[EMAIL PROTECTED]> wrote:
>>>
>>>> Hi Ingo, Thomas, and the greater linux-rt community,
>>>>   
>>>>I just wanted to let you guys know that our team has a port of 
>>>> the 21.5-rt20 patch for the 2.6.22 kernel available. [...]
>>> great! We had the upstream -rt port to .22 in the works too, it was just 
>>> held up by the hpet breakage - which Thomas managed to fix earlier 
>>> today. I've released the 2.6.22.1-rt1 patch to the usual place:
>>>
>>> http://redhat.com/~mingo/realtime-preempt/
>> Thats awesome, Ingo!  Thanks!  Could you publish a broken out version 
>> as well?  We found it extremely valuable to be able to bisect this 
>> beast while working on the 21-22 port.
> 
> we are working on something in this area :) Stay tuned ...

I've just been reviewing these patches and have spotted an error in the
file mm/slob.c at lines 500-501 whereby a non-existent variable "c" is
referenced.  The attached patch is a proposed fix to the problem.

-- 
Peter Williams   [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce
Fix error in realtime-preempt patch for mm/slob.c

This error was caused by a change to slob_free()'s interface.

Signed-off-by: Peter Williams <[EMAIL PROTECTED]>

diff -r cb0010b7bffe mm/slob.c
--- a/mm/slob.c	Fri Jul 13 15:24:45 2007 +1000
+++ b/mm/slob.c	Fri Jul 13 16:23:02 2007 +1000
@@ -493,14 +493,14 @@ void *kmem_cache_zalloc(struct kmem_cach
 }
 EXPORT_SYMBOL(kmem_cache_zalloc);
 
-static void __kmem_cache_free(void *b, int size)
+static void __kmem_cache_free(struct kmem_cache *c, void *b)
 {
 	atomic_dec(&c->items);
 
 	if (c->size <= MAX_SLOB_CACHE_SIZE)
 		slob_free(c, b, c->size);
 	else
-		free_pages((unsigned long)b, get_order(size));
+		free_pages((unsigned long)b, get_order(c->size));
 }
 
 static void kmem_rcu_free(struct rcu_head *head)
@@ -508,7 +508,7 @@ static void kmem_rcu_free(struct rcu_hea
 	struct slob_rcu *slob_rcu = (struct slob_rcu *)head;
 	void *b = (void *)slob_rcu - (slob_rcu->size - sizeof(struct slob_rcu));
 
-	__kmem_cache_free(b, slob_rcu->size);
+	__kmem_cache_free(slob_rcu, b);
 }
 
 void kmem_cache_free(struct kmem_cache *c, void *b)
@@ -520,7 +520,7 @@ void kmem_cache_free(struct kmem_cache *
 		slob_rcu->size = c->size;
 		call_rcu(&slob_rcu->head, kmem_rcu_free);
 	} else {
-		__kmem_cache_free(b, c->size);
+		__kmem_cache_free(c, b);
 	}
 }
 EXPORT_SYMBOL(kmem_cache_free);



[ANNOUNCE][RFC] PlugSched-6.5.1 for 2.6.22

2007-07-11 Thread Peter Williams

Probably the last one now that CFS is in the main line :-(.

A patch for 2.6.22 is available at:

<http://downloads.sourceforge.net/cpuse/plugsched-6.5.1-for-2.6.22.patch>

Very Brief Documentation:

You can select a default scheduler at kernel build time.  If you wish to
boot with a scheduler other than the default it can be selected at boot
time by adding:

cpusched=<scheduler>

to the boot command line where <scheduler> is one of: ingosched,
ingo_ll, nicksched, staircase, spa_no_frills, spa_ws, spa_svr, spa_ebs
or zaphod.  If you don't change the default when you build the kernel
the default scheduler will be ingosched (which is the normal scheduler).

The scheduler in force on a running system can be determined by the
contents of:

/proc/scheduler

Control parameters for the scheduler can be read/set via files in:

/sys/cpusched/<scheduler>/
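
(A small stand-alone helper, not part of the patch, that prints the
active scheduler and lists its control files.  It assumes /proc/scheduler
contains just the scheduler name, which may not match the real format
exactly:)

#include <stdio.h>
#include <dirent.h>

int main(void)
{
	char name[64] = "";
	char path[128];
	FILE *f;
	DIR *d;
	struct dirent *e;

	/* Which scheduler is in force? */
	f = fopen("/proc/scheduler", "r");
	if (f) {
		if (fscanf(f, "%63s", name) == 1)
			printf("active scheduler: %s\n", name);
		fclose(f);
	}
	if (name[0] == '\0')
		return 1;

	/* List that scheduler's control parameters under sysfs. */
	snprintf(path, sizeof(path), "/sys/cpusched/%s", name);
	d = opendir(path);
	if (d == NULL)
		return 1;
	while ((e = readdir(d)) != NULL)
		if (e->d_name[0] != '.')
			printf("tunable: %s/%s\n", path, e->d_name);
	closedir(d);
	return 0;
}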

Peter
--
Peter Williams   [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [patch] CFS scheduler, -v12

2007-05-30 Thread Peter Williams

Siddha, Suresh B wrote:

On Tue, May 29, 2007 at 07:18:18PM -0700, Peter Williams wrote:

Siddha, Suresh B wrote:

I can try 32-bit kernel to check.

Don't bother.  I just checked 2.6.22-rc3 and the problem is not present
which means something between rc2 and rc3 has fixed the problem.  I hate
it when problems (appear to) fix themselves as it usually means they're
just hiding.

I didn't see any patches between rc2 and rc3 that were likely to have
fixed this (but doesn't mean there wasn't one).  I'm wondering whether I
should do a git bisect to see if I can find where it got fixed?

Could you see if you can reproduce it on 2.6.22-rc2?


No. Just tried 2.6.22-rc2 64-bit version at runlevel 3 on my remote
system at office. 15 attempts didn't show the issue.

Sure that nothing changed in your test setup?

More experiments tomorrow morning..


I've finished bisecting and the patch at which things appear to improve 
is cd5477911fc9f5cc64678e2b95cdd606c59a11b5 which is in the middle of a 
bunch of patches reorganizing the link phase of the build.  Patch 
description is:


kbuild: add "Section mismatch" warning whitelist for powerpc
author  Li Yang <[EMAIL PROTECTED]>
Mon, 14 May 2007 10:04:28 +0000 (18:04 +0800)
committer   Sam Ravnborg <[EMAIL PROTECTED]>
Sat, 19 May 2007 07:11:57 +0000 (09:11 +0200)
commit  cd5477911fc9f5cc64678e2b95cdd606c59a11b5
tree    d893f07b0040d36dfc60040dc695384e9afcf103
parent  f892b7d480eec809a5dfbd6e65742b3f3155e50e
kbuild: add "Section mismatch" warning whitelist for powerpc

This patch fixes the following class of "Section mismatch" warnings when
building powerpc platforms.

WARNING: arch/powerpc/kernel/built-in.o - Section mismatch: reference to 
.init.data:.got2 from prom_entry (offset 0x0)
WARNING: arch/powerpc/platforms/built-in.o - Section mismatch: reference 
to .init.text:mpc8313_rdb_probe from .machine.desc after 
'mach_mpc8313_rdb' (at offset 0x4)



Signed-off-by: Li Yang <[EMAIL PROTECTED]>
Signed-off-by: Sam Ravnborg <[EMAIL PROTECTED]>

scripts/mod/modpost.c

Peter
--
Peter Williams   [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] CFS scheduler, -v12

2007-05-30 Thread Peter Williams

Siddha, Suresh B wrote:

On Tue, May 29, 2007 at 07:18:18PM -0700, Peter Williams wrote:

Siddha, Suresh B wrote:

I can try 32-bit kernel to check.

Don't bother.  I just checked 2.6.22-rc3 and the problem is not present
which means something between rc2 and rc3 has fixed the problem.  I hate
it when problems (appear to) fix themselves as it usually means they're
just hiding.

I didn't see any patches between rc2 and rc3 that were likely to have
fixed this (but doesn't mean there wasn't one).  I'm wondering whether I
should do a git bisect to see if I can find where it got fixed?

Could you see if you can reproduce it on 2.6.22-rc2?


No. Just tried 2.6.22-rc2 64-bit version at runlevel 3 on my remote
system at office. 15 attempts didn't show the issue.

Sure that nothing changed in your test setup?



I just rechecked with an old kernel and the problem was still there.

Peter
--
Peter Williams   [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ckrm-tech] [RFC] [PATCH 0/3] Add group fairness to CFS

2007-05-29 Thread Peter Williams

William Lee Irwin III wrote:

On Wed, May 30, 2007 at 10:09:28AM +1000, Peter Williams wrote:
So what you're saying is that you think dynamic priority (or its 
equivalent) should be used for load balancing instead of static priority?


It doesn't do much in other schemes, but when fairness is directly
measured by the dynamic priority, it is a priori more meaningful.
This is not to say the net effect of using it is so different.


I suspect that while it's probably theoretically better it wouldn't make 
much difference on a real system (probably not enough to justify any 
extra complexity if there were any).  The exception might be on systems 
where there were lots of CPU intensive tasks that used relatively large 
chunks of CPU each time they were runnable which would give the load 
balancer a more stable load to try and balance.  It might be worth the 
extra effort to get it exactly right on those systems.  On most normal 
systems this isn't the case and the load balancer is always playing 
catch up to a constantly changing scenario.


Peter
--
Peter Williams   [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] CFS scheduler, -v12

2007-05-29 Thread Peter Williams

Siddha, Suresh B wrote:

On Tue, May 29, 2007 at 04:54:29PM -0700, Peter Williams wrote:

I tried with various refresh rates of top too.. Do you see the issue
at runlevel 3 too?

I haven't tried that.

Do your spinners ever relinquish the CPU voluntarily?


Nope. Simple and plain while(1); 's

I can try 32-bit kernel to check.


Don't bother.  I just checked 2.6.22-rc3 and the problem is not present 
which means something between rc2 and rc3 has fixed the problem.  I hate 
it when problems (appear to) fix themselves as it usually means they're 
just hiding.


I didn't see any patches between rc2 and rc3 that were likely to have 
fixed this (but doesn't mean there wasn't one).  I'm wondering whether I 
should do a git bisect to see if I can find where it got fixed?


Could you see if you can reproduce it on 2.6.22-rc2?

Thanks
Peter
--
Peter Williams   [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ckrm-tech] [RFC] [PATCH 0/3] Add group fairness to CFS

2007-05-29 Thread Peter Williams

William Lee Irwin III wrote:

William Lee Irwin III wrote:

Lag should be considered in lieu of load because lag


On Sun, May 27, 2007 at 11:29:51AM +1000, Peter Williams wrote:

What's the definition of lag here?


Lag is the deviation of a task's allocated CPU time from the CPU time
it would be granted by the ideal fair scheduling algorithm (generalized
processor sharing; take the limit of RR with per-task timeslices
proportional to load weight as the scale factor approaches zero).


Over what time period does this operate?


Negative lag reflects receipt of excess CPU time. A close-to-canonical
"fairness metric" is the maximum of the absolute values of the lags of
all the tasks on the system. The "signed minimax pseudonorm" is the
largest lag without taking absolute values; it's a term I devised ad
hoc to describe the proposed algorithm.
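
(Restating the above in symbols as I read it -- the notation is mine,
not wli's.  With S_i(t_0,t) the CPU time task i actually received over
(t_0,t) and S^{GPS}_i(t_0,t) what generalized processor sharing would
have granted it:

    lag_i(t) = S^{GPS}_i(t_0,t) - S_i(t_0,t)

so the fairness metric is \max_i |lag_i(t)| and the signed minimax
pseudonorm is \max_i lag_i(t), i.e. the most CPU-starved task's lag
without taking absolute values.)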


So what you're saying is that you think dynamic priority (or its 
equivalent) should be used for load balancing instead of static priority?




William Lee Irwin III wrote:

is what the
scheduler is trying to minimize;


On Sun, May 27, 2007 at 11:29:51AM +1000, Peter Williams wrote:
This isn't always the case.  Some may prefer fairness to minimal lag. 
Others may prefer particular tasks to receive preferential treatment.


This comment does not apply. Generalized processor sharing expresses
preferential treatment via weighting. Various other forms of
preferential treatment require more elaborate idealized models.


This was said before I realized that your "lag" is just a measure of 
fairness.






load is not directly relevant, but
appears to have some sort of relationship. Also, instead of pinned,
unpinned should be considered.


On Sun, May 27, 2007 at 11:29:51AM +1000, Peter Williams wrote:
If you have total and pinned you can get unpinned.  It's probably 
cheaper to maintain data for pinned than unpinned as there's less of it 
on normal systems.


Regardless of the underlying accounting,


I was just replying to your criticism of my suggestion to keep pinned 
task statistics and use them.



I've presented a coherent
algorithm. It may be that there's no demonstrable problem to solve.
On the other hand, if there really is a question as to how to load
balance in the presence of tasks pinned to cpus, I just answered it.


Unless I missed something there's nothing in your suggestion that does 
anything more about handling pinned tasks than is already done by the 
load balancer.





William Lee Irwin III wrote:

Using the signed minimax pseudonorm (i.e. the highest
signed lag, where positive is higher than all negative regardless of
magnitude) on unpinned lags yields a rather natural load balancing
algorithm consisting of migrating from highest to lowest signed lag,
with progressively longer periods for periodic balancing across
progressively higher levels of hierarchy in sched_domains etc. as usual.
Basically skip over pinned tasks as far as lag goes.
The trick with all that comes when tasks are pinned within a set of
cpus (especially crossing sched_domains) instead of to a single cpu.


On Sun, May 27, 2007 at 11:29:51AM +1000, Peter Williams wrote:
Yes, this makes the cost of maintaining the required data higher which 
makes keeping pinned data more attractive than unpinned.
BTW keeping data for sets of CPU affinities could cause problems as the 
number of possible sets is quite large (being 2 to the power of the 
number of CPUs).  So you need an algorithm based on pinned data for 
single CPUs that knows the pinning isn't necessarily exclusive rather 
than one based on sets of CPUs.  As I understand it (which may be 
wrong), the mechanism you describe below takes that approach.


Yes, the mechanism I described takes that approach.


William Lee Irwin III wrote:

The smpnice affair is better phrased in terms of task weighting. It's
simple to honor nice in such an arrangement. First unravel the
grouping hierarchy, then weight by nice. This looks like

[...]

In such a manner nice numbers obey the principle of least surprise.


On Sun, May 27, 2007 at 11:29:51AM +1000, Peter Williams wrote:
Is it just me or did you stray from the topic of handling cpu affinity 
during load balancing to hierarchical load balancing?  I couldn't see 
anything in the above explanation that would improve the handling of cpu 
affinity.


There was a second issue raised to which I responded. I didn't stray
per se. I addressed a second topic in the post.


OK.

To reiterate, I don't think that my suggestion is really necessary.  I 
think that the current load balancing (stand fast a small bug that's 
being investigated) will come up with a good distribution of tasks to 
CPUs within the constraints imposed by any CPU affinity settings.


Peter
--
Peter Williams   [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce
-
To unsubscribe from this list: send the lin

Re: [patch] CFS scheduler, -v12

2007-05-29 Thread Peter Williams

Siddha, Suresh B wrote:

On Thu, May 24, 2007 at 04:23:19PM -0700, Peter Williams wrote:

Siddha, Suresh B wrote:

On Thu, May 24, 2007 at 12:43:58AM -0700, Peter Williams wrote:

Further testing indicates that CONFIG_SCHED_MC is not implicated and
it's CONFIG_SCHED_SMT that's causing the problem.  This rules out the
code in find_busiest_group() as it is common to both macros.

I think this makes the scheduling domain parameter values the most
likely cause of the problem.  I'm not very familiar with this code so
I've added those who've modified this code in the last year or
so to the
address of this e-mail.

What platform is this? I remember you mentioned its a 2 cpu box. Is it
dual core or dual package or one with HT?

It's a single CPU HT box i.e. 2 virtual CPUs.  "cat /proc/cpuinfo"
produces:


Peter, I tried on a similar box and couldn't reproduce this problem
with x86_64


Mine's a 32 bit machine.


2.6.22-rc3 kernel


I haven't tried rc3 yet.


and using defconfig(has SCHED_SMT turned on).
I am using top and just the spinners.  I don't have gkrellm running, is that
required to reproduce the issue?


Not necessarily.  But you may need to do a number of trials as sheer 
chance plays a part.




I tried number of times and also in runlevels 3,5(with top running
in a xterm incase of runlevel 5).


I've always done it in run level 5 using gnome-terminal.  I use 10 
consecutive trials without seeing the problem as an indication of its 
absence but will cut that short if I see a 3/1 which quickly recovers 
(see below).




In runlevel 5, occasionally for one refresh screen of top, I see three
spinners on one cpu and one spinner on other(with X or someother app
also on the cpu with one spinner). But it balances nicely for the
immd next refresh of the top screen.


Yes, that (the fact that it recovers quickly) confirms that the problem 
isn't present for your system.  If load balancing occurs when tasks 
other than the spinners are actually running, a 1/3 split for the 
spinners is a reasonable outcome, so seeing the occasional 1/3 split is 
OK but it should return to 2/2 as soon as the other tasks sleep.


When I'm doing my tests (for the various combinations of macros) I 
always count a case where I see a 3/1 split that quickly recovers as 
proof that this problem isn't present for that case and cease testing.




I tried with various refresh rates of top too.. Do you see the issue
at runlevel 3 too?


I haven't tried that.

Do your spinners ever relinquish the CPU voluntarily?

Peter
--
Peter Williams   [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ckrm-tech] [RFC] [PATCH 0/3] Add group fairness to CFS

2007-05-29 Thread Peter Williams

William Lee Irwin III wrote:

William Lee Irwin III wrote:

Lag should be considered in lieu of load because lag


On Sun, May 27, 2007 at 11:29:51AM +1000, Peter Williams wrote:

What's the definition of lag here?


Lag is the deviation of a task's allocated CPU time from the CPU time
it would be granted by the ideal fair scheduling algorithm (generalized
processor sharing; take the limit of RR with per-task timeslices
proportional to load weight as the scale factor approaches zero).


Over what time period does this operate?


Negative lag reflects receipt of excess CPU time. A close-to-canonical
fairness metric is the maximum of the absolute values of the lags of
all the tasks on the system. The signed minimax pseudonorm is the
largest lag without taking absolute values; it's a term I devised ad
hoc to describe the proposed algorithm.


So what you're saying is that you think dynamic priority (or its 
equivalent) should be used for load balancing instead of static priority?
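
For concreteness, a minimal userspace sketch of the lag and signed
minimax pseudonorm defined above might look like the following (this is
not kernel code; the task names, weights and service figures are invented
inputs used purely to make the definitions concrete):

#include <stdio.h>

struct task_sample {
        const char *name;
        double weight;          /* load weight, e.g. derived from nice level */
        double service;         /* CPU time actually received so far */
        int pinned;             /* 1 if the task cannot be migrated */
};

int main(void)
{
        struct task_sample t[] = {
                { "spinner1", 1.0, 2.6, 0 },
                { "spinner2", 1.0, 2.4, 0 },
                { "xorg",     1.0, 0.3, 1 },
        };
        int i, n = sizeof(t) / sizeof(t[0]), seen = 0;
        double total_weight = 0.0, total_service = 0.0, minimax = 0.0;

        for (i = 0; i < n; i++) {
                total_weight += t[i].weight;
                total_service += t[i].service;
        }
        for (i = 0; i < n; i++) {
                /* ideal GPS share of the CPU time handed out so far */
                double ideal = total_service * t[i].weight / total_weight;
                double lag = ideal - t[i].service;

                printf("%-9s lag %+.3f%s\n", t[i].name, lag,
                       t[i].pinned ? " (pinned, skipped)" : "");
                if (t[i].pinned)
                        continue;       /* skip pinned tasks, as suggested */
                if (!seen++ || lag > minimax)
                        minimax = lag;  /* largest signed lag, sign kept */
        }
        printf("signed minimax pseudonorm over unpinned tasks: %+.3f\n",
               minimax);
        return 0;
}

Under this definition a positive lag means a task has received less than
its generalized-processor-sharing share, and the pseudonorm simply picks
out the worst such case among the tasks that can actually be moved.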




William Lee Irwin III wrote:

is what the
scheduler is trying to minimize;


On Sun, May 27, 2007 at 11:29:51AM +1000, Peter Williams wrote:
This isn't always the case.  Some may prefer fairness to minimal lag. 
Others may prefer particular tasks to receive preferential treatment.


This comment does not apply. Generalized processor sharing expresses
preferential treatment via weighting. Various other forms of
preferential treatment require more elaborate idealized models.


This was said before I realized that your lag is just a measure of 
fairness.






load is not directly relevant, but
appears to have some sort of relationship. Also, instead of pinned,
unpinned should be considered.


On Sun, May 27, 2007 at 11:29:51AM +1000, Peter Williams wrote:
If you have total and pinned you can get unpinned.  It's probably 
cheaper to maintain data for pinned than unpinned as there's less of it 
on normal systems.


Regardless of the underlying accounting,


I was just replying to your criticism of my suggestion to keep pinned 
task statistics and use them.



I've presented a coherent
algorithm. It may be that there's no demonstrable problem to solve.
On the other hand, if there really is a question as to how to load
balance in the presence of tasks pinned to cpus, I just answered it.


Unless I missed something there's nothing in your suggestion that does 
anything more about handling pinned tasks than is already done by the 
load balancer.





William Lee Irwin III wrote:

Using the signed minimax pseudonorm (i.e. the highest
signed lag, where positive is higher than all negative regardless of
magnitude) on unpinned lags yields a rather natural load balancing
algorithm consisting of migrating from highest to lowest signed lag,
with progressively longer periods for periodic balancing across
progressively higher levels of hierarchy in sched_domains etc. as usual.
Basically skip over pinned tasks as far as lag goes.
The trick with all that comes when tasks are pinned within a set of
cpus (especially crossing sched_domains) instead of to a single cpu.


On Sun, May 27, 2007 at 11:29:51AM +1000, Peter Williams wrote:
Yes, this makes the cost of maintaining the required data higher which 
makes keeping pinned data more attractive than unpinned.
BTW keeping data for sets of CPU affinities could cause problems as the 
number of possible sets is quite large (being 2 to the power of the 
number of CPUs).  So you need an algorithm based on pinned data for 
single CPUs that knows the pinning isn't necessarily exclusive rather 
than one based on sets of CPUs.  As I understand it (which may be 
wrong), the mechanism you describe below takes that approach.


Yes, the mechanism I described takes that approach.


William Lee Irwin III wrote:

The smpnice affair is better phrased in terms of task weighting. It's
simple to honor nice in such an arrangement. First unravel the
grouping hierarchy, then weight by nice. This looks like

[...]

In such a manner nice numbers obey the principle of least surprise.


On Sun, May 27, 2007 at 11:29:51AM +1000, Peter Williams wrote:
Is it just me or did you stray from the topic of handling cpu affinity 
during load balancing to hierarchical load balancing?  I couldn't see 
anything in the above explanation that would improve the handling of cpu 
affinity.


There was a second issue raised to which I responded. I didn't stray
per se. I addressed a second topic in the post.


OK.

To reiterate, I don't think that my suggestion is really necessary.  I 
think that the current load balancing (notwithstanding a small bug that's 
being investigated) will come up with a good distribution of tasks to 
CPUs within the constraints imposed by any CPU affinity settings.


Peter
--
Peter Williams   [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message

Re: [patch] CFS scheduler, -v12

2007-05-29 Thread Peter Williams

Siddha, Suresh B wrote:

On Tue, May 29, 2007 at 04:54:29PM -0700, Peter Williams wrote:

I tried with various refresh rates of top too.. Do you see the issue
at runlevel 3 too?

I haven't tried that.

Do your spinners ever relinquish the CPU voluntarily?


Nope. Simple and plain while(1); 's

I can try 32-bit kernel to check.


Don't bother.  I just checked 2.6.22-rc3 and the problem is not present 
which means something between rc2 and rc3 has fixed the problem.  I hate 
it when problems (appear to) fix themselves as it usually means they're 
just hiding.


I didn't see any patches between rc2 and rc3 that were likely to have 
fixed this (but that doesn't mean there wasn't one).  I'm wondering whether I 
should do a git bisect to see if I can find where it got fixed?


Could you see if you can reproduce it on 2.6.22-rc2?

Thanks
Peter
--
Peter Williams   [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ckrm-tech] [RFC] [PATCH 0/3] Add group fairness to CFS

2007-05-29 Thread Peter Williams

William Lee Irwin III wrote:

On Wed, May 30, 2007 at 10:09:28AM +1000, Peter Williams wrote:
So what you're saying is that you think dynamic priority (or its 
equivalent) should be used for load balancing instead of static priority?


It doesn't do much in other schemes, but when fairness is directly
measured by the dynamic priority, it is a priori more meaningful.
This is not to say the net effect of using it is so different.


I suspect that, while it's probably theoretically better, it wouldn't make 
much difference on a real system (probably not enough to justify any 
extra complexity if there were any).  The exception might be systems 
where there were lots of CPU-intensive tasks that used relatively large 
chunks of CPU each time they were runnable, which would give the load 
balancer a more stable load to try and balance.  It might be worth the 
extra effort to get it exactly right on those systems.  On most normal 
systems this isn't the case and the load balancer is always playing 
catch-up to a constantly changing scenario.


Peter
--
Peter Williams   [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ckrm-tech] [RFC] [PATCH 0/3] Add group fairness to CFS

2007-05-28 Thread Peter Williams

Peter Williams wrote:

Srivatsa Vaddagiri wrote:

On Sat, May 26, 2007 at 10:17:42AM +1000, Peter Williams wrote:
I don't think that ignoring cpu affinity is an option.  Setting the 
cpu affinity of tasks is a deliberate policy action on the part of 
the system administrator and has to be honoured.  


mmm ..but users can set cpu affinity w/o administrator privileges ..



OK. So you have to assume the users know what they're doing. :-)

In reality though, the policy of allowing ordinary users to set affinity 
on their tasks should be rethought.


After more contemplation, I now think I may have gone overboard here.  I 
am now of the opinion that any degradation of overall system performance 
due to the use of cpu affinity would be confined to the tasks with cpu 
affinity set.  So there's no need to prevent ordinary users from setting 
cpu affinity on their own processes as any degradation will only affect 
them.


So it goes back to the situation where you have to assume that they know 
what they're doing and obey their policy.
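
For reference, an ordinary user pinning one of their own tasks needs
nothing more than sched_setaffinity() on that task; a minimal sketch,
which also doubles as one of the spinner loads discussed elsewhere in the
thread, might look like this (the CPU number is chosen arbitrarily):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(0, &set);       /* restrict this process to CPU 0 */

        /* pid 0 means "the calling process"; no special privileges needed */
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
                perror("sched_setaffinity");
                return 1;
        }

        for (;;)
                ;               /* plain busy loop, as used in the tests */
        return 0;
}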




In any case, there's no point having cpu affinity if it's going to be 
ignored.  Maybe you could have two levels of affinity: 1. if set by 
root it must be obeyed; and 2. if set by an ordinary user it can be 
overridden if the best interests of the system dictate.  BUT I think 
that would be a bad idea.


This idea is now not just bad but unnecessary.

Peter
--
Peter Williams   [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ckrm-tech] [RFC] [PATCH 0/3] Add group fairness to CFS

2007-05-28 Thread Peter Williams

Srivatsa Vaddagiri wrote:

On Sat, May 26, 2007 at 10:17:42AM +1000, Peter Williams wrote:
I don't think that ignoring cpu affinity is an option.  Setting the cpu 
affinity of tasks is a deliberate policy action on the part of the 
system administrator and has to be honoured.  


mmm ..but users can set cpu affinity w/o administrator privileges ..



OK. So you have to assume the users know what they're doing. :-)

In reality though, the policy of allowing ordinary users to set affinity 
on their tasks should be rethought.


In any case, there's no point having cpu affinity if it's going to be 
ignored.  Maybe you could have two levels of affinity: 1. if set by 
root it must be obeyed; and 2. if set by an ordinary user it can be 
overridden if the best interests of the system dictate.  BUT I think 
that would be a bad idea.


Peter
--
Peter Williams   [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/




Re: [ckrm-tech] [RFC] [PATCH 0/3] Add group fairness to CFS

2007-05-26 Thread Peter Williams

William Lee Irwin III wrote:

Srivatsa Vaddagiri wrote:
Ingo/Peter, any thoughts here?  CFS and smpnice probably is "broken" 
with respect to such example as above albeit for nice-based tasks.


On Sat, May 26, 2007 at 10:17:42AM +1000, Peter Williams wrote:
See above.  I think that faced with cpu affinity use by the system 
administrator that smpnice will tend towards a task to cpu allocation 
that is (close to) the best that can be achieved without violating the 
cpu affinity assignments.  (It may take a little longer than normal but 
it should get there eventually.)
You have to assume that the system administrator knows what (s)he's 
doing and is willing to accept the impact of their policy decision on 
the overall system performance.
Having said that, if it was deemed necessary you could probably increase 
the speed at which the load balancer converged on a good result in the 
face of cpu affinity by keeping a "pinned weighted load" value for each 
run queue and using that to modify find_busiest_group() and 
find_busiest_queue() to be a bit smarter.   But I'm not sure that it 
would be worth the added complexity.


Just in case anyone was looking for algorithms...

Lag should be considered in lieu of load because lag


What's the definition of lag here?


is what the
scheduler is trying to minimize;


This isn't always the case.  Some may prefer fairness to minimal lag. 
Others may prefer particular tasks to receive preferential treatment.



load is not directly relevant, but
appears to have some sort of relationship. Also, instead of pinned,
unpinned should be considered.


If you have total and pinned you can get unpinned.  It's probably 
cheaper to maintain data for pinned than unpinned as there's less of it 
on normal systems.



It's unpinned that load balancing can
actually migrate.


True but see previous comment.


Using the signed minimax pseudonorm (i.e. the highest
signed lag, where positive is higher than all negative regardless of
magnitude) on unpinned lags yields a rather natural load balancing
algorithm consisting of migrating from highest to lowest signed lag,
with progressively longer periods for periodic balancing across
progressively higher levels of hierarchy in sched_domains etc. as usual.
Basically skip over pinned tasks as far as lag goes.

The trick with all that comes when tasks are pinned within a set of
cpus (especially crossing sched_domains) instead of to a single cpu.


Yes, this makes the cost of maintaining the required data higher which 
makes keeping pinned data more attractive than unpinned.


BTW keeping data for sets of CPU affinities could cause problems as the 
number of possible sets is quite large (being 2 to the power of the 
number of CPUs).  So you need an algorithm based on pinned data for 
single CPUs that knows the pinning isn't necessarily exclusive rather 
than one based on sets of CPUs.  As I understand it (which may be 
wrong), the mechanism you describe below takes that approach.



There one can just consider a cpu to enter a periodic load balance
cycle, and then consider pushing and pulling, perhaps what could be
called the "exchange lags" for the pair of cpus. That would be the
minimax lag pseudonorms for the tasks migratable to both cpus of the
pair. That makes the notion of moving things from highest to lowest
lag (where load is now considered) unambiguous apart from whether all
this converges, but not when to actually try to load balance vs. when
not to, or when it's urgent vs. when it should be done periodically.

To clarify that, an O(cpus**2) notion appears to be necessary, namely
the largest exchange lag differential between any pair of cpus. There
is also the open question of whether moving tasks between cpus with the
highest exchange lag differential will actually reduce it or whether it
runs the risk of increasing it by creating a larger exchange lag
differential between different pairs of cpus. A similar open question
is raised by localizing balancing decisions to sched_domains. What
remains clear is that any such movement reduces the worst-case lag in
the whole system. Because of that, the worst-case lag in the whole
system monotonically decreases as balancing decisions are made, and
that much is subject to an infinite descent argument. Unfortunately,
determining the largest exchange lag differential appears to be more
complex than merely finding the highest and lowest lags. Bipartite
forms of the problem also arise from sched_domains.

I doubt anyone's really paying any sort of attention, so I'll not
really bother working out much more in the way of details with respect
to load balancing. It may be that there are better ways to communicate
algorithmic notions than prose descriptions. However, it's doubtful I'll
produce anything in a timely enough fashion to attract or hold interest.

The smpnice affair is better phrased in terms of task weighting. It's
simple to honor nice in such an arrangeme


Re: [ckrm-tech] [RFC] [PATCH 0/3] Add group fairness to CFS

2007-05-25 Thread Peter Williams

Srivatsa Vaddagiri wrote:

Good example :) USER2's single task will have to share its CPU with
USER1's 50 tasks (unless we modify the smpnice load balancer to
disregard cpu affinity that is - which I would not prefer to do).


I don't think that ignoring cpu affinity is an option.  Setting the cpu 
affinity of tasks is a deliberate policy action on the part of the 
system administrator and has to be honoured.  Load balancing has to do 
the best it can in these circumstances, which may mean a suboptimal 
distribution of the load, BUT it is the result of a deliberate policy 
decision by the system administrator.




Ingo/Peter, any thoughts here?  CFS and smpnice probably is "broken" 
with respect to such example as above albeit for nice-based tasks.




See above.  I think that faced with cpu affinity use by the system 
administrator that smpnice will tend towards a task to cpu allocation 
that is (close to) the best that can be achieved without violating the 
cpu affinity assignments.  (It may take a little longer than normal but 
it should get there eventually.)


You have to assume that the system administrator knows what (s)he's 
doing and is willing to accept the impact of their policy decision on 
the overall system performance.


Having said that, if it was deemed necessary you could probably increase 
the speed at which the load balancer converged on a good result in the 
face of cpu affinity by keeping a "pinned weighted load" value for each 
run queue and using that to modify find_busiest_group() and 
find_busiest_queue() to be a bit smarter.   But I'm not sure that it 
would be worth the added complexity.
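
A very rough sketch of the bookkeeping that idea implies -- invented
names, no locking, and nothing like the real struct rq layout -- might
look like this:

struct rq_load_sketch {
        unsigned long raw_weighted_load;        /* weight of all queued tasks */
        unsigned long pinned_weighted_load;     /* weight of tasks allowed here only */
};

/* a task counts as pinned if its affinity mask allows this CPU alone */
static int pinned_to(unsigned long cpus_allowed, int cpu)
{
        return cpus_allowed == (1UL << cpu);
}

static void sketch_enqueue(struct rq_load_sketch *rq, unsigned long weight,
                           unsigned long cpus_allowed, int cpu)
{
        rq->raw_weighted_load += weight;
        if (pinned_to(cpus_allowed, cpu))
                rq->pinned_weighted_load += weight;
}

static void sketch_dequeue(struct rq_load_sketch *rq, unsigned long weight,
                           unsigned long cpus_allowed, int cpu)
{
        rq->raw_weighted_load -= weight;
        if (pinned_to(cpus_allowed, cpu))
                rq->pinned_weighted_load -= weight;
}

/* what find_busiest_group()/find_busiest_queue() could compare instead of
 * the raw load when deciding how much can usefully be moved */
static unsigned long movable_weighted_load(const struct rq_load_sketch *rq)
{
        return rq->raw_weighted_load - rq->pinned_weighted_load;
}

The same accounting would also have to be refreshed whenever a task's
affinity mask changes, which is where most of the added complexity
referred to above would come from.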


Peter
--
Peter Williams   [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/




Re: [patch] CFS scheduler, -v12

2007-05-24 Thread Peter Williams

Siddha, Suresh B wrote:

On Thu, May 24, 2007 at 12:43:58AM -0700, Peter Williams wrote:

Peter Williams wrote:
The relevant code, find_busiest_group() and find_busiest_queue(), has a 
lot of code that is ifdefed by CONFIG_SCHED_MC and CONFIG_SCHED_SMT and, 
as these macros were defined in the kernels I was testing with, I built 
a kernel with these macros undefined and reran my tests.  The 
problems/anomalies were not present in 10 consecutive tests on this new 
kernel.  Even better on the few occasions that a 3/1 split did occur it 
was quickly corrected to 2/2 and top was reporting approx 49% of CPU for 
all spinners throughout each of the ten tests.


So all that is required now is an analysis of the code inside the ifdefs 
to see why it is causing a problem.


Further testing indicates that CONFIG_SCHED_MC is not implicated and
it's CONFIG_SCHED_SMT that's causing the problem.  This rules out the
code in find_busiest_group() as it is common to both macros.

I think this makes the scheduling domain parameter values the most
likely cause of the problem.  I'm not very familiar with this code so 
I've added those who've modified this code in the last year or 
so to the 
address of this e-mail.


What platform is this? I remember you mentioned its a 2 cpu box. Is it
dual core or dual package or one with HT?


It's a single CPU HT box i.e. 2 virtual CPUs.  "cat /proc/cpuinfo" produces:

processor   : 0
vendor_id   : GenuineIntel
cpu family  : 15
model   : 3
model name  : Intel(R) Pentium(R) 4 CPU 3.20GHz
stepping: 4
cpu MHz : 3201.145
cache size  : 1024 KB
physical id : 0
siblings: 2
core id : 0
cpu cores   : 1
fdiv_bug: no
hlt_bug : no
f00f_bug: no
coma_bug: no
fpu : yes
fpu_exception   : yes
cpuid level : 5
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge 
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe 
constant_tsc pni monitor ds_cpl cid xtpr

bogomips: 6403.97
clflush size: 64

processor   : 1
vendor_id   : GenuineIntel
cpu family  : 15
model   : 3
model name  : Intel(R) Pentium(R) 4 CPU 3.20GHz
stepping: 4
cpu MHz : 3201.145
cache size  : 1024 KB
physical id : 0
siblings: 2
core id : 0
cpu cores   : 1
fdiv_bug: no
hlt_bug : no
f00f_bug: no
coma_bug: no
fpu : yes
fpu_exception   : yes
cpuid level : 5
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge 
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe 
constant_tsc pni monitor ds_cpl cid xtpr

bogomips: 6400.92
clflush size: 64


Peter
--
Peter Williams   [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] CFS scheduler, -v12

2007-05-24 Thread Peter Williams

Peter Williams wrote:

Peter Williams wrote:

Peter Williams wrote:

Dmitry Adamushko wrote:

On 18/05/07, Peter Williams <[EMAIL PROTECTED]> wrote:
[...]

One thing that might work is to jitter the load balancing interval a
bit.  The reason I say this is that one of the characteristics of top
and gkrellm is that they run at a more or less constant interval (and,
in this case, X would also be following this pattern as it's doing
screen updates for top and gkrellm) and this means that it's possible
for the load balancing interval to synchronize with their intervals
which in turn causes the observed problem.


Hum.. I guess, a 0/4 scenario wouldn't fit well in this explanation..


No, and I haven't seen one.


all 4 spinners "tend" to be on CPU0 (and as I understand each gets
~25% approx.?), so there must be plenty of moments for
*idle_balance()* to be called on CPU1 - as gkrellm, top and X consume
together just a few % of CPU. Hence, we should not be that dependent
on the load balancing interval here..


The split that I see is 3/1 and neither CPU seems to be favoured with 
respect to getting the majority.  However, top, gkrellm and X seem to 
be always on the CPU with the single spinner.  The CPU% reported by 
top is approx. 33%, 33%, 33% and 100% for the spinners.


If I renice the spinners to -10 (so that their load weights dominate 
the run queue load calculations) the problem goes away and the 
spinner to CPU allocation is 2/2 and top reports them all getting 
approx. 50% each.


For no good reason other than curiosity, I tried a variation of this 
experiment where I reniced the spinners to 10 instead of -10 and, to 
my surprise, they were allocated 2/2 to the CPUs on average.  I say on 
average because the allocations were a little more volatile and 
occasionally 0/4 splits would occur but these would last for less than 
one top cycle before the 2/2 was re-established.  The quickness of 
these recoveries would indicate that it was most likely the idle 
balance mechanism that restored the balance.


This may point the finger at the tick based load balance mechanism 
being too conservative


The relevant code, find_busiest_group() and find_busiest_queue(), has a 
lot of code that is ifdefed by CONFIG_SCHED_MC and CONFIG_SCHED_SMT and, 
as these macros were defined in the kernels I was testing with, I built 
a kernel with these macros undefined and reran my tests.  The 
problems/anomalies were not present in 10 consecutive tests on this new 
kernel.  Even better on the few occasions that a 3/1 split did occur it 
was quickly corrected to 2/2 and top was reporting approx 49% of CPU for 
all spinners throughout each of the ten tests.


So all that is required now is an analysis of the code inside the ifdefs 
to see why it is causing a problem.


Further testing indicates that CONFIG_SCHED_MC is not implicated and
it's CONFIG_SCHED_SMT that's causing the problem.  This rules out the
code in find_busiest_group() as it is common to both macros.

I think this makes the scheduling domain parameter values the most
likely cause of the problem.  I'm not very familiar with this code so 
I've added those who've modified this code in the last year or so to the 
address of this e-mail.


Peter
--
Peter Williams   [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/




Re: [patch] CFS scheduler, -v12

2007-05-22 Thread Peter Williams

Dmitry Adamushko wrote:

On 22/05/07, Peter Williams <[EMAIL PROTECTED]> wrote:

> [...]
> Hum.. I guess, a 0/4 scenario wouldn't fit well in this explanation..

No, and I haven't seen one.


Well, I just took one of your calculated probabilities as something
you have really observed - (*) below.

"The probabilities for the 3 split possibilities for random allocation are:

  2/2 (the desired outcome) is 3/8 likely,
  1/3 is 4/8 likely, and
  0/4 is 1/8 likely.   <-- (*)
"


These are the theoretical probabilities for the outcomes based on the 
random allocation of 4 tasks to 2 CPUs.  There are, in fact, 16 
different ways that 4 tasks can be assigned to 2 CPUs.  6 of these 
result in a 2/2 split, 8 in a 1/3 split and 2 in a 0/4 split.
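
Those counts are easy to verify mechanically; a throwaway check (ordinary
userspace C, nothing scheduler-specific) gives the same 6/8/2 breakdown:

#include <stdio.h>

int main(void)
{
        int splits[5] = { 0 };  /* index = number of tasks landing on CPU0 */
        int assign, task;

        for (assign = 0; assign < 16; assign++) {       /* all 2^4 assignments */
                int on_cpu0 = 0;

                for (task = 0; task < 4; task++)
                        if (assign & (1 << task))
                                on_cpu0++;
                splits[on_cpu0]++;
        }
        printf("2/2 splits: %d\n", splits[2]);                  /* 6 */
        printf("1/3 splits: %d\n", splits[1] + splits[3]);      /* 8 */
        printf("0/4 splits: %d\n", splits[0] + splits[4]);      /* 2 */
        return 0;
}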





The split that I see is 3/1 and neither CPU seems to be favoured with
respect to getting the majority.  However, top, gkrellm and X seem to be
always on the CPU with the single spinner.  The CPU% reported by top is
approx. 33%, 33%, 33% and 100% for the spinners.


Yes. That said, idle_balance() is out of work in this case.


Which is why I reported the problem.




If I renice the spinners to -10 (so that their load weights dominate the
run queue load calculations) the problem goes away and the spinner to
CPU allocation is 2/2 and top reports them all getting approx. 50% each.


I wonder what would happen if X gets reniced to -10 instead (and
spinners are at 0).. I guess, something I described in my previous
mail (and dubbed "unlikely conspiracy" :) could happen, i.e. 0/4 and
then idle_balance() comes into play..


Probably the same as I observed but it's easier to renice the spinners.

I see the 0/4 split for brief moments if I renice the spinners to 10 
instead of -10 but the idle balancer quickly restores it to 2/2.




ok, I see. You have probably achieved a similar effect with the
spinners being reniced to 10 (but here both "X" and "top" gain
additional "weight" wrt the load balancing).


I'm playing with some jitter experiments at the moment.  The amount of
jitter needs to be small (a few tenths of a second) as the
synchronization (if it's happening) is happening at the seconds level as
the intervals for top and gkrellm will be in the 1 to 5 second range (I
guess -- I haven't checked) and the load balancing is every 60 seconds.


Hum.. the "every 60 seconds" part puzzles me quite a bit. Looking at
the run_rebalance_domain(), I'd say that it's normally overwritten by
the following code

  if (time_after(next_balance, sd->last_balance + interval))
   next_balance = sd->last_balance + interval;

the "interval" seems to be *normally* shorter than "60*HZ" (according
to the default params in topology.h).. moreover, in case of the CFS

   if (interval > HZ*NR_CPUS/10)
   interval = HZ*NR_CPUS/10;

so it can't be > 0.2 HZ in your case (== once in 200 ms at max with
HZ=1000).. am I missing something? TIA


No, I did.

But it's all academic as my synchronization theory is now dead -- see 
separate e-mail.
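
For what it's worth, Dmitry's arithmetic above checks out.  A quick
standalone check, assuming HZ=1000 and NR_CPUS=2 for this box (both are
assumptions about the configuration under discussion), gives the 200 ms
figure he quotes:

#include <stdio.h>

#define HZ      1000    /* assumption: kernel configured with HZ=1000 */
#define NR_CPUS 2       /* assumption: matches the 2 virtual CPUs here */

int main(void)
{
        unsigned long interval = 60UL * HZ;     /* the (mistaken) 60 s figure */

        /* the clamp quoted in Dmitry's mail above */
        if (interval > HZ * NR_CPUS / 10)
                interval = HZ * NR_CPUS / 10;

        printf("clamped interval: %lu jiffies (%lu ms)\n",
               interval, interval * 1000UL / HZ);       /* 200 jiffies, 200 ms */
        return 0;
}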


Peter
--
Peter Williams   [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] CFS scheduler, -v12

2007-05-22 Thread Peter Williams

Peter Williams wrote:

Peter Williams wrote:

Dmitry Adamushko wrote:

On 18/05/07, Peter Williams <[EMAIL PROTECTED]> wrote:
[...]

One thing that might work is to jitter the load balancing interval a
bit.  The reason I say this is that one of the characteristics of top
and gkrellm is that they run at a more or less constant interval (and,
in this case, X would also be following this pattern as it's doing
screen updates for top and gkrellm) and this means that it's possible
for the load balancing interval to synchronize with their intervals
which in turn causes the observed problem.


Hum.. I guess, a 0/4 scenario wouldn't fit well in this explanation..


No, and I haven't seen one.


all 4 spinners "tend" to be on CPU0 (and as I understand each gets
~25% approx.?), so there must be plenty of moments for
*idle_balance()* to be called on CPU1 - as gkrellm, top and X consume
together just a few % of CPU. Hence, we should not be that dependent
on the load balancing interval here..


The split that I see is 3/1 and neither CPU seems to be favoured with 
respect to getting the majority.  However, top, gkrellm and X seem to 
be always on the CPU with the single spinner.  The CPU% reported by 
top is approx. 33%, 33%, 33% and 100% for the spinners.


If I renice the spinners to -10 (so that their load weights dominate 
the run queue load calculations) the problem goes away and the spinner 
to CPU allocation is 2/2 and top reports them all getting approx. 50% 
each.


For no good reason other than curiosity, I tried a variation of this 
experiment where I reniced the spinners to 10 instead of -10 and, to my 
surprise, they were allocated 2/2 to the CPUs on average.  I say on 
average because the allocations were a little more volatile and 
occasionally 0/4 splits would occur but these would last for less than 
one top cycle before the 2/2 was re-established.  The quickness of these 
recoveries would indicate that it was most likely the idle balance 
mechanism that restored the balance.


This may point the finger at the tick based load balance mechanism being 
too conservative


The relevant code, find_busiest_group() and find_busiest_queue(), has a 
lot of code that is ifdefed by CONFIG_SCHED_MC and CONFIG_SCHED_SMT and, 
as these macros were defined in the kernels I was testing with, I built 
a kernel with these macros undefined and reran my tests.  The 
problems/anomalies were not present in 10 consecutive tests on this new 
kernel.  Even better on the few occasions that a 3/1 split did occur it 
was quickly corrected to 2/2 and top was reporting approx 49% of CPU for 
all spinners throughout each of the ten tests.


So all that is required now is an analysis of the code inside the ifdefs 
to see why it is causing a problem.


in when it decides whether tasks need to be moved.  In 
the case where the spinners are at nice == 0, the idle balance mechanism 
never comes into play as the 0/4 split is never seen so only the tick 
based mechanism is in force in this case and this is where the anomalies 
are seen.


This "tick rebalance mechanism only" situation is also true for the nice 
== -10 case, but in this case the high load weights of the spinners 
overcome the tick based load balancing mechanism's conservatism: e.g. 
the difference in queue loads for a 1/3 split in this case is 
equivalent to the difference that would be generated by an imbalance of 
about 18 nice == 0 spinners, i.e. too big to be ignored.
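
The "about 18" figure can be reproduced with a couple of lines of
arithmetic, assuming the nice-to-weight table used by the CFS patches
under discussion (nice 0 -> 1024, nice -10 -> 9548; these exact values
are stated here as an assumption):

#include <stdio.h>

int main(void)
{
        const unsigned long nice_0_weight   = 1024;     /* assumed table value */
        const unsigned long nice_m10_weight = 9548;     /* assumed table value */

        /* four nice -10 spinners split 1/3 across the two CPUs */
        unsigned long busy  = 3 * nice_m10_weight;
        unsigned long other = 1 * nice_m10_weight;
        unsigned long diff  = busy - other;

        printf("queue load difference: %lu (about %.1f nice-0 tasks)\n",
               diff, (double)diff / nice_0_weight);     /* ~18.6 */
        return 0;
}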


The evidence seems to indicate that IF a rebalance operation gets 
initiated then the right amount of load will get moved.


This new evidence weakens (but does not totally destroy) my 
synchronization (a.k.a. conspiracy) theory.


My synchronization theory is now dead.

Peter
--
Peter Williams   [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] CFS scheduler, -v12

2007-05-22 Thread Peter Williams

Dmitry Adamushko wrote:

On 22/05/07, Peter Williams [EMAIL PROTECTED] wrote:

 [...]
 Hum.. I guess, a 0/4 scenario wouldn't fit well in this explanation..

No, and I haven't seen one.


Well, I just took one of your calculated probabilities as something
you have really observed - (*) below.

The probabilities for the 3 split possibilities for random allocation are:

  2/2 (the desired outcome) is 3/8 likely,
  1/3 is 4/8 likely, and
  0/4 is 1/8 likely.-- (*)



These are the theoretical probabilities for the outcomes based on the 
random allocation of 4 tasks to 2 CPUs.  There are, in fact, 16 
different ways that 4 tasks can be assigned to 2 CPUs.  6 of these 
result in a 2/2 split, 8 in a 1/3 split and 2 in a 0/4 split.





The split that I see is 3/1 and neither CPU seems to be favoured with
respect to getting the majority.  However, top, gkrellm and X seem to be
always on the CPU with the single spinner.  The CPU% reported by top is
approx. 33%, 33%, 33% and 100% for the spinners.


Yes. That said, idle_balance() is out of work in this case.


Which is why I reported the problem.




If I renice the spinners to -10 (so that there load weights dominate the
run queue load calculations) the problem goes away and the spinner to
CPU allocation is 2/2 and top reports them all getting approx. 50% each.


I wonder what would happen if X gets reniced to -10 instead (and
spinners are at 0).. I guess, something I described in my previous
mail (and dubbed unlikely cospiracy :) could happen, i.e. 0/4 and
then idle_balance() comes into play..


Probably the same as I observed but it's easier to renice the spinners.

I see the 0/4 split for brief moments if I renice the spinners to 10 
instead of -10 but the idle balancer quickly restores it to 2/2.




ok, I see. You have probably achieved a similar effect with the
spinners being reniced to 10 (but here both X and top gain
additional weight wrt the load balancing).


I'm playing with some jitter experiments at the moment.  The amount of
jitter needs to be small (a few tenths of a second) as the
synchronization (if it's happening) is happening at the seconds level as
the intervals for top and gkrellm will be in the 1 to 5 second range (I
guess -- I haven't checked) and the load balancing is every 60 seconds.


Hum.. the every 60 seconds part puzzles me quite a bit. Looking at
the run_rebalance_domain(), I'd say that it's normally overwritten by
the following code

  if (time_after(next_balance, sd-last_balance + interval))
   next_balance = sd-last_balance + interval;

the interval seems to be *normally* shorter than 60*HZ (according
to the default params in topology.h).. moreover, in case of the CFS

   if (interval  HZ*NR_CPUS/10)
   interval = HZ*NR_CPUS/10;

so it can't be  0.2 HZ in your case (== once in 200 ms at max with
HZ=1000).. am I missing something? TIA


No, I did.

But it's all academic as my synchronization theory is now dead -- see 
separate e-mail.


Peter
--
Peter Williams   [EMAIL PROTECTED]

Learning, n. The kind of ignorance distinguishing the studious.
 -- Ambrose Bierce
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] CFS scheduler, -v12

2007-05-22 Thread Peter Williams

Peter Williams wrote:

Peter Williams wrote:

Dmitry Adamushko wrote:

On 18/05/07, Peter Williams [EMAIL PROTECTED] wrote:
[...]

One thing that might work is to jitter the load balancing interval a
bit.  The reason I say this is that one of the characteristics of top
and gkrellm is that they run at a more or less constant interval (and,
in this case, X would also be following this pattern as it's doing
screen updates for top and gkrellm) and this means that it's possible
for the load balancing interval to synchronize with their intervals
which in turn causes the observed problem.


Hum.. I guess, a 0/4 scenario wouldn't fit well in this explanation..


No, and I haven't seen one.


all 4 spinners tend to be on CPU0 (and as I understand each gets
~25% approx.?), so there must be plenty of moments for
*idle_balance()* to be called on CPU1 - as gkrellm, top and X consume
together just a few % of CPU. Hence, we should not be that dependent
on the load balancing interval here..


The split that I see is 3/1 and neither CPU seems to be favoured with 
respect to getting the majority.  However, top, gkrellm and X seem to 
be always on the CPU with the single spinner.  The CPU% reported by 
top is approx. 33%, 33%, 33% and 100% for the spinners.


If I renice the spinners to -10 (so that there load weights dominate 
the run queue load calculations) the problem goes away and the spinner 
to CPU allocation is 2/2 and top reports them all getting approx. 50% 
each.


For no good reason other than curiosity, I tried a variation of this 
experiment where I reniced the spinners to 10 instead of -10 and, to my 
surprise, they were allocated 2/2 to the CPUs on average.  I say on 
average because the allocations were a little more volatile and 
occasionally 0/4 splits would occur but these would last for less than 
one top cycle before the 2/2 was re-established.  The quickness of these 
recoveries would indicate that it was most likely the idle balance 
mechanism that restored the balance.


This may point the finger at the tick based load balance mechanism being 
too conservative


The relevant code, find_busiest_group() and find_busiest_queue(), has a 
lot of code that is ifdefed by CONFIG_SCHED_MC and CONFIG_SCHED_SMT and, 
as these macros were defined in the kernels I was testing with, I built 
a kernel with these macros undefined and reran my tests.  The 
problems/anomalies were not present in 10 consecutive tests on this new 
kernel.  Even better on the few occasions that a 3/1 split did occur it 
was quickly corrected to 2/2 and top was reporting approx 49% of CPU for 
all spinners throughout each of the ten tests.


So all that is required now is an analysis of the code inside the ifdefs 
to see why it is causing a problem.


in when it decides whether tasks need to be moved.  In 
the case where the spinners are at nice == 0, the idle balance mechanism 
never comes into play as the 0/4 split is never seen so only the tick 
based mechanism is in force in this case and this is where the anomalies 
are seen.


This tick-rebalance-mechanism-only situation is also true for the nice 
== -10 case but in this case the high load weights of the spinners 
overcome the tick based load balancing mechanism's conservatism e.g. 
the difference in queue loads for a 1/3 split in this case is 
equivalent to the difference that would be generated by an imbalance of 
about 18 nice == 0 spinners i.e. too big to be ignored.
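
Just to show where that figure comes from, here's a back-of-the-envelope 
check.  It assumes per-task weights in the style of CFS's prio_to_weight[] 
table (nice 0 == 1024, nice -10 roughly 9548, nice +10 roughly 110); the 
exact values may differ between patch versions, so treat the numbers as 
illustrative only:

    #include <stdio.h>

    int main(void)
    {
            /* assumed per-task load weights, see note above */
            int w_m10 = 9548, w_0 = 1024, w_p10 = 110;

            /* 1/3 split of four nice == -10 spinners: queue load difference */
            printf("1/3 split imbalance: %d (~%.1f nice-0 tasks)\n",
                   2 * w_m10, 2.0 * w_m10 / w_0);

            /* four nice == +10 spinners versus one nice == 0 task */
            printf("4 x nice +10: %d (~%.0f%% of one nice-0 task)\n",
                   4 * w_p10, 400.0 * w_p10 / w_0);
            return 0;
    }

That prints roughly 18.7 nice-0 tasks for the 1/3 split (the "about 18" 
above) and about 43% for four nice == 10 spinners against a single nice == 0 
task, which is also where the "about 40%" figure quoted elsewhere in this 
thread comes from.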


The evidence seems to indicate that IF a rebalance operation gets 
initiated then the right amount of load will get moved.


This new evidence weakens (but does not totally destroy) my 
synchronization (a.k.a. conspiracy) theory.


My synchronization theory is now dead.

Peter
--
Peter Williams   [EMAIL PROTECTED]

Learning, n. The kind of ignorance distinguishing the studious.
 -- Ambrose Bierce
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] CFS scheduler, -v12

2007-05-21 Thread Peter Williams

Peter Williams wrote:

Dmitry Adamushko wrote:

On 18/05/07, Peter Williams <[EMAIL PROTECTED]> wrote:
[...]

One thing that might work is to jitter the load balancing interval a
bit.  The reason I say this is that one of the characteristics of top
and gkrellm is that they run at a more or less constant interval (and,
in this case, X would also be following this pattern as it's doing
screen updates for top and gkrellm) and this means that it's possible
for the load balancing interval to synchronize with their intervals
which in turn causes the observed problem.


Hum.. I guess, a 0/4 scenario wouldn't fit well in this explanation..


No, and I haven't seen one.


all 4 spinners "tend" to be on CPU0 (and as I understand each gets
~25% approx.?), so there must be plenty of moments for
*idle_balance()* to be called on CPU1 - as gkrellm, top and X consume
together just a few % of CPU. Hence, we should not be that dependent
on the load balancing interval here..


The split that I see is 3/1 and neither CPU seems to be favoured with 
respect to getting the majority.  However, top, gkrellm and X seem to be 
always on the CPU with the single spinner.  The CPU% reported by top is 
approx. 33%, 33%, 33% and 100% for the spinners.


If I renice the spinners to -10 (so that their load weights dominate the 
run queue load calculations) the problem goes away and the spinner to 
CPU allocation is 2/2 and top reports them all getting approx. 50% each.


For no good reason other than curiosity, I tried a variation of this 
experiment where I reniced the spinners to 10 instead of -10 and, to my 
surprise, they were allocated 2/2 to the CPUs on average.  I say on 
average because the allocations were a little more volatile and 
occasionally 0/4 splits would occur but these would last for less than 
one top cycle before the 2/2 was re-established.  The quickness of these 
recoveries would indicate that it was most likely the idle balance 
mechanism that restored the balance.


This may point the finger at the tick based load balance mechanism being 
too conservative in when it decides whether tasks need to be moved.  In 
the case where the spinners are at nice == 0, the idle balance mechanism 
never comes into play as the 0/4 split is never seen so only the tick 
based mechanism is in force in this case and this is where the anomalies 
are seen.


This tick-rebalance-mechanism-only situation is also true for the nice 
== -10 case but in this case the high load weights of the spinners 
overcome the tick based load balancing mechanism's conservatism e.g. 
the difference in queue loads for a 1/3 split in this case is 
equivalent to the difference that would be generated by an imbalance of 
about 18 nice == 0 spinners i.e. too big to be ignored.


The evidence seems to indicate that IF a rebalance operation gets 
initiated then the right amount of load will get moved.


This new evidence weakens (but does not totally destroy) my 
synchronization (a.k.a. conspiracy) theory.


Peter
PS As the total load weight for 4 nice == 10 tasks is only about 40% of 
the load weight of a single nice == 0 task, the occasional 0/4 split in 
the spinners at nice == 10 case is not unexpected as it would be the 
desirable allocation if there were exactly one other running task at 
nice == 0.

--
Peter Williams   [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] CFS scheduler, -v12

2007-05-21 Thread Peter Williams

Dmitry Adamushko wrote:

On 18/05/07, Peter Williams <[EMAIL PROTECTED]> wrote:
[...]

One thing that might work is to jitter the load balancing interval a
bit.  The reason I say this is that one of the characteristics of top
and gkrellm is that they run at a more or less constant interval (and,
in this case, X would also be following this pattern as it's doing
screen updates for top and gkrellm) and this means that it's possible
for the load balancing interval to synchronize with their intervals
which in turn causes the observed problem.


Hum.. I guess, a 0/4 scenario wouldn't fit well in this explanation..


No, and I haven't seen one.


all 4 spinners "tend" to be on CPU0 (and as I understand each gets
~25% approx.?), so there must be plenty of moments for
*idle_balance()* to be called on CPU1 - as gkrellm, top and X consume
together just a few % of CPU. Hence, we should not be that dependent
on the load balancing interval here..


The split that I see is 3/1 and neither CPU seems to be favoured with 
respect to getting the majority.  However, top, gkrellm and X seem to be 
always on the CPU with the single spinner.  The CPU% reported by top is 
approx. 33%, 33%, 33% and 100% for the spinners.


If I renice the spinners to -10 (so that their load weights dominate the 
run queue load calculations) the problem goes away and the spinner to 
CPU allocation is 2/2 and top reports them all getting approx. 50% each.


It's also worth noting that I've had tests where the allocation started 
out 2/2 and the system changed it to 3/1 where it stabilized.  So it's 
not just a case of bad luck with the initial CPU allocation when the 
tasks start and the load balancing failing to fix it (which was one of 
my earlier theories).




(unlikely conspiracy theory)


It's not a conspiracy.  It's just dumb luck. :-)


- idle_balance() and load_balance() (the
latter is dependent on the load balancing interval which can be in
sync. with top/gkrellm activities as you suggest) move always either
top or gkrellm between themselves.. esp. if X is reniced (so it gets
additional "weight") and happens to be active (on CPU1) when
load_balance() (kicked from scheduler_tick()) runs..

p.s. it's mainly theoretical speculation.. I recently started looking
at the load-balancing code (unfortunately, don't have an SMP machine
which I can upgrade to the recent kernel) and so far for me it's
mainly about making sure I see things sanely.


I'm playing with some jitter experiments at the moment.  The amount of 
jitter needs to be small (a few tenths of a second) as the 
synchronization (if it's happening) is happening at the seconds level as 
the intervals for top and gkrellm will be in the 1 to 5 second range (I 
guess -- I haven't checked) and the load balancing is every 60 seconds.


Peter
--
Peter Williams   [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] CFS scheduler, -v12

2007-05-19 Thread Peter Williams

Dmitry Adamushko wrote:

On 18/05/07, Peter Williams <[EMAIL PROTECTED]> wrote:

[...]
One thing that might work is to jitter the load balancing interval a
bit.  The reason I say this is that one of the characteristics of top
and gkrellm is that they run at a more or less constant interval (and,
in this case, X would also be following this pattern as it's doing
screen updates for top and gkrellm) and this means that it's possible
for the load balancing interval to synchronize with their intervals
which in turn causes the observed problem.  A jittered load balancing
interval should break the synchronization.  This would certainly be
simpler than trying to change the move_task() logic for selecting which
tasks to move.


Just another (quick) idea. Say, the load balancer would consider not
only p->load_weight but also something like Tw(task) =
(time_spent_on_runqueue / total_task's_runtime) * some_scale_constant
as an additional "load" component (OTOH, when a task starts, it takes
some time for this parameter to become meaningful). I guess, it could
address the scenarios you have described (but maybe break some others
as well :) ...
Any hints on why it's stupid?


Well that is the kind of thing I was hoping to avoid for the reasons of 
complexity.  I think that the actual implementation would be more 
complex than it sounds and possibly require multiple runs down the list 
of moveable tasks which would be bad for overhead.


Basically, I don't think that the problem is serious enough to warrant a 
complex solution.  But I may be wrong about how complex the 
implementation would be.
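
For what it's worth, the arithmetic Dmitry describes would look something 
like the sketch below.  This is only an illustration of the formula -- the 
two inputs stand for whatever "time spent runnable but waiting" and "total 
runtime" figures the scheduler could supply; they are not real CFS field 
names:

    /*
     * Hypothetical extra load component: a task that spends as much
     * time waiting on the runqueue as it spends running contributes
     * roughly one extra nice-0 weight (1024) to the apparent load.
     */
    static inline unsigned long extra_wait_load(unsigned long long waited_ns,
                                                unsigned long long ran_ns)
    {
            if (!ran_ns)
                    return 0;       /* too new to say anything useful */

            return (unsigned long)((waited_ns * 1024) / ran_ns);
    }

The formula itself is cheap; the cost is in collecting those inputs and 
re-walking the list of movable tasks with them on every balance attempt, 
which is the overhead concern above.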


Peter
--
Peter Williams   [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] CFS scheduler, -v12

2007-05-18 Thread Peter Williams

Peter Williams wrote:

Ingo Molnar wrote:

* Peter Williams <[EMAIL PROTECTED]> wrote:

I've now done this test on a number of kernels: 2.6.21 and 2.6.22-rc1 
with and without CFS; and the problem is always present.  It's not 
"nice" related as the all four tasks are run at nice == 0.


could you try -v13 and did this behavior get better in any way?


It's still there but I've got a theory about what the problems is that 
is supported by some other tests I've done.


What I'd forgotten is that I had gkrellm running as well as top (to 
observe which CPU tasks were on) at the same time as the spinners were 
running.  This meant that between them top, gkrellm and X were using 
about 2% of the CPU -- not much but enough to make it possible that at 
least one of them was running when the load balancer was trying to do 
its thing.


This raises two possibilities: 1. the system looked balanced and 2. the 
system didn't look balanced but one of  top, gkrellm or X was moved 
instead of one of the spinners.


If it's 1 then there's not much we can do about it except say that it 
only happens in these strange circumstances.  If it's 2 then we may have 
to modify the way move_tasks() selects which tasks to move (if we think 
that the circumstances warrant it -- I'm not sure that this is the case).


To examine these possibilities I tried two variations of the test.

a. run the spinners at nice == -10 instead of nice == 0.  When I did 
this the load balancing was perfect on 10 consecutive runs which 
according to my calculations makes it 99.997% certain that this 
didn't happen by chance.  This supports theory 2 above.


b. run the tests without gkrellm running but use nice == 0 for the 
spinners.  When I did this the load balancing was mostly perfect but was 
quite volatile (switching between a 2/2 and 1/3 allocation of spinners 
to CPUs) but the %CPU allocation was quite good with the spinners all 
getting approximately 49% of a CPU each.  This also supports theory 2 
above and gives weak support to theory 1 above.


This leaves the question of what to do about it.  Given that most CPU 
intensive tasks on a real system probably only run for a few tens of 
milliseconds it probably won't matter much on a real system except that 
a malicious user could exploit it to disrupt a system.


So my opinion is that we probably do need to do something about it but 
that it's not urgent.


One thing that might work is to jitter the load balancing interval a 
bit.  The reason I say this is that one of the characteristics of top 
and gkrellm is that they run at a more or less constant interval (and, 
in this case, X would also be following this pattern as it's doing 
screen updates for top and gkrellm) and this means that it's possible 
for the load balancing interval to synchronize with their intervals 
which in turn causes the observed problem.  A jittered load balancing 
interval should break the synchronization.  This would certainly be 
simpler than trying to change the move_task() logic for selecting which 
tasks to move.
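
Something along these lines is what I have in mind -- a rough sketch only, 
not a tested patch, and it assumes that whatever interval the tick-based 
balancer consults (sd->balance_interval or similar, in jiffies) gets passed 
through it:

    /*
     * Spread the effective balancing interval over roughly +/- 12% so
     * it can't stay phase-locked with periodic tasks like top and
     * gkrellm.  Sketch only; kernel context (jiffies) assumed.
     */
    static inline unsigned long jittered_interval(unsigned long interval)
    {
            unsigned long span = (interval >> 2) ? (interval >> 2) : 1;
            /* cheap pseudo-random offset in [0, span) */
            unsigned long jitter = (jiffies ^ (jiffies >> 7)) % span;

            return interval - (interval >> 3) + jitter;
    }

A few tenths of a second of spread should be plenty given that top and 
gkrellm run on cycles of a second or more.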


I should have added that the reason I think this mooted synchronization 
is the cause of the problem is that I can think of no other way that 
tasks with such low activity (2% between the 3 of them) could cause the 
imbalance of the spinner to CPU allocation to be so persistent.




What do you think?



Peter
--
Peter Williams   [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] CFS scheduler, -v12

2007-05-18 Thread Peter Williams

Ingo Molnar wrote:

* Peter Williams <[EMAIL PROTECTED]> wrote:

I've now done this test on a number of kernels: 2.6.21 and 2.6.22-rc1 
with and without CFS; and the problem is always present.  It's not 
"nice" related as the all four tasks are run at nice == 0.


could you try -v13 and did this behavior get better in any way?


It's still there but I've got a theory about what the problems is that 
is supported by some other tests I've done.


What I'd forgotten is that I had gkrellm running as well as top (to 
observe which CPU tasks were on) at the same time as the spinners were 
running.  This meant that between them top, gkrellm and X were using 
about 2% of the CPU -- not much but enough to make it possible that at 
least one of them was running when the load balancer was trying to do 
its thing.


This raises two possibilities: 1. the system looked balanced and 2. the 
system didn't look balanced but one of  top, gkrellm or X was moved 
instead of one of the spinners.


If it's 1 then there's not much we can do about it except say that it 
only happens in these strange circumstances.  If it's 2 then we may have 
to modify the way move_tasks() selects which tasks to move (if we think 
that the circumstances warrant it -- I'm not sure that this is the case).


To examine these possibilities I tried two variations of the test.

a. run the spinners at nice == -10 instead of nice == 0.  When I did 
this the load balancing was perfect on 10 consecutive runs which 
according to my calculations makes it 99.997% certain that this 
didn't happen by chance.  This supports theory 2 above.


b. run the tests without gkrellm running but use nice == 0 for the 
spinners.  When I did this the load balancing was mostly perfect but was 
quite volatile (switching between a 2/2 and 1/3 allocation of spinners 
to CPUs) but the %CPU allocation was quite good with the spinners all 
getting approximately 49% of a CPU each.  This also supports theory 2 
above and gives weak support to theory 1 above.


This leaves the question of what to do about it.  Given that most CPU 
intensive tasks on a real system probably only run for a few tens of 
milliseconds it probably won't matter much on a real system except that 
a malicious user could exploit it to disrupt a system.


So my opinion is that we probably do need to do something about it but 
that it's not urgent.


One thing that might work is to jitter the load balancing interval a 
bit.  The reason I say this is that one of the characteristics of top 
and gkrellm is that they run at a more or less constant interval (and, 
in this case, X would also be following this pattern as it's doing 
screen updates for top and gkrellm) and this means that it's possible 
for the load balancing interval to synchronize with their intervals 
which in turn causes the observed problem.  A jittered load balancing 
interval should break the synchronization.  This would certainly be 
simpler than trying to change the move_task() logic for selecting which 
tasks to move.


What do you think?
Peter
--
Peter Williams   [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] CFS scheduler, -v12

2007-05-17 Thread Peter Williams

Ingo Molnar wrote:

* Peter Williams <[EMAIL PROTECTED]> wrote:

Load balancing appears to be badly broken in this version.  When I 
started 4 hard spinners on my 2 CPU machine one ended up on one CPU 
and the other 3 on the other CPU and they stayed there.


could you try to debug this a bit more?


I've now done this test on a number of kernels: 2.6.21 and 2.6.22-rc1 
with and without CFS; and the problem is always present.  It's not 
"nice" related as the all four tasks are run at nice == 0.


It's possible that this problem has been in the kernel for a while 
without being noticed as, even with totally random allocation of tasks to 
CPUs without any attempt to balance, there's quite a high probability 
of the desirable 2/2 split occurring.  So one needs to repeat the test 
several times to have reasonable assurance that the problem is not 
present.  I.e. this has the characteristics of an intermittent bug with 
all the debugging problems that introduces.


The probabilities for the 3 split possibilities for random allocation are:

2/2 (the desired outcome) is 3/8 likely,
1/3 is 4/8 likely, and
0/4 is 1/8 likely.
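
Those fractions are easy to sanity check by brute force (nothing kernel 
related, just enumerating the 2^4 equally likely ways 4 tasks can land on 
2 CPUs):

    #include <stdio.h>

    int main(void)
    {
            int counts[5] = { 0 };  /* counts[k]: assignments with k tasks on CPU0 */
            int mask;

            for (mask = 0; mask < 16; mask++) {
                    int k, bit;

                    for (k = 0, bit = 0; bit < 4; bit++)
                            k += (mask >> bit) & 1;
                    counts[k]++;
            }

            printf("2/2: %d/16\n", counts[2]);              /* 6/16 == 3/8 */
            printf("1/3: %d/16\n", counts[1] + counts[3]);  /* 8/16 == 4/8 */
            printf("0/4: %d/16\n", counts[0] + counts[4]);  /* 2/16 == 1/8 */
            return 0;
    }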

I'm pretty sure that this problem wasn't present when smpnice went into 
the kernel which is the last time I did a lot of load balance testing.


Peter
--
Peter Williams   [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] CFS scheduler, -v12

2007-05-16 Thread Peter Williams

Ingo Molnar wrote:

* Peter Williams <[EMAIL PROTECTED]> wrote:

As usual, any sort of feedback, bugreport, fix and suggestion is more 
than welcome,
Load balancing appears to be badly broken in this version.  When I 
started 4 hard spinners on my 2 CPU machine one ended up on one CPU 
and the other 3 on the other CPU and they stayed there.


hm, i cannot reproduce this on 4 different SMP boxen, trying various 
combinations of SCHED_SMT/MC


You may need to try more than once.  Testing load balancing can be a 
pain as there's always a possibility you'll get a good result just by 
chance.  I.e. you need a bunch of good results to say it's OK but only 
one bad result to say it's broken, which is what makes testing load 
balancing such a pain.


and other .config options that might make a 
difference to balancing. Could you send me your .config?


Sent separately.

Peter
--
Peter Williams   [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] CFS scheduler, -v12

2007-05-15 Thread Peter Williams

Ingo Molnar wrote:

i'm pleased to announce release -v12 of the CFS scheduler patchset.

The CFS patch against v2.6.22-rc1, v2.6.21.1 or v2.6.20.10 can be 
downloaded from the usual place:
  
http://people.redhat.com/mingo/cfs-scheduler/


-v12 fixes the '3D bug' that caused trivial latencies in 3D games: it 
turns out that the problem was not resulting out of any core quality of 
CFS, it was caused by 3D userspace growing dependent on the current 
inefficiency of the vanilla scheduler's sys_sched_yield() 
implementation, and CFS's "make yield work well" changes broke it.


Even a simple 3D app like glxgears does a sys_sched_yield() for every 
frame it generates (!) on certain 3D cards, which in essence punishes 
any scheduler that implements sys_sched_yield() in a sane manner. This 
interaction of CFS's yield implementation with this user-space bug could 
be the main reason why some testers reported SD to be handling 3D games 
better than CFS. (SD uses a yield implementation similar to the vanilla 
scheduler.)


So i've added a yield workaround to -v12, which makes it work similar to 
how the vanilla scheduler and SD does it. (Xorg has been notified and 
this bug should be fixed there too. This took some time to debug because 
the 3D driver i'm using for testing does not use sys_sched_yield().) The 
workaround is activated by default so -v12 should work 'out of the box'.


Mike Galbraith has fixed a bug related to nice levels - the fix should 
make negative nice levels more potent again.


Changes since -v10:

 - nice level calculation fixes (Mike Galbraith)

 - load-balancing improvements (this should fix the SMP performance 
   problem reported by Michael Gerdau)


 - remove the sched_sleep_history_max tunable.

 - more debugging fields.

 - various cleanups, fixlets and code reorganization

As usual, any sort of feedback, bugreport, fix and suggestion is more 
than welcome,


Load balancing appears to be badly broken in this version.  When I 
started 4 hard spinners on my 2 CPU machine one ended up on one CPU and 
the other 3 on the other CPU and they stayed there.
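
For anyone wanting to reproduce this, the "hard spinner" is nothing more 
elaborate than the following (my actual test program may differ in detail), 
started four times and watched with top:

    /* spin.c -- burn CPU forever */
    int main(void)
    {
            for (;;)
                    ;
            return 0;
    }

e.g.  gcc -O0 -o spin spin.c; for i in 1 2 3 4; do ./spin & done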


Peter
--
Peter Williams   [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] CFS scheduler, -v8

2007-05-08 Thread Peter Williams

Esben Nielsen wrote:



On Tue, 8 May 2007, Peter Williams wrote:


Esben Nielsen wrote:



On Sun, 6 May 2007, Linus Torvalds wrote:

On Sun, 6 May 2007, Ingo Molnar wrote:

* Linus Torvalds [EMAIL PROTECTED] wrote:

So the _only_ valid way to handle timers is to
 - either not allow wrapping at all (in which case "unsigned" is better,
   since it is bigger)
 - or use wrapping explicitly, and use unsigned arithmetic (which is
   well-defined in C) and do something like "(long)(a-b) > 0".

hm, there is a corner-case in CFS where a fix like this is necessary.

CFS uses 64-bit values for almost everything, and the majority of values
are of 'relative' nature with no danger of overflow. (They are signed
because they are relative values that center around zero and can be
negative or positive.)

Well, I'd like to just worry about that for a while.

You say there is "no danger of overflow", and I mostly agree that once
we're talking about 64-bit values, the overflow issue simply doesn't
exist, and furthermore the difference between 63 and 64 bits is not really
relevant, so there's no major reason to actively avoid signed entries.

So in that sense, it all sounds perfectly sane. And I'm definitely not
sure your "292 years after bootup" worry is really worth even considering.



 I would hate to tell mission control for Mankind's first mission to
 another
 star to reboot every 200 years because there is no need to worry about
 it.

 As a matter of principle an OS should never need a reboot (with 
exception

 for upgrading). If you say you have to reboot every 200 years, why not
 every 100? Every 50?  Every 45 days (you know what I am 
referring to

 :-) ?


There's always going to be an upper limit on the representation of time.
 At least until we figure out how to implement infinity properly.


Well you need infinite memory for that :-)
But that is my point: Why go into the problem of storing absolute time 
when you can use relative time?


I'd reverse that and say Why go to the bother of using relative time 
when you can use absolute time?.  Absolute time being time since boot, 
of course.









When we're really so well off that we expect the hardware and software
stack to be stable over a hundred years, I'd start to think about issues
like that, in the meantime, to me worrying about those kinds of issues
just means that you're worrying about the wrong things.

BUT.

There's a fundamental reason relative timestamps are difficult and almost
always have overflow issues: the "long long in the future" case as an
approximation of "infinite timeout" is almost always relevant.

So rather than worry about the system staying up 292 years, I'd worry
about whether people pass in big numbers (like some MAX_S64 approximation)
as an approximation for "infinite", and once you have things like that,
the "64 bits never overflows" argument is totally bogus.

There's a damn good reason for using only *absolute* time. The whole
"signed values of relative time" may _sound_ good, but it really sucks in
subtle and horrible ways!


I think you are wrong here. The only place you need absolute time is
for the clock (CLOCK_REALTIME). You waste CPU using a 64 bit
representation when you could have used a 32 bit. With a 32 bit
implementation you are forced to handle the corner cases with wrap
around and too big arguments up front. With a 64 bit you hide those problems.


As does the other method.  A 32 bit signed offset with a 32 bit base 
is just a crude version of 64 bit absolute time.


64 bit is also relative - just over a much longer period.


Yes, relative to boot.

32 bit signed offset is relative - and you know it. But with 64 people 
think it is absolute and put in large values as Linus said above.


What people?  Who gets to feed times into the scheduler?  Isn't it just 
using the time as determined by the system?


With 
32 bit future developers will know it is relative and code for it. And 
they will get their corner cases tested, because the code soon will run 
into those corners.






I think CFS would be best off using a 32 bit timer counting in micro
seconds. That would wrap around in 72 minutes. But as the timers are
relative you will never be able to specify a timer larger than 36
minutes in the future. But 36 minutes is ridiculously long for a
scheduler and a simple test limiting time values to that value would not
break anything.


Except if you're measuring sleep times.  I think that you'll find lots 
of tasks sleep for more than 72 minutes.
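
For reference, the wrap-safe comparison Linus describes above, spelled out 
for a 32 bit microsecond counter -- it's the same trick the kernel's 
time_after() macro uses.  Note that it only gives the right answer while 
the true difference is under half the wrap period (about 36 minutes here), 
which is exactly why sleeps longer than that are awkward:

    #include <stdio.h>

    /* true iff timestamp a is later than b, tolerating wrap-around */
    static int after(unsigned int a, unsigned int b)
    {
            return (int)(a - b) > 0;        /* unsigned subtract, signed test */
    }

    int main(void)
    {
            unsigned int t = 0xfffffff0u;           /* just before the wrap */

            printf("%d\n", after(t + 32, t));       /* 1: 32 usec later, across the wrap */
            printf("%d\n", after(t, t + 32));       /* 0 */
            return 0;
    }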


I don't think those large values will be relevant. You can easily cut 
off sleep times at 30 min or even 1 min.


The aim is to make the code as simple as possible, not add this kind of 
rubbish, and 1 minute would be far too low.


But you need to detect that the 
task has indeed been sleeping 2^32+1 usec and not 1 usec. You can't do

Re: [patch] CFS scheduler, -v8

2007-05-07 Thread Peter Williams

Esben Nielsen wrote:



On Sun, 6 May 2007, Linus Torvalds wrote:




On Sun, 6 May 2007, Ingo Molnar wrote:


* Linus Torvalds <[EMAIL PROTECTED]> wrote:


So the _only_ valid way to handle timers is to
 - either not allow wrapping at all (in which case "unsigned" is better,
   since it is bigger)
 - or use wrapping explicitly, and use unsigned arithmetic (which is
   well-defined in C) and do something like "(long)(a-b) > 0".


hm, there is a corner-case in CFS where a fix like this is necessary.

CFS uses 64-bit values for almost everything, and the majority of values
are of 'relative' nature with no danger of overflow. (They are signed
because they are relative values that center around zero and can be
negative or positive.)


Well, I'd like to just worry about that for a while.

You say there is "no danger of overflow", and I mostly agree that once
we're talking about 64-bit values, the overflow issue simply doesn't
exist, and furthermore the difference between 63 and 64 bits is not really
relevant, so there's no major reason to actively avoid signed entries.

So in that sense, it all sounds perfectly sane. And I'm definitely not
sure your "292 years after bootup" worry is really worth even 
considering.




I would hate to tell mission control for Mankind's first mission to another
star to reboot every 200 years because "there is no need to worry about 
it."


As a matter of principle an OS should never need a reboot (with 
exception for upgrading). If you say you have to reboot every 200 years, 
why not every 100? Every 50?  Every 45 days (you know what I am 
referring to :-) ?


There's always going to be an upper limit on the representation of time. 
 At least until we figure out how to implement infinity properly.





When we're really so well off that we expect the hardware and software
stack to be stable over a hundred years, I'd start to think about issues
like that, in the meantime, to me worrying about those kinds of issues
just means that you're worrying about the wrong things.

BUT.

There's a fundamental reason relative timestamps are difficult and almost
always have overflow issues: the "long long in the future" case as an
approximation of "infinite timeout" is almost always relevant.

So rather than worry about the system staying up 292 years, I'd worry
about whether people pass in big numbers (like some MAX_S64 approximation)
as an approximation for "infinite", and once you have things like that,
the "64 bits never overflows" argument is totally bogus.

There's a damn good reason for using only *absolute* time. The whole
"signed values of relative time" may _sound_ good, but it really sucks in
subtle and horrible ways!
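
A tiny sketch of the failure mode being described here (plain C with
made-up values, not CFS code): if a caller uses something like INT64_MAX
as a stand-in for "never", adding it to the current time wraps, and the
supposedly infinite deadline compares as long expired.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
        int64_t now      = 1234567890123LL;     /* some current time in ns */
        int64_t infinite = INT64_MAX;           /* "never expires" */

        /* Do the addition in unsigned space so the wrap is well defined. */
        int64_t deadline = (int64_t)((uint64_t)now + (uint64_t)infinite);

        /* The "infinite" deadline now looks like it is already in the past. */
        printf("deadline=%lld expired=%d\n",
               (long long)deadline, deadline <= now);
        return 0;
}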



I think you are wrong here. The only place you need absolute time is for 
the clock (CLOCK_REALTIME). You waste CPU using a 64 bit representation 
when you could have used a 32 bit one. With a 32 bit implementation you 
are forced to handle the corner cases with wrap around and too-big 
arguments up front. With a 64 bit one you hide those problems.


As does the other method.  A 32 bit signed offset with a 32 bit base is 
just a crude version of 64 bit absolute time.
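
Roughly what that amounts to (a sketch with made-up names, assuming the
base is re-anchored often enough that offsets always fit in 32 bits): to
give the 32-bit offsets any absolute meaning you end up carrying a wide
base around anyway, so the pair just reconstructs a 64-bit absolute
timestamp.

#include <stdint.h>
#include <stdio.h>

static uint64_t base_usec;      /* periodically re-anchored from a wide clock */

/* An event stored as a 32-bit signed offset from the current base. */
static uint64_t absolute_usec(int32_t offset)
{
        return base_usec + (int64_t)offset;
}

int main(void)
{
        base_usec = 5000000000ULL;      /* pretend base: roughly 83 minutes */
        printf("%llu\n", (unsigned long long)absolute_usec(-1500));
        return 0;
}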




I think CFS would be best off using a 32 bit timer counting in
microseconds. That would wrap around in 72 minutes. But as the timers are
relative you will never be able to specify a timer larger than 36 minutes
in the future. But 36 minutes is ridiculously long for a scheduler and a
simple test limiting time values to that value would not break anything.


Except if you're measuring sleep times.  I think that you'll find lots 
of tasks sleep for more than 72 minutes.


Peter
--
Peter Williams   [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[ANNOUNCE][RFC] PlugSched-6.5.1 for 2.6.21

2007-05-01 Thread Peter Williams
The main change in this version is to fix bugs introduced to SPA 
schedulers during modifications to handle a recent change to the 
scheduler driver interface to take account of recent changes to the load 
balancing code.


This patch also includes a patch to sis900 code to enable it to boot on 
my system for testing (patch supplied by Neil Horman).


A patch for 2.6.21 is available at:

<http://downloads.sourceforge.net/cpuse/plugsched-6.5.1-for-2.6.21.patch>

and a quilt/gquilt patch series is available at:

<http://downloads.sourceforge.net/cpuse/plugsched-6.5.1-for-2.6.21.patch-series.tar.gz>

Very Brief Documentation:

You can select a default scheduler at kernel build time.  If you wish to
boot with a scheduler other than the default it can be selected at boot
time by adding:

cpusched=<scheduler>

to the boot command line where <scheduler> is one of: ingosched,
ingo_ll, nicksched, staircase, spa_no_frills, spa_ws, spa_svr, spa_ebs
or zaphod.  If you don't change the default when you build the kernel
the default scheduler will be ingosched (which is the normal scheduler).

The scheduler in force on a running system can be determined by the
contents of:

/proc/scheduler

Control parameters for the scheduler can be read/set via files in:

/sys/cpusched/<scheduler>/

Peter
--
Peter Williams   [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Linux-2.6.21 hangs during post boot initialization phase

2007-04-27 Thread Peter Williams

Neil Horman wrote:

On Sat, Apr 28, 2007 at 12:28:28AM +1000, Peter Williams wrote:

Neil Horman wrote:

On Fri, Apr 27, 2007 at 04:05:11PM +1000, Peter Williams wrote:


Damn, This is what happens when I try to do things too quickly.  I missed one
spot in my last patch where I replaced skb with rx_skb.  Its not critical, but
it should improve sis900 performance by quite a bit.  This applies on top of the
last two patches.  Sorry about that.

Thanks & Regards
Neil

Signed-off-by: Neil Horman <[EMAIL PROTECTED]>


 sis900.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

 
diff --git a/drivers/net/sis900.c b/drivers/net/sis900.c

index 7e44939..db59dce 100644
--- a/drivers/net/sis900.c
+++ b/drivers/net/sis900.c
@@ -1790,7 +1790,7 @@ static int sis900_rx(struct net_device *net_dev)
/* give the socket buffer to upper layers */
rx_skb = sis_priv->rx_skbuff[entry];
skb_put(rx_skb, rx_size);
-   skb->protocol = eth_type_trans(rx_skb, net_dev);
+   rx_skb->protocol = eth_type_trans(rx_skb, net_dev);
netif_rx(rx_skb);
 
 			/* some network statistics */


My system also boots OK after I add this patch.  Can't tell whether it's 
improved the performance or not.


Peter
--
Peter Williams   [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Linux-2.6.21 hangs during post boot initialization phase

2007-04-27 Thread Peter Williams

Neil Horman wrote:

On Fri, Apr 27, 2007 at 04:05:11PM +1000, Peter Williams wrote:

Linus Torvalds wrote:

On Fri, 27 Apr 2007, Peter Williams wrote:
The 2.6.21 kernel is hanging during the post boot phase where various 
daemons are being started (not always the same daemon unfortunately).

This problem was not present in 2.6.21-rc7 and there is no oops or other
unusual output in the system log at the time the hang occurs.
Can you use "git bisect" to narrow it down a bit more? It's only 125 
commits, so bisecting even just three or four kernels will narrow it down 
to a handful.
As the changes became smaller, the builds became quicker :-) and after 7 
iterations we have:



author     Neil Horman <[EMAIL PROTECTED]>
           Fri, 20 Apr 2007 13:54:58 +0000 (09:54 -0400)
committer  Jeff Garzik <[EMAIL PROTECTED]>
           Tue, 24 Apr 2007 16:43:07 +0000 (12:43 -0400)
commit     b748d9e3b80dc7e6ce6bf7399f57964b99a4104c
tree       887909e1f735bb444ef0e3e370f34401fa6eee02
parent     d91c088b39e3c66d309938de858775bb90fd1ead
sis900: Allocate rx replacement buffer before rx operation

The sis900 driver appears to have a bug in which the receive routine
passes the skbuff holding the received frame to the network stack before
refilling the buffer in the rx ring.  If a new skbuff cannot be allocated,
the driver simply leaves a hole in the rx ring, which causes the driver to
stop receiving frames and become non-recoverable without an rmmod/insmod
according to reporters.  This patch reverses that order, attempting to
allocate a replacement buffer first, and receiving the new frame only if
one can be allocated.  If no skbuff can be allocated, the current skbuff
in the rx ring is recycled, dropping the current frame, but keeping the
NIC operational.

Signed-off-by: Neil Horman <[EMAIL PROTECTED]>
Signed-off-by: Jeff Garzik <[EMAIL PROTECTED]>
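
For reference, the ordering the commit message describes, reduced to a
self-contained user-space sketch (simplified, made-up types and names, not
the actual sis900 code): the replacement buffer is allocated first, and
only if that succeeds is the received buffer handed up; otherwise the old
buffer stays in the ring and the frame is dropped.

#include <stdlib.h>

struct buf { unsigned char data[1536]; };

struct rx_slot {
        struct buf *skb;        /* buffer currently owned by this ring slot */
};

/* Stand-ins for the allocator and the network stack hand-off. */
static struct buf *alloc_rx_buf(void)       { return malloc(sizeof(struct buf)); }
static void pass_up_to_stack(struct buf *b) { free(b); /* consumed upstream */ }

/*
 * Allocate the replacement first.  If that fails, recycle the old buffer
 * (drop the frame) so the ring never ends up with a hole and the NIC
 * keeps receiving.
 */
static void rx_one(struct rx_slot *slot)
{
        struct buf *replacement = alloc_rx_buf();

        if (!replacement)
                return;                 /* drop frame, keep old buffer in ring */

        pass_up_to_stack(slot->skb);    /* hand the received frame up */
        slot->skb = replacement;        /* refill the ring slot */
}

int main(void)
{
        struct rx_slot slot = { alloc_rx_buf() };

        rx_one(&slot);
        free(slot.skb);
        return 0;
}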

Peter
--
Peter Williams   [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce


This was reported to me last night, and I've posted a patch to fix it, its
available here:
http://marc.info/?l=linux-netdev&m=117761259222165&w=2

It applies on top of the previous patch, and should fix your problem.

Here's a copy of the patch

Thanks & Regards
Neil


diff --git a/drivers/net/sis900.c b/drivers/net/sis900.c
index a6a0f09..7e44939 100644
--- a/drivers/net/sis900.c
+++ b/drivers/net/sis900.c
@@ -1754,6 +1754,7 @@ static int sis900_rx(struct net_device *net_dev)
sis_priv->rx_ring[entry].cmdsts = RX_BUF_SIZE;
} else {
struct sk_buff * skb;
+   struct sk_buff * rx_skb;
 
 			pci_unmap_single(sis_priv->pci_dev,

sis_priv->rx_ring[entry].bufptr, RX_BUF_SIZE,
@@ -1787,10 +1788,10 @@ static int sis900_rx(struct net_device *net_dev)
}
 
 			/* give the socket buffer to upper layers */

-   skb = sis_priv->rx_skbuff[entry];
-   skb_put(skb, rx_size);
-   skb->protocol = eth_type_trans(skb, net_dev);
-   netif_rx(skb);
+   rx_skb = sis_priv->rx_skbuff[entry];
+   skb_put(rx_skb, rx_size);
+   skb->protocol = eth_type_trans(rx_skb, net_dev);
+   netif_rx(rx_skb);
 
 			/* some network statistics */

if ((rx_status & BCAST) == MCAST)


This patch fixes the problem for me.

Peter
--
Peter Williams   [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Linux-2.6.21 hangs during post boot initialization phase

2007-04-27 Thread Peter Williams

Linus Torvalds wrote:


On Fri, 27 Apr 2007, Peter Williams wrote:

The 2.6.21 kernel is hanging during the post boot phase where various daemons
are being started (not always the same daemon unfortunately).

This problem was not present in 2.6.21-rc7 and there is no oops or other
unusual output in the system log at the time the hang occurs.


Can you use "git bisect" to narrow it down a bit more? It's only 125 
commits, so bisecting even just three or four kernels will narrow it down 
to a handful.


As the changes became smaller, the builds became quicker :-) and after 7 
iterations we have:



author     Neil Horman <[EMAIL PROTECTED]>
           Fri, 20 Apr 2007 13:54:58 +0000 (09:54 -0400)
committer  Jeff Garzik <[EMAIL PROTECTED]>
           Tue, 24 Apr 2007 16:43:07 +0000 (12:43 -0400)
commit     b748d9e3b80dc7e6ce6bf7399f57964b99a4104c
tree       887909e1f735bb444ef0e3e370f34401fa6eee02
parent     d91c088b39e3c66d309938de858775bb90fd1ead
sis900: Allocate rx replacement buffer before rx operation

The sis900 driver appears to have a bug in which the receive routine
passes the skbuff holding the received frame to the network stack before
refilling the buffer in the rx ring.  If a new skbuff cannot be allocated,
the driver simply leaves a hole in the rx ring, which causes the driver to
stop receiving frames and become non-recoverable without an rmmod/insmod
according to reporters.  This patch reverses that order, attempting to
allocate a replacement buffer first, and receiving the new frame only if
one can be allocated.  If no skbuff can be allocated, the current skbuff
in the rx ring is recycled, dropping the current frame, but keeping the
NIC operational.

Signed-off-by: Neil Horman <[EMAIL PROTECTED]>
Signed-off-by: Jeff Garzik <[EMAIL PROTECTED]>

Peter
--
Peter Williams   [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Linux-2.6.21 hangs during post boot initialization phase

2007-04-26 Thread Peter Williams

Linus Torvalds wrote:


On Fri, 27 Apr 2007, Peter Williams wrote:

The 2.6.21 kernel is hanging during the post boot phase where various daemons
are being started (not always the same daemon unfortunately).

This problem was not present in 2.6.21-rc7 and there is no oops or other
unusual output in the system log at the time the hang occurs.


Can you use "git bisect" to narrow it down a bit more? It's only 125 
commits, so bisecting even just three or four kernels will narrow it down 
to a handful


Yes.  I'm just in the process of reading up on how to do the bisecting 
now.  Should have an answer in a few hours, I guess.


Peter
--
Peter Williams   [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-24 Thread Peter Williams

Rogan Dawes wrote:

Chris Friesen wrote:

Rogan Dawes wrote:

I guess my point was if we somehow get to an odd number of 
nanoseconds, we'd end up with rounding errors. I'm not sure if your 
algorithm will ever allow that.


And Ingo's point was that when it takes thousands of nanoseconds for a 
single context switch, an error of half a nanosecond is down in the 
noise.


Chris


My concern was that since Ingo said that this is a closed economy, with 
a fixed sum/total, if we lose a nanosecond here and there, eventually 
we'll lose them all.


Some folks have uptimes of multiple years.

Of course, I could (very likely!) be full of it! ;-)


And they won't be using any new scheduler on these computers anyhow as 
that would involve bringing the system down to install the new kernel. :-)


Peter
--
Peter Williams   [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-24 Thread Peter Williams

Arjan van de Ven wrote:
Within reason, it's not the number of clients that X has that causes its 
CPU bandwidth use to sky rocket and cause problems.  It's more to to 
with what type of clients they are.  Most GUIs (even ones that are 
constantly updating visual data (e.g. gkrellm -- I can open quite a 
large number of these without increasing X's CPU usage very much)) cause 
very little load on the X server.  The exceptions to this are the 



there are actually 2 and not just 1 "X server", and they are VERY VERY
different in behavior.

Case 1: Accelerated driver

If X talks to a decent enough card that it supports well with acceleration,
it will be very rare for X itself to spend any kind of significant
amount of CPU time, all the really heavy stuff is done in hardware, and
asynchronously at that. A bit of batching will greatly improve system
performance in this case.

Case 2: Unaccelerated VESA

Some drivers in X, especially the VESA and NV drivers (which are quite
common, vesa is used on all hardware without a special driver nowadays),
have no or not enough acceleration to matter for modern desktops. This
means the CPU is doing all the heavy lifting, in the X program. In this
case even a simple "move the window a bit" becomes quite a bit of a CPU
hog already.


Mine's a:

SiS 661/741/760 PCI/AGP or 662/761Gx PCIE VGA Display adapter according 
to X's display settings tool.  Which category does that fall into?


It's not a special adapter and is just the one that came with the 
motherboard. It doesn't use much CPU unless I grab a window and wiggle 
it all over the screen or do something like "ls -lR /" in an xterm.




The cases are fundamentally different in behavior, because in the first
case, X hardly consumes the time it would get in any scheme, while in
the second case X really is CPU bound and will happily consume any CPU
time it can get.


Which still doesn't justify an elaborate "points" sharing scheme. 
Whichever way you look at it, that's just another way of giving X more 
CPU bandwidth and there are simpler ways to give X more CPU if it needs 
it.  However, I think there's something seriously wrong if it needs the 
-19 nice that I've heard mentioned.  You might as well just run it as a 
real time process.


Peter
--
Peter Williams   [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-23 Thread Peter Williams

Linus Torvalds wrote:


On Mon, 23 Apr 2007, Ingo Molnar wrote:
The "give scheduler money" transaction can be both an "implicit 
transaction" (for example when writing to UNIX domain sockets or 
blocking on a pipe, etc.), or it could be an "explicit transaction": 
sched_yield_to(). This latter i've already implemented for CFS, but it's 
much less useful than the really significant implicit ones, the ones 
which will help X.


Yes. It would be wonderful to get it working automatically, so please say 
something about the implementation..


The "perfect" situation would be that when somebody goes to sleep, any 
extra points it had could be given to whoever it woke up last. Note that 
for something like X, it means that the points are 100% ephemeral: it gets 
points when a client sends it a request, but it would *lose* the points 
again when it sends the reply!


So it would only accumulate "scheduling points" while multiple clients 
are actively waiting for it, which actually sounds like exactly the right 
thing. However, I don't really see how to do it well, especially since the 
kernel cannot actually match up the client that gave some scheduling 
points to the reply that X sends back.


There are subtle semantics with these kinds of things: especially if the 
scheduling points are only awarded when a process goes to sleep, if X is 
busy and continues to use the CPU (for another client), it wouldn't give 
any scheduling points back to clients and they really do accumulate with 
the server. Which again sounds like it would be exactly the right thing 
(both in the sense that the server that runs more gets more points, but 
also in the sense that we *only* give points at actual scheduling events).


But how do you actually *give/track* points? A simple "last woken up by 
this process" thing that triggers when it goes to sleep? It might work, 
but on the other hand, especially with more complex things (and networking 
tends to be pretty complex) the actual wakeup may be done by a software 
irq. Do we just say "it ran within the context of X, so we assume X was 
the one that caused it?" It probably would work, but we've generally tried 
very hard to avoid accessing "current" from interrupt context, including 
bh's.
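
Reduced to a toy user-space sketch (made-up structures and names; this is
just the naive "donate to whoever you woke last" idea under discussion,
not anything CFS actually implements): each task remembers the last task
it woke, and on going to sleep it hands its surplus points to that task,
so the credit is purely ephemeral.

#include <stdio.h>

struct task {
        const char  *name;
        long         points;            /* surplus scheduling credit */
        struct task *last_woken;        /* task this one woke most recently */
};

/* Remember who we woke, so we know where to send credit when we sleep. */
static void wake(struct task *waker, struct task *wakee)
{
        waker->last_woken = wakee;
}

/* On sleep, donate any surplus points to the task we woke last. */
static void sleep_task(struct task *t)
{
        if (t->last_woken && t->points > 0) {
                t->last_woken->points += t->points;
                t->points = 0;
        }
}

int main(void)
{
        struct task client  = { "client", 5, NULL };
        struct task xserver = { "X",      0, NULL };

        wake(&client, &xserver);   /* client sends X a request ...           */
        sleep_task(&client);       /* ... and blocks: its credit moves to X  */

        wake(&xserver, &client);   /* X sends the reply back ...             */
        sleep_task(&xserver);      /* ... and sleeps: the credit flows back  */

        printf("client=%ld X=%ld\n", client.points, xserver.points);
        return 0;
}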


Within reason, it's not the number of clients that X has that causes its 
CPU bandwidth use to sky rocket and cause problems.  It's more to do 
with what type of clients they are.  Most GUIs (even ones that are 
constantly updating visual data (e.g. gkrellm -- I can open quite a 
large number of these without increasing X's CPU usage very much)) cause 
very little load on the X server.  The exceptions to this are the 
various terminal emulators (e.g. xterm, gnome-terminal, etc.) when being 
used to run output intensive command line programs e.g. try "ls -lR /" 
in an xterm.  The other way (that I've noticed) to make X's CPU bandwidth 
usage sky rocket is to grab a large window and wiggle it about a lot, and 
hopefully this doesn't happen a lot, so the problem that needs to be 
addressed is the one caused by text output on xterm and its ilk.


So I think that an elaborate scheme for distributing "points" between X 
and its clients would be overkill.  A good scheduler will make sure 
other tasks such as audio streamers get CPU when they need it with good 
responsiveness even when X takes off by giving them higher priority 
because their CPU bandwidth use is low.


The one problem that might still be apparent in these cases is the mouse 
becoming jerky while X is working like crazy to spew out text too fast 
for anyone to read.  But the only way to fix that is to give X more 
bandwidth, but if it's already running at about 95% of a CPU that's 
unlikely to help.  To fix this you would probably need to modify X so 
that it knows re-rendering the cursor is more important than rendering 
text in an xterm.


In normal circumstances, the re-rendering of the mouse happens quickly 
enough for the user to experience good responsiveness because X's normal 
CPU use is low enough for it to be given high priority.


Just because the O(1) scheduler tried this model and failed doesn't mean 
that the model is bad.  O(1) was a flawed implementation of a good model.


Peter
PS Doing a kernel build in an xterm isn't an example of high enough 
output to cause a problem as (on my system) it only raises X's 
consumption from 0-2% to 2-5%.  The type of output that causes the 
problem is usually flying past too fast to read.

--
Peter Williams   [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

