Re: [PATCH] sched: Rationalize sys_sched_rr_get_interval()
Jarek Poplawski wrote: On 16-10-2007 03:16, Peter Williams wrote: ... I'd suggest that we modify sched_rr_get_interval() to return -EINVAL (with *interval set to zero) if the target task is not SCHED_RR. That way we can save a lot of unnecessary code. I'll work on a patch. ... I like this idea! But, since this is a system call, maybe at least something like an RFC would be nicer... We would just be modifying the code to meet that specification, so a patch would be OK. Anyone who wants to comment will do so anyway :-). Sorry for too harsh words. I didn't consider them harsh. So, I can't be mistaken for a rapper yet? I'll work on it... Cheers, Jarek P. PS: Peter, for some unknown reason I don't receive your messages. If you get back any errors from my side I'd be interested to see them (alternative: jarkao2 at gmail.com). I haven't seen any bounce notifications. I've added the gmail address as a CC. Peter -- Peter Williams [EMAIL PROTECTED] "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] sched: Rationalize sys_sched_rr_get_interval()
Jarek Poplawski wrote: On 13-10-2007 03:29, Peter Williams wrote: Jarek Poplawski wrote: On 12-10-2007 00:23, Peter Williams wrote: ... The reason I was going that route was for modularity (which helps when adding plugsched patches). I'll submit a revised patch for consideration. ... IMHO, it looks like modularity could suck here: +static unsigned int default_timeslice_fair(struct task_struct *p) +{ + return NS_TO_JIFFIES(sysctl_sched_min_granularity); +} If it's needed externally and sched_fair will use something else (to avoid double conversion), this could be misleading. Shouldn't this be kind of private and return something usable mainly by the class? This is supplying data for a system call, not something for internal use by the class. As far as the sched_fair class is concerned this is just a (necessary - because it's needed by a system call) diversion. So, now all is clear: this is the misleading case! Why should anything other than sched_fair care about this? sched_fair doesn't care, so if nothing else does, why do we even have sys_sched_rr_get_interval()? Is this whole function an anachronism that can be expunged? I'm assuming that the reason it exists is that there are user space programs that use this system call. Am I correct in this assumption? Personally, I can't think of anything it would be useful for other than satisfying curiosity. Since this is for some special aim (not the default for most classes, at least not for sched_fair) I'd suggest changing the names default_timeslice_fair() and .default_timeslice to something like, e.g., rr_timeslice_fair() and .rr_timeslice, or rr_interval_fair() and .rr_interval (maybe with "default" before "rr_" if necessary). On the other hand, man (2) sched_rr_get_interval mentions that: "The identified process should be running under the SCHED_RR scheduling policy". 
Also, this page seems to describe something simpler: http://www.gnu.org/software/libc/manual/html_node/Basic-Scheduling-Functions.html So, I still doubt that sched_fair's "notion" of timeslices should be necessary here. As do I. Even more so now that you've shown me the man page for sched_rr_get_interval(). I'd suggest that we modify sched_rr_get_interval() to return -EINVAL (with *interval set to zero) if the target task is not SCHED_RR. That way we can save a lot of unnecessary code. I'll work on a patch. Unless you want to do it? Sorry for too harsh words. I didn't consider them harsh. Peter -- Peter Williams [EMAIL PROTECTED] "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce
Re: [PATCH] sched: Rationalize sys_sched_rr_get_interval()
Jarek Poplawski wrote: On 12-10-2007 00:23, Peter Williams wrote: ... The reason I was going that route was for modularity (which helps when adding plugsched patches). I'll submit a revised patch for consideration. ... IMHO, it looks like modularity could suck here: +static unsigned int default_timeslice_fair(struct task_struct *p) +{ + return NS_TO_JIFFIES(sysctl_sched_min_granularity); +} If it's needed externally and sched_fair will use something else (to avoid double conversion), this could be misleading. Shouldn't this be kind of private and return something usable mainly by the class? This is supplying data for a system call, not something for internal use by the class. As far as the sched_fair class is concerned this is just a (necessary - because it's needed by a system call) diversion. Why should anything other than sched_fair care about this? sched_fair doesn't care, so if nothing else does, why do we even have sys_sched_rr_get_interval()? Is this whole function an anachronism that can be expunged? I'm assuming that the reason it exists is that there are user space programs that use this system call. Am I correct in this assumption? Personally, I can't think of anything it would be useful for other than satisfying curiosity. Peter -- Peter Williams [EMAIL PROTECTED] "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce
Re: [PATCH] sched: Rationalize sys_sched_rr_get_interval()
Dmitry Adamushko wrote: On 11/10/2007, Ingo Molnar <[EMAIL PROTECTED]> wrote: * Peter Williams <[EMAIL PROTECTED]> wrote: -#define MIN_TIMESLICE max(5 * HZ / 1000, 1) -#define DEF_TIMESLICE (100 * HZ / 1000) hm, this got removed by Dmitry quite some time ago. Could you please do this patch against the sched-devel git tree; here is the commit: http://git.kernel.org/?p=linux/kernel/git/mingo/linux-2.6-sched-devel.git;a=commit;h=dd3fec36addd1bf76b05225b7e483378b80c3f9e I had also considered introducing something like sched_class::task_timeslice() but decided it was not worth it. The reason I was going that route was for modularity (which helps when adding plugsched patches). I'll submit a revised patch for consideration. Peter -- Peter Williams [EMAIL PROTECTED] "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce
[PATCH] sched: Rationalize sys_sched_rr_get_interval()
At the moment, static_prio_timeslice() is only used in sys_sched_rr_get_interval() and only gives the correct result for SCHED_FIFO and SCHED_RR tasks, as the time slice for normal tasks is unrelated to the values returned by static_prio_timeslice(). This patch addresses this problem and, in the process, moves all the code associated with static_prio_timeslice() to sched_rt.c, which is the only place where it now has relevance.

Signed-off-by: Peter Williams <[EMAIL PROTECTED]>

Peter
--
Peter Williams [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce

diff -r 3df82b0661ca include/linux/sched.h
--- a/include/linux/sched.h	Mon Sep 03 12:06:59 2007 +1000
+++ b/include/linux/sched.h	Mon Sep 03 12:06:59 2007 +1000
@@ -878,6 +878,7 @@ struct sched_class {
 	void (*set_curr_task) (struct rq *rq);
 	void (*task_tick) (struct rq *rq, struct task_struct *p);
 	void (*task_new) (struct rq *rq, struct task_struct *p);
+	unsigned int (*default_timeslice) (struct task_struct *p);
 };
 
 struct load_weight {
diff -r 3df82b0661ca kernel/sched.c
--- a/kernel/sched.c	Mon Sep 03 12:06:59 2007 +1000
+++ b/kernel/sched.c	Mon Sep 03 12:06:59 2007 +1000
@@ -101,16 +101,6 @@ unsigned long long __attribute__((weak))
 #define NICE_0_LOAD		SCHED_LOAD_SCALE
 #define NICE_0_SHIFT		SCHED_LOAD_SHIFT
 
-/*
- * These are the 'tuning knobs' of the scheduler:
- *
- * Minimum timeslice is 5 msecs (or 1 jiffy, whichever is larger),
- * default timeslice is 100 msecs, maximum timeslice is 800 msecs.
- * Timeslices get refilled after they expire.
- */
-#define MIN_TIMESLICE		max(5 * HZ / 1000, 1)
-#define DEF_TIMESLICE		(100 * HZ / 1000)
-
 #ifdef CONFIG_SMP
 /*
  * Divide a load by a sched group cpu_power : (load / sg->__cpu_power)
@@ -131,24 +121,6 @@ static inline void sg_inc_cpu_power(stru
 	sg->reciprocal_cpu_power = reciprocal_value(sg->__cpu_power);
 }
 #endif
-
-#define SCALE_PRIO(x, prio) \
-	max(x * (MAX_PRIO - prio) / (MAX_USER_PRIO / 2), MIN_TIMESLICE)
-
-/*
- * static_prio_timeslice() scales user-nice values [ -20 ... 0 ... 19 ]
- * to time slice values: [800ms ... 100ms ... 5ms]
- */
-static unsigned int static_prio_timeslice(int static_prio)
-{
-	if (static_prio == NICE_TO_PRIO(19))
-		return 1;
-
-	if (static_prio < NICE_TO_PRIO(0))
-		return SCALE_PRIO(DEF_TIMESLICE * 4, static_prio);
-	else
-		return SCALE_PRIO(DEF_TIMESLICE, static_prio);
-}
 
 static inline int rt_policy(int policy)
 {
@@ -4784,8 +4756,7 @@ long sys_sched_rr_get_interval(pid_t pid
 	if (retval)
 		goto out_unlock;
 
-	jiffies_to_timespec(p->policy == SCHED_FIFO ?
-				0 : static_prio_timeslice(p->static_prio), &t);
+	jiffies_to_timespec(p->sched_class->default_timeslice(p), &t);
 	read_unlock(&tasklist_lock);
 	retval = copy_to_user(interval, &t, sizeof(t)) ? -EFAULT : 0;
 out_nounlock:
diff -r 3df82b0661ca kernel/sched_fair.c
--- a/kernel/sched_fair.c	Mon Sep 03 12:06:59 2007 +1000
+++ b/kernel/sched_fair.c	Mon Sep 03 12:06:59 2007 +1000
@@ -1159,6 +1159,11 @@ static void set_curr_task_fair(struct rq
 }
 #endif
 
+static unsigned int default_timeslice_fair(struct task_struct *p)
+{
+	return NS_TO_JIFFIES(sysctl_sched_min_granularity);
+}
+
 /*
  * All the scheduling class methods:
  */
@@ -1180,6 +1185,7 @@ struct sched_class fair_sched_class __re
 	.set_curr_task		= set_curr_task_fair,
 	.task_tick		= task_tick_fair,
 	.task_new		= task_new_fair,
+	.default_timeslice	= default_timeslice_fair,
 };
 
 #ifdef CONFIG_SCHED_DEBUG
diff -r 3df82b0661ca kernel/sched_idletask.c
--- a/kernel/sched_idletask.c	Mon Sep 03 12:06:59 2007 +1000
+++ b/kernel/sched_idletask.c	Mon Sep 03 12:06:59 2007 +1000
@@ -59,6 +59,11 @@ static void task_tick_idle(struct rq *rq
 {
 }
 
+static unsigned int default_timeslice_idle(struct task_struct *p)
+{
+	return 0;
+}
+
 /*
  * Simple, special scheduling class for the per-CPU idle tasks:
  */
@@ -80,4 +85,5 @@ static struct sched_class idle_sched_cla
 	.task_tick		= task_tick_idle,
 	/* no .task_new for idle tasks */
+	.default_timeslice	= default_timeslice_idle,
 };
diff -r 3df82b0661ca kernel/sched_rt.c
--- a/kernel/sched_rt.c	Mon Sep 03 12:06:59 2007 +1000
+++ b/kernel/sched_rt.c	Mon Sep 03 12:06:59 2007 +1000
@@ -205,6 +205,34 @@ move_one_task_rt(struct rq *this_rq, int
 }
 #endif
 
+/*
+ * These are the 'tuning knobs' of the scheduler:
+ *
+ * Minimum timeslice is 5 msecs (or 1 jiffy, whichever is larger),
+ * default timeslice is 100 msecs, maximum timeslice is 800 msecs.
+ * Timeslices get refilled after they expire.
+ */
+#define MIN_TIMESLICE		max(5 * HZ / 1000, 1)
+#define DEF_TIMESLICE		(100 * HZ / 1000)
+
+#define SCALE_PRIO(x, prio) \
+	max(x * (MAX_PRIO - prio) / (MAX_USER_PRIO / 2), MIN_TIMESLICE)
+
+/*
+ * static_prio_timeslice() scales user-nice values [ -20 ... 0 ... 19 ]
+ * to time slice values: [800ms ... 100ms ... 5ms]
+ */
+static unsigned int static_prio_timeslice(int static_prio)
+{
+	if (static_prio == NICE_TO_PRIO(19))
+		return 1;
+
+	if (static_prio < NICE_TO_PRIO(0))
+		return SCALE_PRIO(DEF_TIMESLICE * 4, static_prio);
+	else
+		return SCALE_PRIO(DEF_TIMESLICE, static_prio);
+}
[PATCH] sched: Exclude SMP code from non SMP builds
At the moment, a lot of load balancing code that is irrelevant to non SMP systems gets included during non SMP builds. This patch addresses this issue and should reduce the binary size on non SMP systems. This patch assumes that the "sched: Reduce overhead in balance_tasks()" (non urgent) patch that I sent on the 15th of August has been applied.

Signed-off-by: Peter Williams <[EMAIL PROTECTED]>

Peter
--
Peter Williams [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce

diff -r df69cb019596 include/linux/sched.h
--- a/include/linux/sched.h	Thu Aug 16 12:12:18 2007 +1000
+++ b/include/linux/sched.h	Fri Aug 17 13:54:28 2007 +1000
@@ -864,6 +864,7 @@ struct sched_class {
 	struct task_struct * (*pick_next_task) (struct rq *rq);
 	void (*put_prev_task) (struct rq *rq, struct task_struct *p);
 
+#ifdef CONFIG_SMP
 	unsigned long (*load_balance) (struct rq *this_rq, int this_cpu,
 			struct rq *busiest, unsigned long max_load_move,
 			struct sched_domain *sd, enum cpu_idle_type idle,
@@ -872,6 +873,7 @@ struct sched_class {
 	int (*move_one_task) (struct rq *this_rq, int this_cpu,
 			struct rq *busiest, struct sched_domain *sd,
 			enum cpu_idle_type idle);
+#endif
 
 	void (*set_curr_task) (struct rq *rq);
 	void (*task_tick) (struct rq *rq, struct task_struct *p);
diff -r df69cb019596 kernel/sched.c
--- a/kernel/sched.c	Thu Aug 16 12:12:18 2007 +1000
+++ b/kernel/sched.c	Fri Aug 17 16:03:11 2007 +1000
@@ -764,23 +764,6 @@ iter_move_one_task(struct rq *this_rq, i
 iter_move_one_task(struct rq *this_rq, int this_cpu, struct rq *busiest,
 		   struct sched_domain *sd, enum cpu_idle_type idle,
 		   struct rq_iterator *iterator);
-#else
-static inline unsigned long
-balance_tasks(struct rq *this_rq, int this_cpu, struct rq *busiest,
-	      unsigned long max_load_move, struct sched_domain *sd,
-	      enum cpu_idle_type idle, int *all_pinned,
-	      int *this_best_prio, struct rq_iterator *iterator)
-{
-	return 0;
-}
-
-static inline int
-iter_move_one_task(struct rq *this_rq, int this_cpu, struct rq *busiest,
-		   struct sched_domain *sd, enum cpu_idle_type idle,
-		   struct rq_iterator *iterator)
-{
-	return 0;
-}
 #endif
 
 #include "sched_stats.h"
diff -r df69cb019596 kernel/sched_fair.c
--- a/kernel/sched_fair.c	Thu Aug 16 12:12:18 2007 +1000
+++ b/kernel/sched_fair.c	Fri Aug 17 16:00:21 2007 +1000
@@ -887,6 +887,7 @@ static void put_prev_task_fair(struct rq
 	}
 }
 
+#ifdef CONFIG_SMP
 /**
  * Fair scheduling class load-balancing methods:
  */
@@ -1004,6 +1005,7 @@ move_one_task_fair(struct rq *this_rq, i
 	return 0;
 }
+#endif
 
 /*
  * scheduler tick hitting a task of our scheduling class:
@@ -1090,8 +1092,10 @@ struct sched_class fair_sched_class __re
 	.pick_next_task		= pick_next_task_fair,
 	.put_prev_task		= put_prev_task_fair,
+#ifdef CONFIG_SMP
 	.load_balance		= load_balance_fair,
 	.move_one_task		= move_one_task_fair,
+#endif
 	.set_curr_task		= set_curr_task_fair,
 	.task_tick		= task_tick_fair,
diff -r df69cb019596 kernel/sched_idletask.c
--- a/kernel/sched_idletask.c	Thu Aug 16 12:12:18 2007 +1000
+++ b/kernel/sched_idletask.c	Fri Aug 17 15:58:59 2007 +1000
@@ -37,6 +37,7 @@ static void put_prev_task_idle(struct rq
 {
 }
 
+#ifdef CONFIG_SMP
 static unsigned long
 load_balance_idle(struct rq *this_rq, int this_cpu, struct rq *busiest,
 		  unsigned long max_load_move,
@@ -52,6 +53,7 @@ move_one_task_idle(struct rq *this_rq, i
 {
 	return 0;
 }
+#endif
 
 static void task_tick_idle(struct rq *rq, struct task_struct *curr)
 {
@@ -71,8 +73,10 @@ static struct sched_class idle_sched_cla
 	.pick_next_task		= pick_next_task_idle,
 	.put_prev_task		= put_prev_task_idle,
+#ifdef CONFIG_SMP
 	.load_balance		= load_balance_idle,
 	.move_one_task		= move_one_task_idle,
+#endif
 	.task_tick		= task_tick_idle,
 	/* no .task_new for idle tasks */
diff -r df69cb019596 kernel/sched_rt.c
--- a/kernel/sched_rt.c	Thu Aug 16 12:12:18 2007 +1000
+++ b/kernel/sched_rt.c	Fri Aug 17 15:53:20 2007 +1000
@@ -98,6 +98,7 @@ static void put_prev_task_rt(struct rq *
 	p->se.exec_start = 0;
 }
 
+#ifdef CONFIG_SMP
 /*
  * Load-balancing iterator. Note: while the runqueue stays locked
  * during the whole iteration, the current task might be
@@ -202,6 +203,7 @@ move_one_task_rt(struct rq *this_rq, int
 	return iter_move_one_task(this_rq, this_cpu, busiest, sd, idle,
 				  &rt_rq_iterator);
 }
+#endif
 
 static void task_tick_rt(struct rq *rq, struct task_struct *p)
 {
@@ -232,8 +234,10 @@ static struct sched_class rt_sched_class
 	.pick_next_task		= pick_next_task_rt,
 	.put_prev_task		= put_prev_task_rt,
+#ifdef CONFIG_SMP
 	.load_balance		= load_balance_rt,
 	.move_one_task		= move_one_task_rt,
+#endif
 	.task_tick		= task_tick_rt,
 };
[PATCH] sched: Rationalize sys_sched_rr_get_interval()
At the moment, static_prio_timeslice() is only used in sys_sched_rr_get_interval() and only gives the correct result for SCHED_FIFO and SCHED_RR tasks as the time slice for normal tasks is unrelated to the values returned by static_prio_timeslice(). This patch addresses this problem and in the process moves all the code associated with static_prio_timeslice() to sched_rt.c which is the only place where it now has relevance. Signed-off-by: Peter Williams [EMAIL PROTECTED] Peter -- Peter Williams [EMAIL PROTECTED] Learning, n. The kind of ignorance distinguishing the studious. -- Ambrose Bierce diff -r 3df82b0661ca include/linux/sched.h --- a/include/linux/sched.h Mon Sep 03 12:06:59 2007 +1000 +++ b/include/linux/sched.h Mon Sep 03 12:06:59 2007 +1000 @@ -878,6 +878,7 @@ struct sched_class { void (*set_curr_task) (struct rq *rq); void (*task_tick) (struct rq *rq, struct task_struct *p); void (*task_new) (struct rq *rq, struct task_struct *p); + unsigned int (*default_timeslice) (struct task_struct *p); }; struct load_weight { diff -r 3df82b0661ca kernel/sched.c --- a/kernel/sched.c Mon Sep 03 12:06:59 2007 +1000 +++ b/kernel/sched.c Mon Sep 03 12:06:59 2007 +1000 @@ -101,16 +101,6 @@ unsigned long long __attribute__((weak)) #define NICE_0_LOAD SCHED_LOAD_SCALE #define NICE_0_SHIFT SCHED_LOAD_SHIFT -/* - * These are the 'tuning knobs' of the scheduler: - * - * Minimum timeslice is 5 msecs (or 1 jiffy, whichever is larger), - * default timeslice is 100 msecs, maximum timeslice is 800 msecs. - * Timeslices get refilled after they expire. 
- */ -#define MIN_TIMESLICE max(5 * HZ / 1000, 1) -#define DEF_TIMESLICE (100 * HZ / 1000) - #ifdef CONFIG_SMP /* * Divide a load by a sched group cpu_power : (load / sg-__cpu_power) @@ -131,24 +121,6 @@ static inline void sg_inc_cpu_power(stru sg-reciprocal_cpu_power = reciprocal_value(sg-__cpu_power); } #endif - -#define SCALE_PRIO(x, prio) \ - max(x * (MAX_PRIO - prio) / (MAX_USER_PRIO / 2), MIN_TIMESLICE) - -/* - * static_prio_timeslice() scales user-nice values [ -20 ... 0 ... 19 ] - * to time slice values: [800ms ... 100ms ... 5ms] - */ -static unsigned int static_prio_timeslice(int static_prio) -{ - if (static_prio == NICE_TO_PRIO(19)) - return 1; - - if (static_prio NICE_TO_PRIO(0)) - return SCALE_PRIO(DEF_TIMESLICE * 4, static_prio); - else - return SCALE_PRIO(DEF_TIMESLICE, static_prio); -} static inline int rt_policy(int policy) { @@ -4784,8 +4756,7 @@ long sys_sched_rr_get_interval(pid_t pid if (retval) goto out_unlock; - jiffies_to_timespec(p-policy == SCHED_FIFO ? -0 : static_prio_timeslice(p-static_prio), t); + jiffies_to_timespec(p-sched_class-default_timeslice(p), t); read_unlock(tasklist_lock); retval = copy_to_user(interval, t, sizeof(t)) ? 
-EFAULT : 0; out_nounlock: diff -r 3df82b0661ca kernel/sched_fair.c --- a/kernel/sched_fair.c Mon Sep 03 12:06:59 2007 +1000 +++ b/kernel/sched_fair.c Mon Sep 03 12:06:59 2007 +1000 @@ -1159,6 +1159,11 @@ static void set_curr_task_fair(struct rq } #endif +static unsigned int default_timeslice_fair(struct task_struct *p) +{ + return NS_TO_JIFFIES(sysctl_sched_min_granularity); +} + /* * All the scheduling class methods: */ @@ -1180,6 +1185,7 @@ struct sched_class fair_sched_class __re .set_curr_task = set_curr_task_fair, .task_tick = task_tick_fair, .task_new = task_new_fair, + .default_timeslice = default_timeslice_fair, }; #ifdef CONFIG_SCHED_DEBUG diff -r 3df82b0661ca kernel/sched_idletask.c --- a/kernel/sched_idletask.c Mon Sep 03 12:06:59 2007 +1000 +++ b/kernel/sched_idletask.c Mon Sep 03 12:06:59 2007 +1000 @@ -59,6 +59,11 @@ static void task_tick_idle(struct rq *rq { } +static unsigned int default_timeslice_idle(struct task_struct *p) +{ + return 0; +} + /* * Simple, special scheduling class for the per-CPU idle tasks: */ @@ -80,4 +85,5 @@ static struct sched_class idle_sched_cla .task_tick = task_tick_idle, /* no .task_new for idle tasks */ + .default_timeslice = default_timeslice_idle, }; diff -r 3df82b0661ca kernel/sched_rt.c --- a/kernel/sched_rt.c Mon Sep 03 12:06:59 2007 +1000 +++ b/kernel/sched_rt.c Mon Sep 03 12:06:59 2007 +1000 @@ -205,6 +205,34 @@ move_one_task_rt(struct rq *this_rq, int } #endif +/* + * These are the 'tuning knobs' of the scheduler: + * + * Minimum timeslice is 5 msecs (or 1 jiffy, whichever is larger), + * default timeslice is 100 msecs, maximum timeslice is 800 msecs. + * Timeslices get refilled after they expire. + */ +#define MIN_TIMESLICE max(5 * HZ / 1000, 1) +#define DEF_TIMESLICE (100 * HZ / 1000) + +#define SCALE_PRIO(x, prio) \ + max(x * (MAX_PRIO - prio) / (MAX_USER_PRIO / 2), MIN_TIMESLICE) + +/* + * static_prio_timeslice() scales user-nice values [ -20 ... 0 ... 19 ] + * to time slice values: [800ms ... 100ms ... 
5ms] + */ +static unsigned int static_prio_timeslice(int static_prio) +{ + if (static_prio == NICE_TO_PRIO
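For reference, the nice-to-timeslice scaling that the deleted static_prio_timeslice() implements can be sketched as follows (a minimal Python sketch, assuming HZ = 1000 so one jiffy is 1 ms, with constants as in the 2.6.23-era scheduler):

```python
# Sketch (assumed HZ = 1000, so one jiffy == 1 ms) of the nice-to-timeslice
# scaling implemented by the deleted static_prio_timeslice().
HZ = 1000
MAX_RT_PRIO = 100
MAX_PRIO = 140
MAX_USER_PRIO = 40
MIN_TIMESLICE = max(5 * HZ // 1000, 1)    # 5 ms (or 1 jiffy, if larger)
DEF_TIMESLICE = 100 * HZ // 1000          # 100 ms

def nice_to_prio(nice):
    return MAX_RT_PRIO + 20 + nice        # nice 0 -> static_prio 120

def scale_prio(x, prio):
    return max(x * (MAX_PRIO - prio) // (MAX_USER_PRIO // 2), MIN_TIMESLICE)

def static_prio_timeslice(static_prio):
    if static_prio == nice_to_prio(19):
        return 1                          # lowest priority: a single jiffy
    if static_prio < nice_to_prio(0):
        return scale_prio(DEF_TIMESLICE * 4, static_prio)
    return scale_prio(DEF_TIMESLICE, static_prio)

# nice -20 -> 800 ms, nice 0 -> 100 ms, nice 19 -> 1 jiffy
```

This shows the [800ms ... 100ms ... 5ms] range mentioned in the comment: the steeper DEF_TIMESLICE * 4 slope applies only to negative nice values.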
Re: [PATCH] sched: Rationalize sys_sched_rr_get_interval()
Dmitry Adamushko wrote: On 11/10/2007, Ingo Molnar [EMAIL PROTECTED] wrote: * Peter Williams [EMAIL PROTECTED] wrote: -#define MIN_TIMESLICE max(5 * HZ / 1000, 1) -#define DEF_TIMESLICE (100 * HZ / 1000) hm, this got removed by Dmitry quite some time ago. Could you please do this patch against the sched-devel git tree: here is the commit: http://git.kernel.org/?p=linux/kernel/git/mingo/linux-2.6-sched-devel.git;a=commit;h=dd3fec36addd1bf76b05225b7e483378b80c3f9e I had also considered introducing something like sched_class::task_timeslice() but decided it was not worth it. The reason I was going that route was for modularity (which helps when adding plugsched patches). I'll submit a revised patch for consideration. Peter -- Peter Williams [EMAIL PROTECTED] "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] sched: Reduce overhead in balance_tasks()
Ingo Molnar wrote: * Peter Williams <[EMAIL PROTECTED]> wrote: At the moment, balance_tasks() provides low level functionality for both move_tasks() and move_one_task() (indirectly) via the load_balance() function (in the sched_class interface) which also provides dual functionality. This dual functionality complicates the interfaces and internal mechanisms and adds to the run time overhead of operations that are called with two run queue locks held. This patch addresses this issue and reduces the overhead of these operations. hm, i like it, and added it to my queue (probably .24 material though), but note that it increases .text and .data overhead: text data bss dec hex filename 41028 37794 2168 80990 13c5e sched.o.before 41349 37826 2168 81343 13dbf sched.o.after is that expected? Yes, sort of. It's a trade-off of space for time and I expected an increase (although I didn't think that it would be quite that much). But it's still less than 1% and since the time saved is time when two run queue locks are held I figure that it's a trade worth making. Also this separation lays the ground for a clean up of the active load balancing code which should gain some space including making it possible to exclude active load balancing on systems that don't need it (i.e. those that don't have multiple multi core/hyperthreading packages). I've got a patch available that reduces the .text and .data for non SMP systems by excluding the load balancing stuff (that has crept into those systems) so that should help on embedded systems where memory is tight. Peter -- Peter Williams [EMAIL PROTECTED] "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce
[PATCH] sched: Reduce overhead in balance_tasks()
At the moment, balance_tasks() provides low level functionality for both move_tasks() and move_one_task() (indirectly) via the load_balance() function (in the sched_class interface) which also provides dual functionality. This dual functionality complicates the interfaces and internal mechanisms and adds to the run time overhead of operations that are called with two run queue locks held. This patch addresses this issue and reduces the overhead of these operations. This patch is not urgent and can be held back until the next merge window without compromising the safety of the kernel. Signed-off-by: Peter Williams <[EMAIL PROTECTED]> Peter -- Peter Williams [EMAIL PROTECTED] "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce diff -r 90691a597f06 include/linux/sched.h --- a/include/linux/sched.h Mon Aug 13 15:06:35 2007 + +++ b/include/linux/sched.h Tue Aug 14 11:11:47 2007 +1000 @@ -865,10 +865,13 @@ struct sched_class { void (*put_prev_task) (struct rq *rq, struct task_struct *p); unsigned long (*load_balance) (struct rq *this_rq, int this_cpu, - struct rq *busiest, - unsigned long max_nr_move, unsigned long max_load_move, + struct rq *busiest, unsigned long max_load_move, struct sched_domain *sd, enum cpu_idle_type idle, int *all_pinned, int *this_best_prio); + + int (*move_one_task) (struct rq *this_rq, int this_cpu, + struct rq *busiest, struct sched_domain *sd, + enum cpu_idle_type idle); void (*set_curr_task) (struct rq *rq); void (*task_tick) (struct rq *rq, struct task_struct *p); diff -r 90691a597f06 kernel/sched.c --- a/kernel/sched.c Mon Aug 13 15:06:35 2007 + +++ b/kernel/sched.c Tue Aug 14 16:26:24 2007 +1000 @@ -753,11 +753,35 @@ struct rq_iterator { struct task_struct *(*next)(void *); }; -static int balance_tasks(struct rq *this_rq, int this_cpu, struct rq *busiest, - unsigned long max_nr_move, unsigned long max_load_move, - struct sched_domain *sd, enum cpu_idle_type idle, - int *all_pinned, unsigned long *load_moved, 
- int *this_best_prio, struct rq_iterator *iterator); +#ifdef CONFIG_SMP +static unsigned long +balance_tasks(struct rq *this_rq, int this_cpu, struct rq *busiest, + unsigned long max_load_move, struct sched_domain *sd, + enum cpu_idle_type idle, int *all_pinned, + int *this_best_prio, struct rq_iterator *iterator); + +static int +iter_move_one_task(struct rq *this_rq, int this_cpu, struct rq *busiest, + struct sched_domain *sd, enum cpu_idle_type idle, + struct rq_iterator *iterator); +#else +static inline unsigned long +balance_tasks(struct rq *this_rq, int this_cpu, struct rq *busiest, + unsigned long max_load_move, struct sched_domain *sd, + enum cpu_idle_type idle, int *all_pinned, + int *this_best_prio, struct rq_iterator *iterator) +{ + return 0; +} + +static inline int +iter_move_one_task(struct rq *this_rq, int this_cpu, struct rq *busiest, + struct sched_domain *sd, enum cpu_idle_type idle, + struct rq_iterator *iterator) +{ + return 0; +} +#endif #include "sched_stats.h" #include "sched_rt.c" @@ -2166,17 +2190,17 @@ int can_migrate_task(struct task_struct return 1; } -static int balance_tasks(struct rq *this_rq, int this_cpu, struct rq *busiest, - unsigned long max_nr_move, unsigned long max_load_move, - struct sched_domain *sd, enum cpu_idle_type idle, - int *all_pinned, unsigned long *load_moved, - int *this_best_prio, struct rq_iterator *iterator) +static unsigned long +balance_tasks(struct rq *this_rq, int this_cpu, struct rq *busiest, + unsigned long max_load_move, struct sched_domain *sd, + enum cpu_idle_type idle, int *all_pinned, + int *this_best_prio, struct rq_iterator *iterator) { int pulled = 0, pinned = 0, skip_for_load; struct task_struct *p; long rem_load_move = max_load_move; - if (max_nr_move == 0 || max_load_move == 0) + if (max_load_move == 0) goto out; pinned = 1; @@ -2209,7 +2233,7 @@ next: * We only want to steal up to the prescribed number of tasks * and the prescribed amount of weighted load. 
*/ - if (pulled < max_nr_move && rem_load_move > 0) { + if (rem_load_move > 0) { if (p->prio < *this_best_prio) *this_best_prio = p->prio; p = iterator->next(iterator->arg); @@ -2217,7 +2241,7 @@ next: } out: /* - * Right now, this is the only place pull_task() is called, + * Right now, this is one of only two places pull_task() is called, * so we can safely collect pull_task() stats here rather than * inside pull_task(). */ @@ -2225,8 +2249,8 @@ out: if (all_pinned) *all_pinned = pinned; - *load_moved = max_load_move - rem_load_move; - return pulled; + + return max_load_move - rem_load_move; } /* @@ -2248,7 +2272,7 @@ static int move_tasks(struct rq *t
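The shape of the split described above can be modelled with a toy sketch (Python, invented class names; the real code iterates struct sched_class instances under run queue locks): move_tasks() keeps pulling weighted load class by class, while the new move_one_task() hook stops at the first class that migrates a single task.

```python
# Toy model (invented names) of the two balancing entry points after the
# split in this patch.

class ToySchedClass:
    def __init__(self, movable_load, has_movable_task):
        self.movable_load = movable_load          # weighted load it can donate
        self.has_movable_task = has_movable_task  # can it give up one task?

    def load_balance(self, max_load_move):
        # returns the amount of weighted load actually moved
        return min(self.movable_load, max_load_move)

    def move_one_task(self):
        # returns 1 on success, 0 on failure, like the new hook
        return 1 if self.has_movable_task else 0

def move_tasks(classes, max_load_move):
    total_load_moved = 0
    for cls in classes:                 # sched_class_highest -> ... -> next
        total_load_moved += cls.load_balance(max_load_move - total_load_moved)
        if total_load_moved >= max_load_move:
            break
    return total_load_moved > 0         # success indicator only

def move_one_task(classes):
    # first class that can donate a task wins; no load accounting needed
    return any(cls.move_one_task() for cls in classes)
```

The point of the sketch: move_one_task() needs none of the load-accounting machinery, which is why folding both behaviours into one balance_tasks() path cost cycles under the double run queue lock.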
[PATCH] sched: Fix bug in balance_tasks()
There are two problems with balance_tasks() and how it is used: 1. The variables best_prio and best_prio_seen (inherited from the old move_tasks()) were only required to handle problems caused by the active/expired arrays, the order in which they were processed and the possibility that the task with the highest priority could be on either. These issues are no longer present and the extra overhead associated with their use is unnecessary (and possibly wrong). 2. In the absence of CONFIG_FAIR_GROUP_SCHED being set, the same this_best_prio variable needs to be used by all scheduling classes or there is a risk of moving too much load. E.g. if the highest priority task on this_rq at the beginning is a fairly low priority task and the rt class migrates a task (during its turn) then that moved task becomes the new highest priority task on this_rq but when the sched_fair class initializes its copy of this_best_prio it will get the priority of the original highest priority task as, due to the run queue locks being held, the reschedule triggered by pull_task() will not have taken place. This could result in inappropriate overriding of skip_for_load and excessive load being moved. The attached patch addresses these problems by deleting all references to best_prio and best_prio_seen and making this_best_prio a reference parameter to the various functions involved. load_balance_fair() has also been modified so that this_best_prio is only reset (in the loop) if CONFIG_FAIR_GROUP_SCHED is set. This should preserve the effect of helping spread groups' higher priority tasks around the available CPUs while improving system performance when CONFIG_FAIR_GROUP_SCHED isn't set. Signed-off-by: Peter Williams <[EMAIL PROTECTED]> Peter -- Peter Williams [EMAIL PROTECTED] "Learning, n. The kind of ignorance distinguishing the studious." 
-- Ambrose Bierce diff -r c39ddf75cd08 include/linux/sched.h --- a/include/linux/sched.h Mon Aug 06 16:08:52 2007 +1000 +++ b/include/linux/sched.h Mon Aug 06 16:13:20 2007 +1000 @@ -870,7 +870,7 @@ struct sched_class { struct rq *busiest, unsigned long max_nr_move, unsigned long max_load_move, struct sched_domain *sd, enum cpu_idle_type idle, - int *all_pinned); + int *all_pinned, int *this_best_prio); void (*set_curr_task) (struct rq *rq); void (*task_tick) (struct rq *rq, struct task_struct *p); diff -r c39ddf75cd08 kernel/sched.c --- a/kernel/sched.c Mon Aug 06 16:08:52 2007 +1000 +++ b/kernel/sched.c Mon Aug 06 16:52:59 2007 +1000 @@ -745,8 +745,7 @@ static int balance_tasks(struct rq *this unsigned long max_nr_move, unsigned long max_load_move, struct sched_domain *sd, enum cpu_idle_type idle, int *all_pinned, unsigned long *load_moved, - int this_best_prio, int best_prio, int best_prio_seen, - struct rq_iterator *iterator); + int *this_best_prio, struct rq_iterator *iterator); #include "sched_stats.h" #include "sched_rt.c" @@ -2166,8 +2165,7 @@ static int balance_tasks(struct rq *this unsigned long max_nr_move, unsigned long max_load_move, struct sched_domain *sd, enum cpu_idle_type idle, int *all_pinned, unsigned long *load_moved, - int this_best_prio, int best_prio, int best_prio_seen, - struct rq_iterator *iterator) + int *this_best_prio, struct rq_iterator *iterator) { int pulled = 0, pinned = 0, skip_for_load; struct task_struct *p; @@ -2192,12 +2190,8 @@ next: */ skip_for_load = (p->se.load.weight >> 1) > rem_load_move + SCHED_LOAD_SCALE_FUZZ; - if (skip_for_load && p->prio < this_best_prio) - skip_for_load = !best_prio_seen && p->prio == best_prio; - if (skip_for_load || + if ((skip_for_load && p->prio >= *this_best_prio) || !can_migrate_task(p, busiest, this_cpu, sd, idle, &pinned)) { - - best_prio_seen |= p->prio == best_prio; p = iterator->next(iterator->arg); goto next; } @@ -2211,8 +2205,8 @@ next: * and the prescribed amount of weighted load. 
*/ if (pulled < max_nr_move && rem_load_move > 0) { - if (p->prio < this_best_prio) - this_best_prio = p->prio; + if (p->prio < *this_best_prio) + *this_best_prio = p->prio; p = iterator->next(iterator->arg); goto next; } @@ -2244,12 +2238,13 @@ static int move_tasks(struct rq *this_rq { struct sched_class *class = sched_class_highest; unsigned long total_load_moved = 0; + int this_best_prio = this_rq->curr->prio; do { total_load_moved += class->load_balance(this_rq, this_cpu, busiest, ULONG_MAX, max_load_move - total_load_moved, -sd, idle, all_pinned); +sd, idle, all_pinned, &this_best_prio); class = class->next; } while (class && max_load_move > total_load_moved); @@ -2267,10 +2262,12 @@ static int move_on
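Problem 2 above is essentially a pass-by-reference issue. A minimal illustration (invented helper, one-element list standing in for the C int *) of why this_best_prio must be shared across the scheduling classes rather than re-initialized per class:

```python
# Toy illustration (invented structure) of the bug fixed here: if each
# class keeps its own copy of this_best_prio, the fair class starts from
# the stale priority of this_rq->curr even after the rt class has already
# moved a higher-priority task; one shared cell fixes that.

def balance_class(moved_prios, this_best_prio):
    """Record the best (numerically lowest) priority this class moved."""
    for prio in moved_prios:
        if prio < this_best_prio[0]:
            this_best_prio[0] = prio

stale_curr_prio = 120                 # this_rq->curr before any migration

# buggy variant: each class initializes its own copy
rt_copy = [stale_curr_prio]
fair_copy = [stale_curr_prio]
balance_class([10], rt_copy)          # rt class migrates a prio-10 task
assert fair_copy[0] == 120            # fair class never learns about it

# fixed variant: one cell threaded through every class, as in the patch
shared = [stale_curr_prio]
balance_class([10], shared)           # rt class's turn
balance_class([], shared)             # fair class's turn now sees prio 10
assert shared[0] == 10
```

With the shared cell, the skip_for_load comparison in the fair class uses the up-to-date best priority, so heavy tasks are not skipped (or moved) on the basis of stale information.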
Possible error in 2.6.23-rc2-rt1 series
I've just been reviewing these patches and have spotted a possible error in the file arch/ia64/kernel/time.c in that the scope of the #ifdef on CONFIG_TIME_INTERPOLATION seems to have grown quite a lot since 2.6.23-rc1-rt7. It used to chop out one if statement and now it chops out half the file. Is it correct? Peter -- Peter Williams [EMAIL PROTECTED] "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce
[PATCH] sched: Simplify move_tasks()
The move_tasks() function is currently multiplexed with two distinct capabilities: 1. attempt to move a specified amount of weighted load from one run queue to another; and 2. attempt to move a specified number of tasks from one run queue to another. The first of these capabilities is used in two places, load_balance() and load_balance_idle(), and in both of these cases the return value of move_tasks() is used purely to decide if tasks/load were moved and no notice of the actual number of tasks moved is taken. The second capability is used in exactly one place, active_load_balance(), to attempt to move exactly one task and, as before, the return value is only used as an indicator of success or failure. This multiplexing of move_tasks() was introduced, by me, as part of the smpnice patches and was motivated by the fact that the alternative, one function to move specified load and one to move a single task, would have led to two functions of roughly the same complexity as the old move_tasks() (or the new balance_tasks()). However, the new modular design of the new CFS scheduler allows a simpler solution to be adopted and this patch implements that solution by: 1. adding a new function, move_one_task(), to be used by active_load_balance(); and 2. making move_tasks() a single purpose function that tries to move a specified weighted load and returns 1 for success and 0 for failure. One of the consequences of these changes is that neither move_one_task() nor the new move_tasks() cares how many tasks sched_class.load_balance() moves and this enables its interface to be simplified by returning the amount of load moved as its result and removing the load_moved pointer from the argument list. This helps simplify the new move_tasks() and slightly reduces the amount of work done in each of sched_class.load_balance()'s implementations. Further simplification, e.g. 
changes to balance_tasks(), are possible but (slightly) complicated by the special needs of load_balance_fair() so I've left them to a later patch (if this one gets accepted). NB Since move_tasks() gets called with two run queue locks held even small reductions in overhead are worthwhile. Signed-off-by: Peter Williams <[EMAIL PROTECTED]> -- Peter Williams [EMAIL PROTECTED] "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce diff -r b97e7dab8f7b include/linux/sched.h --- a/include/linux/sched.h Thu Aug 02 14:08:53 2007 -0700 +++ b/include/linux/sched.h Fri Aug 03 15:56:41 2007 +1000 @@ -866,11 +866,11 @@ struct sched_class { struct task_struct * (*pick_next_task) (struct rq *rq, u64 now); void (*put_prev_task) (struct rq *rq, struct task_struct *p, u64 now); - int (*load_balance) (struct rq *this_rq, int this_cpu, + unsigned long (*load_balance) (struct rq *this_rq, int this_cpu, struct rq *busiest, unsigned long max_nr_move, unsigned long max_load_move, struct sched_domain *sd, enum cpu_idle_type idle, - int *all_pinned, unsigned long *total_load_moved); + int *all_pinned); void (*set_curr_task) (struct rq *rq); void (*task_tick) (struct rq *rq, struct task_struct *p); diff -r b97e7dab8f7b kernel/sched.c --- a/kernel/sched.c Thu Aug 02 14:08:53 2007 -0700 +++ b/kernel/sched.c Sat Aug 04 10:06:42 2007 +1000 @@ -2231,32 +2231,49 @@ out: } /* - * move_tasks tries to move up to max_nr_move tasks and max_load_move weighted - * load from busiest to this_rq, as part of a balancing operation within - * "domain". Returns the number of tasks moved. + * move_tasks tries to move up to max_load_move weighted load from busiest to + * this_rq, as part of a balancing operation within domain "sd". + * Returns 1 if successful and 0 otherwise. * * Called with both runqueues locked. 
*/ static int move_tasks(struct rq *this_rq, int this_cpu, struct rq *busiest, - unsigned long max_nr_move, unsigned long max_load_move, + unsigned long max_load_move, struct sched_domain *sd, enum cpu_idle_type idle, int *all_pinned) { struct sched_class *class = sched_class_highest; - unsigned long load_moved, total_nr_moved = 0, nr_moved; - long rem_load_move = max_load_move; + unsigned long total_load_moved = 0; do { - nr_moved = class->load_balance(this_rq, this_cpu, busiest, -max_nr_move, (unsigned long)rem_load_move, -sd, idle, all_pinned, &load_moved); - total_nr_moved += nr_moved; - max_nr_move -= nr_moved; - rem_load_move -= load_moved; + total_load_moved += + class->load_balance(this_rq, this_cpu, busiest, +ULONG_MAX, max_load_move - total_load_moved, +sd, idle, all_pinned); class = class->next; - } while (class && max_nr_move && rem_load_move > 0); - - return total_nr_moved; + } while (class && max_load_move > total_load_moved); + + return total_load_moved > 0; +} + +/* + * move_one_task tries to
[PATCH] sched: Simplify move_tasks()
The move_tasks() function is currently multiplexed with two distinct capabilities: 1. attempt to move a specified amount of weighted load from one run queue to another; and 2. attempt to move a specified number of tasks from one run queue to another. The first of these capabilities is used in two places, load_balance() and load_balance_idle(), and in both of these cases the return value of move_tasks() is used purely to decide if tasks/load were moved and no notice of the actual number of tasks moved is taken. The second capability is used in exactly one place, active_load_balance(), to attempt to move exactly one task and, as before, the return value is only used as an indicator of success or failure. This multiplexing of sched_task() was introduced, by me, as part of the smpnice patches and was motivated by the fact that the alternative, one function to move specified load and one to move a single task, would have led to two functions of roughly the same complexity as the old move_tasks() (or the new balance_tasks()). However, the new modular design of the new CFS scheduler allows a simpler solution to be adopted and this patch addresses that solution by: 1. adding a new function, move_one_task(), to be used by active_load_balance(); and 2. making move_tasks() a single purpose function that tries to move a specified weighted load and returns 1 for success and 0 for failure. One of the consequences of these changes is that neither move_one_task() or the new move_tasks() care how many tasks sched_class.load_balance() moves and this enables its interface to be simplified by returning the amount of load moved as its result and removing the load_moved pointer from the argument list. This helps simplify the new move_tasks() and slightly reduces the amount of work done in each of sched_class.load_balance()'s implementations. Further simplification, e.g. 
changes to balance_tasks(), are possible but (slightly) complicated by the special needs of load_balance_fair() so I've left them to a later patch (if this one gets accepted). NB Since move_tasks() gets called with two run queue locks held even small reductions in overhead are worthwhile. Signed-off-by: Peter Williams [EMAIL PROTECTED] -- Peter Williams [EMAIL PROTECTED] Learning, n. The kind of ignorance distinguishing the studious. -- Ambrose Bierce diff -r b97e7dab8f7b include/linux/sched.h --- a/include/linux/sched.h Thu Aug 02 14:08:53 2007 -0700 +++ b/include/linux/sched.h Fri Aug 03 15:56:41 2007 +1000 @@ -866,11 +866,11 @@ struct sched_class { struct task_struct * (*pick_next_task) (struct rq *rq, u64 now); void (*put_prev_task) (struct rq *rq, struct task_struct *p, u64 now); - int (*load_balance) (struct rq *this_rq, int this_cpu, + unsigned long (*load_balance) (struct rq *this_rq, int this_cpu, struct rq *busiest, unsigned long max_nr_move, unsigned long max_load_move, struct sched_domain *sd, enum cpu_idle_type idle, - int *all_pinned, unsigned long *total_load_moved); + int *all_pinned); void (*set_curr_task) (struct rq *rq); void (*task_tick) (struct rq *rq, struct task_struct *p); diff -r b97e7dab8f7b kernel/sched.c --- a/kernel/sched.c Thu Aug 02 14:08:53 2007 -0700 +++ b/kernel/sched.c Sat Aug 04 10:06:42 2007 +1000 @@ -2231,32 +2231,49 @@ out: } /* - * move_tasks tries to move up to max_nr_move tasks and max_load_move weighted - * load from busiest to this_rq, as part of a balancing operation within - * domain. Returns the number of tasks moved. + * move_tasks tries to move up to max_load_move weighted load from busiest to + * this_rq, as part of a balancing operation within domain sd. + * Returns 1 if successful and 0 otherwise. * * Called with both runqueues locked. 
*/ static int move_tasks(struct rq *this_rq, int this_cpu, struct rq *busiest, - unsigned long max_nr_move, unsigned long max_load_move, + unsigned long max_load_move, struct sched_domain *sd, enum cpu_idle_type idle, int *all_pinned) { struct sched_class *class = sched_class_highest; - unsigned long load_moved, total_nr_moved = 0, nr_moved; - long rem_load_move = max_load_move; + unsigned long total_load_moved = 0; do { - nr_moved = class->load_balance(this_rq, this_cpu, busiest, -max_nr_move, (unsigned long)rem_load_move, -sd, idle, all_pinned, &load_moved); - total_nr_moved += nr_moved; - max_nr_move -= nr_moved; - rem_load_move -= load_moved; + total_load_moved + = + class->load_balance(this_rq, this_cpu, busiest, +ULONG_MAX, max_load_move - total_load_moved, +sd, idle, all_pinned); class = class->next; - } while (class && max_nr_move && rem_load_move > 0); - - return total_nr_moved; + } while (class && max_load_move > total_load_moved); + + return total_load_moved > 0; +} + +/* + * move_one_task tries to move exactly one task from busiest to this_rq, as + * part of active balancing operations within domain
[PATCH] Tidy up left over smpnice code after changes introduced with CFS
1. The only place that RTPRIO_TO_LOAD_WEIGHT() is used is in the call to move_tasks() in the function active_load_balance() and its purpose here is just to make sure that the load to be moved is big enough to ensure that exactly one task is moved (if there's one available). This can be accomplished by using ULONG_MAX instead and this allows RTPRIO_TO_LOAD_WEIGHT() to be deleted. 2. This, in turn, allows PRIO_TO_LOAD_WEIGHT() to be deleted. 3. This allows load_weight() to be deleted which allows TIME_SLICE_NICE_ZERO to be deleted along with the comment above it. Signed-off-by: Peter Williams <[EMAIL PROTECTED]> -- Peter Williams [EMAIL PROTECTED] "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce diff -r 622a128d084b kernel/sched.c --- a/kernel/sched.c Mon Jul 30 21:54:37 2007 -0700 +++ b/kernel/sched.c Thu Aug 02 16:21:19 2007 +1000 @@ -727,19 +727,6 @@ static void update_curr_load(struct rq * * slice expiry etc. */ -/* - * Assume: static_prio_timeslice(NICE_TO_PRIO(0)) == DEF_TIMESLICE - * If static_prio_timeslice() is ever changed to break this assumption then - * this code will need modification - */ -#define TIME_SLICE_NICE_ZERO DEF_TIMESLICE -#define load_weight(lp) \ - (((lp) * SCHED_LOAD_SCALE) / TIME_SLICE_NICE_ZERO) -#define PRIO_TO_LOAD_WEIGHT(prio) \ - load_weight(static_prio_timeslice(prio)) -#define RTPRIO_TO_LOAD_WEIGHT(rp) \ - (PRIO_TO_LOAD_WEIGHT(MAX_RT_PRIO) + load_weight(rp)) - #define WEIGHT_IDLEPRIO 2 #define WMULT_IDLEPRIO (1 << 31) @@ -2906,8 +2893,7 @@ static void active_load_balance(struct r schedstat_inc(sd, alb_cnt); if (move_tasks(target_rq, target_cpu, busiest_rq, 1, - RTPRIO_TO_LOAD_WEIGHT(100), sd, CPU_IDLE, - NULL)) + ULONG_MAX, sd, CPU_IDLE, NULL)) schedstat_inc(sd, alb_pushed); else schedstat_inc(sd, alb_failed);
Minor errors in 2.6.23-rc1-rt2 series
I've just been reviewing these patches and have spotted a couple of errors that look like they were caused by fuzz during the patch process. A patch that corrects the errors is attached. Cheers Peter -- Peter Williams [EMAIL PROTECTED] "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce diff -r e02fd64426b9 arch/i386/boot/compressed/Makefile --- a/arch/i386/boot/compressed/MakefileThu Jul 26 10:33:58 2007 +1000 +++ b/arch/i386/boot/compressed/MakefileThu Jul 26 11:17:35 2007 +1000 @@ -9,10 +9,9 @@ EXTRA_AFLAGS := -traditional EXTRA_AFLAGS := -traditional LDFLAGS_vmlinux := -T -CFLAGS := -m32 -D__KERNEL__ -Iinclude -O2 -fno-strict-aliasing hostprogs-y:= relocs -CFLAGS := -m32 -D__KERNEL__ $(LINUX_INCLUDE) -O2 \ +CFLAGS := -m32 -D__KERNEL__ $(LINUX_INCLUDE) -Iinclude -O2 \ -fno-strict-aliasing -fPIC \ $(call cc-option,-ffreestanding) \ $(call cc-option,-fno-stack-protector) diff -r e02fd64426b9 arch/i386/kernel/smp.c --- a/arch/i386/kernel/smp.cThu Jul 26 10:33:58 2007 +1000 +++ b/arch/i386/kernel/smp.cThu Jul 26 11:17:35 2007 +1000 @@ -651,7 +651,6 @@ fastcall notrace void smp_reschedule_int fastcall notrace void smp_reschedule_interrupt(struct pt_regs *regs) { trace_special(regs->eip, 0, 0); - trace_special(regs->eip, 0, 0); ack_APIC_irq(); set_tsk_need_resched(current); } diff -r e02fd64426b9 include/asm-mips/mipsregs.h --- a/include/asm-mips/mipsregs.h Thu Jul 26 10:33:58 2007 +1000 +++ b/include/asm-mips/mipsregs.h Thu Jul 26 11:17:35 2007 +1000 @@ -710,7 +710,7 @@ do { \ unsigned long long __val; \ unsigned long __flags; \ \ - local_irq_save(flags); \ + local_irq_save(__flags);\ if (sel == 0) \ __asm__ __volatile__( \ ".set\tmips64\n\t" \
Re: [ANNOUNCE][RFC] PlugSched-6.5.1 for 2.6.22
Ingo Molnar wrote: > * Peter Williams <[EMAIL PROTECTED]> wrote: > >> Probably the last one now that CFS is in the main line :-(. > > hm, why is CFS in mainline a problem? It means a major rewrite of the plugsched interface and I'm not sure that it's worth it (if CFS works well). However, note that I did say probably not definitely :-). I'll play with it and see what happens. > The CFS merge should make the life > of development/test patches like plugsched conceptually easier. (it will > certainly cause a lot of churn, but that's for the better i think.) I don't think that is necessarily the case. > > Most of the schedulers in plugsched should be readily adaptable to the > modular scheduling-policy scheme of the upstream scheduler. I don't think that this is necessarily true. Ingosched and ingo_ll are definitely out and I don't feel like converting staircase and nicksched as I have no real interest in them. Perhaps I'll just create the interface and some schedulers based on my own ideas and let others such as Con and Nick add schedulers if they're still that way inclined. > I'm sure > there will be some minor issues as isolation of the modules is not > enforced right now - and i'd be happy to review (and potentially apply) > common-sense patches that improve the framework. Thanks for the offer of support (it may sway my decision), Peter -- Peter Williams [EMAIL PROTECTED] "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Forward port of latest RT patch (2.6.21.5-rt20) to 2.6.22 available
Gene Heskett wrote: > On Friday 13 July 2007, Peter Williams wrote: >> Ingo Molnar wrote: >>> * Gregory Haskins <[EMAIL PROTECTED]> wrote: >>>> On Thu, 2007-07-12 at 14:07 +0200, Ingo Molnar wrote: >>>>> * Gregory Haskins <[EMAIL PROTECTED]> wrote: >>>>>> Hi Ingo, Thomas, and the greater linux-rt community, >>>>>> >>>>>> I just wanted to let you guys know that our team has a port of >>>>>> the 21.5-rt20 patch for the 2.6.22 kernel available. [...] >>>>> great! We had the upstream -rt port to .22 in the works too, it was just >>>>> held up by the hpet breakage - which Thomas managed to fix earlier >>>>> today. I've released the 2.6.22.1-rt1 patch to the usual place: >>>>> >>>>> http://redhat.com/~mingo/realtime-preempt/ >>>> Thats awesome, Ingo! Thanks! Could you publish a broken out version >>>> as well? We found it extremely valuable to be able to bisect this >>>> beast while working on the 21-22 port. >>> we are working on something in this area :) Stay tuned ... >> I've just been reviewing these patches and have spotted an error in the >> file mm/slob.c at lines 500-501 whereby a non existent variable "c" is >> referenced. The attached patch is a proposed fix to the problem. > > Could this explain why 2.6.22.1-rt1 seems to use a lot of swap? I've been as > high as 570 megs into swap, currently at 286megs after doing a > swapoff --a;swapon -a about 8 hours ago. No. This problem would have caused the build to fail if slob was configured. Peter -- Peter Williams [EMAIL PROTECTED] "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Forward port of latest RT patch (2.6.21.5-rt20) to 2.6.22 available
Ingo Molnar wrote: > * Gregory Haskins <[EMAIL PROTECTED]> wrote: > >> On Thu, 2007-07-12 at 14:07 +0200, Ingo Molnar wrote: >>> * Gregory Haskins <[EMAIL PROTECTED]> wrote: >>> >>>> Hi Ingo, Thomas, and the greater linux-rt community, >>>> >>>>I just wanted to let you guys know that our team has a port of >>>> the 21.5-rt20 patch for the 2.6.22 kernel available. [...] >>> great! We had the upstream -rt port to .22 in the works too, it was just >>> held up by the hpet breakage - which Thomas managed to fix earlier >>> today. I've released the 2.6.22.1-rt1 patch to the usual place: >>> >>> http://redhat.com/~mingo/realtime-preempt/ >> Thats awesome, Ingo! Thanks! Could you publish a broken out version >> as well? We found it extremely valuable to be able to bisect this >> beast while working on the 21-22 port. > > we are working on something in this area :) Stay tuned ... I've just been reviewing these patches and have spotted an error in the file mm/slob.c at lines 500-501 whereby a non existent variable "c" is referenced. The attached patch is a proposed fix to the problem. -- Peter Williams [EMAIL PROTECTED] "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce Fix error in realtime-preempt patch for mm/slob.c This error was caused by a change to slob_free()'s interface. 
Signed-off-by: Peter Williams <[EMAIL PROTECTED]> diff -r cb0010b7bffe mm/slob.c --- a/mm/slob.c Fri Jul 13 15:24:45 2007 +1000 +++ b/mm/slob.c Fri Jul 13 16:23:02 2007 +1000 @@ -493,14 +493,14 @@ void *kmem_cache_zalloc(struct kmem_cach } EXPORT_SYMBOL(kmem_cache_zalloc); -static void __kmem_cache_free(void *b, int size) +static void __kmem_cache_free(struct kmem_cache *c, void *b) { atomic_dec(&c->items); if (c->size <= MAX_SLOB_CACHE_SIZE) slob_free(c, b, c->size); else - free_pages((unsigned long)b, get_order(size)); + free_pages((unsigned long)b, get_order(c->size)); } static void kmem_rcu_free(struct rcu_head *head) @@ -508,7 +508,7 @@ static void kmem_rcu_free(struct rcu_hea struct slob_rcu *slob_rcu = (struct slob_rcu *)head; void *b = (void *)slob_rcu - (slob_rcu->size - sizeof(struct slob_rcu)); - __kmem_cache_free(b, slob_rcu->size); + __kmem_cache_free(slob_rcu, b); } void kmem_cache_free(struct kmem_cache *c, void *b) @@ -520,7 +520,7 @@ void kmem_cache_free(struct kmem_cache * slob_rcu->size = c->size; call_rcu(&slob_rcu->head, kmem_rcu_free); } else { - __kmem_cache_free(b, c->size); + __kmem_cache_free(c, b); } } EXPORT_SYMBOL(kmem_cache_free);
[ANNOUNCE][RFC] PlugSched-6.5.1 for 2.6.22
Probably the last one now that CFS is in the main line :-(. A patch for 2.6.22 is available at: <http://downloads.sourceforge.net/cpuse/plugsched-6.5.1-for-2.6.22.patch> Very Brief Documentation: You can select a default scheduler at kernel build time. If you wish to boot with a scheduler other than the default it can be selected at boot time by adding: cpusched=<scheduler> to the boot command line where <scheduler> is one of: ingosched, ingo_ll, nicksched, staircase, spa_no_frills, spa_ws, spa_svr, spa_ebs or zaphod. If you don't change the default when you build the kernel the default scheduler will be ingosched (which is the normal scheduler). The scheduler in force on a running system can be determined by the contents of: /proc/scheduler Control parameters for the scheduler can be read/set via files in: /sys/cpusched/<scheduler>/ Peter -- Peter Williams [EMAIL PROTECTED] "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
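As a usage sketch (the commands assume a plugsched kernel built as above; spa_ws and the time_slice parameter name are only examples, actual parameter names depend on the scheduler):

```shell
# Boot-time selection: append to the kernel command line, e.g. in GRUB:
#   kernel /vmlinuz-2.6.22 ro root=/dev/sda1 cpusched=spa_ws

# On the running plugsched kernel:
cat /proc/scheduler                        # which scheduler is in force
ls /sys/cpusched/spa_ws/                   # its control parameters
echo 500 > /sys/cpusched/spa_ws/time_slice # hypothetical parameter name
```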
Re: [patch] CFS scheduler, -v12
Siddha, Suresh B wrote: On Tue, May 29, 2007 at 07:18:18PM -0700, Peter Williams wrote: Siddha, Suresh B wrote: I can try 32-bit kernel to check. Don't bother. I just checked 2.6.22-rc3 and the problem is not present which means something between rc2 and rc3 has fixed the problem. I hate it when problems (appear to) fix themselves as it usually means they're just hiding. I didn't see any patches between rc2 and rc3 that were likely to have fixed this (but doesn't mean there wasn't one). I'm wondering whether I should do a git bisect to see if I can find where it got fixed? Could you see if you can reproduce it on 2.6.22-rc2? No. Just tried 2.6.22-rc2 64-bit version at runlevel 3 on my remote system at office. 15 attempts didn't show the issue. Sure that nothing changed in your test setup? More experiments tomorrow morning.. I've finished bisecting and the patch at which things appear to improve is cd5477911fc9f5cc64678e2b95cdd606c59a11b5 which is in the middle of a bunch of patches reorganizing the link phase of the build. Patch description is: kbuild: add "Section mismatch" warning whitelist for powerpc author Li Yang <[EMAIL PROTECTED]> Mon, 14 May 2007 10:04:28 +0000 (18:04 +0800) committer Sam Ravnborg <[EMAIL PROTECTED]> Sat, 19 May 2007 07:11:57 +0000 (09:11 +0200) commit cd5477911fc9f5cc64678e2b95cdd606c59a11b5 tree d893f07b0040d36dfc60040dc695384e9afcf103 parent f892b7d480eec809a5dfbd6e65742b3f3155e50e This patch fixes the following class of "Section mismatch" warnings when building powerpc platforms. 
WARNING: arch/powerpc/kernel/built-in.o - Section mismatch: reference to .init.data:.got2 from prom_entry (offset 0x0) WARNING: arch/powerpc/platforms/built-in.o - Section mismatch: reference to .init.text:mpc8313_rdb_probe from .machine.desc after 'mach_mpc8313_rdb' (at offset 0x4) Signed-off-by: Li Yang <[EMAIL PROTECTED]> Signed-off-by: Sam Ravnborg <[EMAIL PROTECTED]> scripts/mod/modpost.c Peter -- Peter Williams [EMAIL PROTECTED] "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] CFS scheduler, -v12
Siddha, Suresh B wrote: On Tue, May 29, 2007 at 07:18:18PM -0700, Peter Williams wrote: Siddha, Suresh B wrote: I can try 32-bit kernel to check. Don't bother. I just checked 2.6.22-rc3 and the problem is not present which means something between rc2 and rc3 has fixed the problem. I hate it when problems (appear to) fix themselves as it usually means they're just hiding. I didn't see any patches between rc2 and rc3 that were likely to have fixed this (but doesn't mean there wasn't one). I'm wondering whether I should do a git bisect to see if I can find where it got fixed? Could you see if you can reproduce it on 2.6.22-rc2? No. Just tried 2.6.22-rc2 64-bit version at runlevel 3 on my remote system at office. 15 attempts didn't show the issue. Sure that nothing changed in your test setup? I just rechecked with an old kernel and the problem was still there. Peter -- Peter Williams [EMAIL PROTECTED] "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [ckrm-tech] [RFC] [PATCH 0/3] Add group fairness to CFS
William Lee Irwin III wrote: On Wed, May 30, 2007 at 10:09:28AM +1000, Peter Williams wrote: So what you're saying is that you think dynamic priority (or its equivalent) should be used for load balancing instead of static priority? It doesn't do much in other schemes, but when fairness is directly measured by the dynamic priority, it is a priori more meaningful. This is not to say the net effect of using it is so different. I suspect that while it's probably theoretically better it wouldn't make much difference on a real system (probably not enough to justify any extra complexity if there were any). The exception might be on systems where there were lots of CPU intensive tasks that used relatively large chunks of CPU each time they were runnable which would give the load balancer a more stable load to try and balance. It might be worth the extra effort to get it exactly right on those systems. On most normal systems this isn't the case and the load balancer is always playing catch up to a constantly changing scenario. Peter -- Peter Williams [EMAIL PROTECTED] "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] CFS scheduler, -v12
Siddha, Suresh B wrote: On Tue, May 29, 2007 at 04:54:29PM -0700, Peter Williams wrote: I tried with various refresh rates of top too.. Do you see the issue at runlevel 3 too? I haven't tried that. Do your spinners ever relinquish the CPU voluntarily? Nope. Simple and plain while(1); 's I can try 32-bit kernel to check. Don't bother. I just checked 2.6.22-rc3 and the problem is not present which means something between rc2 and rc3 has fixed the problem. I hate it when problems (appear to) fix themselves as it usually means they're just hiding. I didn't see any patches between rc2 and rc3 that were likely to have fixed this (but doesn't mean there wasn't one). I'm wondering whether I should do a git bisect to see if I can find where it got fixed? Could you see if you can reproduce it on 2.6.22-rc2? Thanks Peter -- Peter Williams [EMAIL PROTECTED] "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [ckrm-tech] [RFC] [PATCH 0/3] Add group fairness to CFS
William Lee Irwin III wrote: William Lee Irwin III wrote: Lag should be considered in lieu of load because lag On Sun, May 27, 2007 at 11:29:51AM +1000, Peter Williams wrote: What's the definition of lag here? Lag is the deviation of a task's allocated CPU time from the CPU time it would be granted by the ideal fair scheduling algorithm (generalized processor sharing; take the limit of RR with per-task timeslices proportional to load weight as the scale factor approaches zero). Over what time period does this operate? Negative lag reflects receipt of excess CPU time. A close-to-canonical "fairness metric" is the maximum of the absolute values of the lags of all the tasks on the system. The "signed minimax pseudonorm" is the largest lag without taking absolute values; it's a term I devised ad hoc to describe the proposed algorithm. So what you're saying is that you think dynamic priority (or its equivalent) should be used for load balancing instead of static priority? William Lee Irwin III wrote: is what the scheduler is trying to minimize; On Sun, May 27, 2007 at 11:29:51AM +1000, Peter Williams wrote: This isn't always the case. Some may prefer fairness to minimal lag. Others may prefer particular tasks to receive preferential treatment. This comment does not apply. Generalized processor sharing expresses preferential treatment via weighting. Various other forms of preferential treatment require more elaborate idealized models. This was said before I realized that your "lag" is just a measure of fairness. load is not directly relevant, but appears to have some sort of relationship. Also, instead of pinned, unpinned should be considered. On Sun, May 27, 2007 at 11:29:51AM +1000, Peter Williams wrote: If you have total and pinned you can get unpinned. It's probably cheaper to maintain data for pinned than unpinned as there's less of it on normal systems. 
Regardless of the underlying accounting, I was just replying to your criticism of my suggestion to keep pinned task statistics and use them. I've presented a coherent algorithm. It may be that there's no demonstrable problem to solve. On the other hand, if there really is a question as to how to load balance in the presence of tasks pinned to cpus, I just answered it. Unless I missed something there's nothing in your suggestion that does anything more about handling pinned tasks than is already done by the load balancer. William Lee Irwin III wrote: Using the signed minimax pseudonorm (i.e. the highest signed lag, where positive is higher than all negative regardless of magnitude) on unpinned lags yields a rather natural load balancing algorithm consisting of migrating from highest to lowest signed lag, with progressively longer periods for periodic balancing across progressively higher levels of hierarchy in sched_domains etc. as usual. Basically skip over pinned tasks as far as lag goes. The trick with all that comes when tasks are pinned within a set of cpus (especially crossing sched_domains) instead of to a single cpu. On Sun, May 27, 2007 at 11:29:51AM +1000, Peter Williams wrote: Yes, this makes the cost of maintaining the required data higher which makes keeping pinned data more attractive than unpinned. BTW keeping data for sets of CPU affinities could cause problems as the number of possible sets is quite large (being 2 to the power of the number of CPUs). So you need an algorithm based on pinned data for single CPUs that knows the pinning isn't necessarily exclusive rather than one based on sets of CPUs. As I understand it (which may be wrong), the mechanism you describe below takes that approach. Yes, the mechanism I described takes that approach. William Lee Irwin III wrote: The smpnice affair is better phrased in terms of task weighting. It's simple to honor nice in such an arrangement. First unravel the grouping hierarchy, then weight by nice. 
This looks like [...] In such a manner nice numbers obey the principle of least surprise. On Sun, May 27, 2007 at 11:29:51AM +1000, Peter Williams wrote: Is it just me or did you stray from the topic of handling cpu affinity during load balancing to hierarchical load balancing? I couldn't see anything in the above explanation that would improve the handling of cpu affinity. There was a second issue raised to which I responded. I didn't stray per se. I addressed a second topic in the post. OK. To reiterate, I don't think that my suggestion is really necessary. I think that the current load balancing (setting aside a small bug that's being investigated) will come up with a good distribution of tasks to CPUs within the constraints imposed by any CPU affinity settings. Peter -- Peter Williams [EMAIL PROTECTED]
Re: [patch] CFS scheduler, -v12
Siddha, Suresh B wrote: On Thu, May 24, 2007 at 04:23:19PM -0700, Peter Williams wrote: Siddha, Suresh B wrote: On Thu, May 24, 2007 at 12:43:58AM -0700, Peter Williams wrote: Further testing indicates that CONFIG_SCHED_MC is not implicated and it's CONFIG_SCHED_SMT that's causing the problem. This rules out the code in find_busiest_group() as it is common to both macros. I think this makes the scheduling domain parameter values the most likely cause of the problem. I'm not very familiar with this code so I've added those who've modified this code in the last year or so to the address of this e-mail. What platform is this? I remember you mentioned it's a 2 cpu box. Is it dual core or dual package or one with HT? It's a single CPU HT box i.e. 2 virtual CPUs. "cat /proc/cpuinfo" produces: Peter, I tried on a similar box and couldn't reproduce this problem with an x86_64 Mine's a 32 bit machine. 2.6.22-rc3 kernel I haven't tried rc3 yet. and using defconfig (which has SCHED_SMT turned on). I am using top and just the spinners. I don't have gkrellm running, is that required to reproduce the issue? Not necessarily. But you may need to do a number of trials as sheer chance plays a part. I tried a number of times and also in runlevels 3 and 5 (with top running in an xterm in the case of runlevel 5). I've always done it in run level 5 using gnome-terminal. I use 10 consecutive trials without seeing the problem as an indication of its absence but will cut that short if I see a 3/1 split which quickly recovers (see below). In runlevel 5, occasionally for one refresh screen of top, I see three spinners on one cpu and one spinner on the other (with X or some other app also on the cpu with one spinner). But it balances nicely for the immediately following refresh of the top screen. Yes, that (the fact that it recovers quickly) confirms that the problem isn't present for your system.
If load balancing occurs when tasks other than the spinners are actually running, a 3/1 split for the spinners is a reasonable outcome, so seeing the occasional 3/1 split is OK but it should return to 2/2 as soon as the other tasks sleep. When I'm doing my tests (for the various combinations of macros) I always count a case where I see a 3/1 split that quickly recovers as proof that this problem isn't present for that case and cease testing. I tried with various refresh rates of top too.. Do you see the issue at runlevel 3 too? I haven't tried that. Do your spinners ever relinquish the CPU voluntarily? Peter -- Peter Williams [EMAIL PROTECTED]
Re: [ckrm-tech] [RFC] [PATCH 0/3] Add group fairness to CFS
Peter Williams wrote: Srivatsa Vaddagiri wrote: On Sat, May 26, 2007 at 10:17:42AM +1000, Peter Williams wrote: I don't think that ignoring cpu affinity is an option. Setting the cpu affinity of tasks is a deliberate policy action on the part of the system administrator and has to be honoured. mmm ..but users can set cpu affinity w/o administrator privileges .. OK. So you have to assume the users know what they're doing. :-) In reality though, the policy of allowing ordinary users to set affinity on their tasks should be rethought. After more contemplation, I now think I may have gone overboard here. I am now of the opinion that any degradation of overall system performance due to the use of cpu affinity would be confined to the tasks with cpu affinity set. So there's no need to prevent ordinary users from setting cpu affinity on their own processes as any degradation will only affect them. So it goes back to the situation where you have to assume that they know what they're doing and obey their policy. In any case, there's no point having cpu affinity if it's going to be ignored. Maybe you could have two levels of affinity: 1. if set by root it must be obeyed; and 2. if set by an ordinary user it can be overridden if the best interests of the system dictate. BUT I think that would be a bad idea. This idea is now not just bad but unnecessary. Peter -- Peter Williams [EMAIL PROTECTED]
Re: [ckrm-tech] [RFC] [PATCH 0/3] Add group fairness to CFS
Srivatsa Vaddagiri wrote: On Sat, May 26, 2007 at 10:17:42AM +1000, Peter Williams wrote: I don't think that ignoring cpu affinity is an option. Setting the cpu affinity of tasks is a deliberate policy action on the part of the system administrator and has to be honoured. mmm ..but users can set cpu affinity w/o administrator privileges .. OK. So you have to assume the users know what they're doing. :-) In reality though, the policy of allowing ordinary users to set affinity on their tasks should be rethought. In any case, there's no point having cpu affinity if it's going to be ignored. Maybe you could have two levels of affinity: 1. if set by root it must be obeyed; and 2. if set by an ordinary user it can be overridden if the best interests of the system dictate. BUT I think that would be a bad idea. Peter -- Peter Williams [EMAIL PROTECTED]
Re: [ckrm-tech] [RFC] [PATCH 0/3] Add group fairness to CFS
William Lee Irwin III wrote: Srivatsa Vaddagiri wrote: Ingo/Peter, any thoughts here? CFS and smpnice probably is "broken" with respect to such example as above albeit for nice-based tasks. On Sat, May 26, 2007 at 10:17:42AM +1000, Peter Williams wrote: See above. I think that faced with cpu affinity use by the system administrator that smpnice will tend towards a task to cpu allocation that is (close to) the best that can be achieved without violating the cpu affinity assignments. (It may take a little longer than normal but it should get there eventually.) You have to assume that the system administrator knows what (s)he's doing and is willing to accept the impact of their policy decision on the overall system performance. Having said that, if it was deemed necessary you could probably increase the speed at which the load balancer converged on a good result in the face of cpu affinity by keeping a "pinned weighted load" value for each run queue and using that to modify find_busiest_group() and find_busiest_queue() to be a bit smarter. But I'm not sure that it would be worth the added complexity. Just in case anyone was looking for algorithms... Lag should be considered in lieu of load because lag What's the definition of lag here? is what the scheduler is trying to minimize; This isn't always the case. Some may prefer fairness to minimal lag. Others may prefer particular tasks to receive preferential treatment. load is not directly relevant, but appears to have some sort of relationship. Also, instead of pinned, unpinned should be considered. If you have total and pinned you can get unpinned. It's probably cheaper to maintain data for pinned than unpinned as there's less of it on normal systems. It's unpinned that load balancing can actually migrate. True but see previous comment. Using the signed minimax pseudonorm (i.e. 
the highest signed lag, where positive is higher than all negative regardless of magnitude) on unpinned lags yields a rather natural load balancing algorithm consisting of migrating from highest to lowest signed lag, with progressively longer periods for periodic balancing across progressively higher levels of hierarchy in sched_domains etc. as usual. Basically skip over pinned tasks as far as lag goes. The trick with all that comes when tasks are pinned within a set of cpus (especially crossing sched_domains) instead of to a single cpu. Yes, this makes the cost of maintaining the required data higher which makes keeping pinned data more attractive than unpinned. BTW keeping data for sets of CPU affinities could cause problems as the number of possible sets is quite large (being 2 to the power of the number of CPUs). So you need an algorithm based on pinned data for single CPUs that knows the pinning isn't necessarily exclusive rather than one based on sets of CPUs. As I understand it (which may be wrong), the mechanism you describe below takes that approach. There one can just consider a cpu to enter a periodic load balance cycle, and then consider pushing and pulling, perhaps what could be called the "exchange lags" for the pair of cpus. That would be the minimax lag pseudonorms for the tasks migratable to both cpus of the pair. That makes the notion of moving things from highest to lowest lag (where load is now considered) unambiguous apart from whether all this converges, but not when to actually try to load balance vs. when not to, or when it's urgent vs. when it should be done periodically. To clarify that, an O(cpus**2) notion appears to be necessary, namely the largest exchange lag differential between any pair of cpus. 
There is also the open question of whether moving tasks between cpus with the highest exchange lag differential will actually reduce it or whether it runs the risk of increasing it by creating a larger exchange lag differential between different pairs of cpus. A similar open question is raised by localizing balancing decisions to sched_domains. What remains clear is that any such movement reduces the worst-case lag in the whole system. Because of that, the worst-case lag in the whole system monotonically decreases as balancing decisions are made, and that much is subject to an infinite descent argument. Unfortunately, determining the largest exchange lag differential appears to be more complex than merely finding the highest and lowest lags. Bipartite forms of the problem also arise from sched_domains. I doubt anyone's really paying any sort of attention, so I'll not really bother working out much more in the way of details with respect to load balancing. It may be that there are better ways to communicate algorithmic notions than prose descriptions. However, it's doubtful I'll produce anything in a timely enough fashion to attract or hold interest. The smpnice affair is better phrased in terms of task weighting. It's simple to honor nice in such an arrangement. First unravel the grouping hierarchy, then weight by nice.
Re: [ckrm-tech] [RFC] [PATCH 0/3] Add group fairness to CFS
William Lee Irwin III wrote:

Srivatsa Vaddagiri wrote: Ingo/Peter, any thoughts here? CFS and smpnice probably is broken with respect to such example as above albeit for nice-based tasks.

On Sat, May 26, 2007 at 10:17:42AM +1000, Peter Williams wrote: See above. I think that faced with cpu affinity use by the system administrator that smpnice will tend towards a task to cpu allocation that is (close to) the best that can be achieved without violating the cpu affinity assignments. (It may take a little longer than normal but it should get there eventually.) You have to assume that the system administrator knows what (s)he's doing and is willing to accept the impact of their policy decision on the overall system performance. Having said that, if it was deemed necessary you could probably increase the speed at which the load balancer converged on a good result in the face of cpu affinity by keeping a pinned weighted load value for each run queue and using that to modify find_busiest_group() and find_busiest_queue() to be a bit smarter. But I'm not sure that it would be worth the added complexity.

Just in case anyone was looking for algorithms... Lag should be considered in lieu of load because lag

What's the definition of lag here?

is what the scheduler is trying to minimize;

This isn't always the case. Some may prefer fairness to minimal lag. Others may prefer particular tasks to receive preferential treatment.

load is not directly relevant, but appears to have some sort of relationship. Also, instead of pinned, unpinned should be considered.

If you have total and pinned you can get unpinned. It's probably cheaper to maintain data for pinned than unpinned as there's less of it on normal systems.

It's unpinned that load balancing can actually migrate.

True but see previous comment.

Using the signed minimax pseudonorm (i.e. the highest signed lag, where positive is higher than all negative regardless of magnitude) on unpinned lags yields a rather natural load balancing algorithm consisting of migrating from highest to lowest signed lag, with progressively longer periods for periodic balancing across progressively higher levels of hierarchy in sched_domains etc. as usual. Basically skip over pinned tasks as far as lag goes. The trick with all that comes when tasks are pinned within a set of cpus (especially crossing sched_domains) instead of to a single cpu.

Yes, this makes the cost of maintaining the required data higher which makes keeping pinned data more attractive than unpinned. BTW keeping data for sets of CPU affinities could cause problems as the number of possible sets is quite large (being 2 to the power of the number of CPUs). So you need an algorithm based on pinned data for single CPUs that knows the pinning isn't necessarily exclusive rather than one based on sets of CPUs. As I understand it (which may be wrong), the mechanism you describe below takes that approach.

There one can just consider a cpu to enter a periodic load balance cycle, and then consider pushing and pulling, perhaps what could be called the exchange lags for the pair of cpus. That would be the minimax lag pseudonorms for the tasks migratable to both cpus of the pair. That makes the notion of moving things from highest to lowest lag (where load is now considered) unambiguous apart from whether all this converges, but not when to actually try to load balance vs. when not to, or when it's urgent vs. when it should be done periodically. To clarify that, an O(cpus**2) notion appears to be necessary, namely the largest exchange lag differential between any pair of cpus.

There is also the open question of whether moving tasks between cpus with the highest exchange lag differential will actually reduce it or whether it runs the risk of increasing it by creating a larger exchange lag differential between different pairs of cpus. A similar open question is raised by localizing balancing decisions to sched_domains. What remains clear is that any such movement reduces the worst-case lag in the whole system. Because of that, the worst-case lag in the whole system monotonically decreases as balancing decisions are made, and that much is subject to an infinite descent argument. Unfortunately, determining the largest exchange lag differential appears to be more complex than merely finding the highest and lowest lags. Bipartite forms of the problem also arise from sched_domains.

I doubt anyone's really paying any sort of attention, so I'll not really bother working out much more in the way of details with respect to load balancing. It may be that there are better ways to communicate algorithmic notions than prose descriptions. However, it's doubtful I'll produce anything in a timely enough fashion to attract or hold interest. The smpnice affair is better phrased in terms of task weighting. It's simple to honor nice in such an arrangement. First unravel the grouping hierarchy
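The signed-minimax selection described above can be illustrated with a toy fragment. This is only a sketch of the idea as stated in the thread: the per-CPU "unpinned lag" values are assumed to already be maintained, the struct and function names are invented, and the pairwise exchange-lag subtlety discussed above is deliberately ignored.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical per-CPU summary: the highest signed lag among tasks that
 * are actually migratable off this CPU ("unpinned lag").  Positive means
 * some task is behind where it should be; negative means all are ahead.
 * Units and layout are invented for illustration only. */
struct cpu_lag {
	int unpinned_lag;
};

/* Migrate from highest to lowest signed lag: pick the CPU whose worst
 * unpinned task is furthest behind as the source, and the CPU whose
 * worst unpinned task is furthest ahead as the destination. */
static void pick_balance_pair(const struct cpu_lag *cpus, int ncpus,
			      int *src, int *dst)
{
	int hi = INT_MIN, lo = INT_MAX, i;

	for (i = 0; i < ncpus; i++) {
		if (cpus[i].unpinned_lag > hi) {
			hi = cpus[i].unpinned_lag;
			*src = i;	/* most behind: push work away */
		}
		if (cpus[i].unpinned_lag < lo) {
			lo = cpus[i].unpinned_lag;
			*dst = i;	/* most ahead: can absorb work */
		}
	}
}
```

With unpinned lags {5, -3, 2} this picks CPU 0 as the source and CPU 1 as the destination; pinned tasks simply never contribute to the per-CPU value, which is the "skip over pinned tasks" part.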
Re: [ckrm-tech] [RFC] [PATCH 0/3] Add group fairness to CFS
Srivatsa Vaddagiri wrote: Good example :) USER2's single task will have to share its CPU with USER1's 50 tasks (unless we modify the smpnice load balancer to disregard cpu affinity that is - which I would not prefer to do).

I don't think that ignoring cpu affinity is an option. Setting the cpu affinity of tasks is a deliberate policy action on the part of the system administrator and has to be honoured. Load balancing has to do the best it can in these circumstances which may mean suboptimal distribution of the load BUT it is the result of a deliberate policy decision by the system administrator.

Ingo/Peter, any thoughts here? CFS and smpnice probably is "broken" with respect to such example as above albeit for nice-based tasks.

See above. I think that faced with cpu affinity use by the system administrator that smpnice will tend towards a task to cpu allocation that is (close to) the best that can be achieved without violating the cpu affinity assignments. (It may take a little longer than normal but it should get there eventually.) You have to assume that the system administrator knows what (s)he's doing and is willing to accept the impact of their policy decision on the overall system performance. Having said that, if it was deemed necessary you could probably increase the speed at which the load balancer converged on a good result in the face of cpu affinity by keeping a "pinned weighted load" value for each run queue and using that to modify find_busiest_group() and find_busiest_queue() to be a bit smarter. But I'm not sure that it would be worth the added complexity.

Peter -- Peter Williams [EMAIL PROTECTED] "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
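The "pinned weighted load" suggestion above amounts to tracking, per run queue, the summed weight of tasks that cannot be migrated, so that queue comparisons can use only the movable portion. A hedged sketch of that bookkeeping (field and function names are invented; the real struct rq and find_busiest_queue() differ):

```c
#include <assert.h>

/* Invented per-runqueue summary for illustration: total weighted load
 * plus the portion contributed by tasks pinned to this CPU. */
struct rq_load {
	unsigned long total_weight;	/* all queued tasks */
	unsigned long pinned_weight;	/* tasks with single-CPU affinity */
};

/* Only the unpinned portion can actually be migrated away. */
static unsigned long movable_load(const struct rq_load *rq)
{
	return rq->total_weight - rq->pinned_weight;
}

/* A smarter busiest-queue choice: a queue that is heavily loaded purely
 * by pinned tasks is a useless migration source, so rank by movable
 * load instead of total load. */
static int busiest_by_movable(const struct rq_load *rqs, int n)
{
	int i, busiest = 0;

	for (i = 1; i < n; i++)
		if (movable_load(&rqs[i]) > movable_load(&rqs[busiest]))
			busiest = i;
	return busiest;
}
```

In the USER1/USER2 example, a queue carrying 50 pinned tasks would report a large total load but a small movable load, so the balancer would stop repeatedly selecting it as a migration source.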
Re: [patch] CFS scheduler, -v12
Siddha, Suresh B wrote: On Thu, May 24, 2007 at 12:43:58AM -0700, Peter Williams wrote: Peter Williams wrote: The relevant code, find_busiest_group() and find_busiest_queue(), has a lot of code that is ifdefed by CONFIG_SCHED_MC and CONFIG_SCHED_SMT and, as these macros were defined in the kernels I was testing with, I built a kernel with these macros undefined and reran my tests. The problems/anomalies were not present in 10 consecutive tests on this new kernel. Even better on the few occasions that a 3/1 split did occur it was quickly corrected to 2/2 and top was reporting approx 49% of CPU for all spinners throughout each of the ten tests. So all that is required now is an analysis of the code inside the ifdefs to see why it is causing a problem.

Further testing indicates that CONFIG_SCHED_MC is not implicated and it's CONFIG_SCHED_SMT that's causing the problem. This rules out the code in find_busiest_group() as it is common to both macros. I think this makes the scheduling domain parameter values the most likely cause of the problem. I'm not very familiar with this code so I've added those who've modified this code in the last year or so to the address of this e-mail.

What platform is this? I remember you mentioned it's a 2 cpu box. Is it dual core or dual package or one with HT?

It's a single CPU HT box i.e. 2 virtual CPUs. 
"cat /proc/cpuinfo" produces:

processor : 0
vendor_id : GenuineIntel
cpu family : 15
model : 3
model name : Intel(R) Pentium(R) 4 CPU 3.20GHz
stepping : 4
cpu MHz : 3201.145
cache size : 1024 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 1
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe constant_tsc pni monitor ds_cpl cid xtpr
bogomips : 6403.97
clflush size : 64

processor : 1
vendor_id : GenuineIntel
cpu family : 15
model : 3
model name : Intel(R) Pentium(R) 4 CPU 3.20GHz
stepping : 4
cpu MHz : 3201.145
cache size : 1024 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 1
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe constant_tsc pni monitor ds_cpl cid xtpr
bogomips : 6400.92
clflush size : 64

Peter -- Peter Williams [EMAIL PROTECTED] "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] CFS scheduler, -v12
Peter Williams wrote: Peter Williams wrote: Peter Williams wrote: Dmitry Adamushko wrote: On 18/05/07, Peter Williams <[EMAIL PROTECTED]> wrote: [...]

One thing that might work is to jitter the load balancing interval a bit. The reason I say this is that one of the characteristics of top and gkrellm is that they run at a more or less constant interval (and, in this case, X would also be following this pattern as it's doing screen updates for top and gkrellm) and this means that it's possible for the load balancing interval to synchronize with their intervals which in turn causes the observed problem.

Hum.. I guess, a 0/4 scenario wouldn't fit well in this explanation..

No, and I haven't seen one.

all 4 spinners "tend" to be on CPU0 (and as I understand each gets ~25% approx.?), so there must be plenty of moments for *idle_balance()* to be called on CPU1 - as gkrellm, top and X consume together just a few % of CPU. Hence, we should not be that dependent on the load balancing interval here..

The split that I see is 3/1 and neither CPU seems to be favoured with respect to getting the majority. However, top, gkrellm and X seem to be always on the CPU with the single spinner. The CPU% reported by top is approx. 33%, 33%, 33% and 100% for the spinners. If I renice the spinners to -10 (so that their load weights dominate the run queue load calculations) the problem goes away and the spinner to CPU allocation is 2/2 and top reports them all getting approx. 50% each.

For no good reason other than curiosity, I tried a variation of this experiment where I reniced the spinners to 10 instead of -10 and, to my surprise, they were allocated 2/2 to the CPUs on average. I say on average because the allocations were a little more volatile and occasionally 0/4 splits would occur but these would last for less than one top cycle before the 2/2 was re-established. 
The quickness of these recoveries would indicate that it was most likely the idle balance mechanism that restored the balance. This may point the finger at the tick based load balance mechanism being too conservative in when it decides whether tasks need to be moved.

The relevant code, find_busiest_group() and find_busiest_queue(), has a lot of code that is ifdefed by CONFIG_SCHED_MC and CONFIG_SCHED_SMT and, as these macros were defined in the kernels I was testing with, I built a kernel with these macros undefined and reran my tests. The problems/anomalies were not present in 10 consecutive tests on this new kernel. Even better on the few occasions that a 3/1 split did occur it was quickly corrected to 2/2 and top was reporting approx 49% of CPU for all spinners throughout each of the ten tests. So all that is required now is an analysis of the code inside the ifdefs to see why it is causing a problem.

Further testing indicates that CONFIG_SCHED_MC is not implicated and it's CONFIG_SCHED_SMT that's causing the problem. This rules out the code in find_busiest_group() as it is common to both macros. I think this makes the scheduling domain parameter values the most likely cause of the problem. I'm not very familiar with this code so I've added those who've modified this code in the last year or so to the address of this e-mail.

Peter -- Peter Williams [EMAIL PROTECTED] "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] CFS scheduler, -v12
Dmitry Adamushko wrote: On 22/05/07, Peter Williams <[EMAIL PROTECTED]> wrote: > [...] > Hum.. I guess, a 0/4 scenario wouldn't fit well in this explanation..

No, and I haven't seen one.

Well, I just took one of your calculated probabilities as something you have really observed - (*) below. "The probabilities for the 3 split possibilities for random allocation are: 2/2 (the desired outcome) is 3/8 likely, 1/3 is 4/8 likely, and 0/4 is 1/8 likely. <-- (*)"

These are the theoretical probabilities for the outcomes based on the random allocation of 4 tasks to 2 CPUs. There are, in fact, 16 different ways that 4 tasks can be assigned to 2 CPUs. 6 of these result in a 2/2 split, 8 in a 1/3 split and 2 in a 0/4 split.

The split that I see is 3/1 and neither CPU seems to be favoured with respect to getting the majority. However, top, gkrellm and X seem to be always on the CPU with the single spinner. The CPU% reported by top is approx. 33%, 33%, 33% and 100% for the spinners.

Yes. That said, idle_balance() is out of work in this case.

Which is why I reported the problem.

If I renice the spinners to -10 (so that their load weights dominate the run queue load calculations) the problem goes away and the spinner to CPU allocation is 2/2 and top reports them all getting approx. 50% each.

I wonder what would happen if X gets reniced to -10 instead (and spinners are at 0).. I guess, something I described in my previous mail (and dubbed "unlikely conspiracy" :) could happen, i.e. 0/4 and then idle_balance() comes into play..

Probably the same as I observed but it's easier to renice the spinners.

I see the 0/4 split for brief moments if I renice the spinners to 10 instead of -10 but the idle balancer quickly restores it to 2/2.

ok, I see. You have probably achieved a similar effect with the spinners being reniced to 10 (but here both "X" and "top" gain additional "weight" wrt the load balancing).

I'm playing with some jitter experiments at the moment. 
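The split counts quoted above (16 outcomes: 6 giving 2/2, 8 giving 1/3, 2 giving 0/4) can be verified by brute force. A small check, treating each of the 4 tasks as an independent coin flip between the 2 CPUs:

```c
#include <assert.h>

/* Enumerate all 2^4 = 16 ways 4 tasks can land on 2 CPUs and count the
 * resulting splits.  Bit i of mask says which CPU task i landed on. */
static void count_splits(int *even, int *three_one, int *four_zero)
{
	int mask, on_cpu0;

	*even = *three_one = *four_zero = 0;
	for (mask = 0; mask < 16; mask++) {
		on_cpu0 = !!(mask & 1) + !!(mask & 2) +
			  !!(mask & 4) + !!(mask & 8);
		if (on_cpu0 == 2)
			(*even)++;		/* 2/2 split */
		else if (on_cpu0 == 1 || on_cpu0 == 3)
			(*three_one)++;		/* 1/3 split */
		else
			(*four_zero)++;		/* 0/4 split */
	}
}
```

This reproduces the 6/16 = 3/8, 8/16 = 4/8 and 2/16 = 1/8 probabilities cited in the exchange.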
The amount of jitter needs to be small (a few tenths of a second) as the synchronization (if it's happening) is happening at the seconds level as the intervals for top and gkrellm will be in the 1 to 5 second range (I guess -- I haven't checked) and the load balancing is every 60 seconds.

Hum.. the "every 60 seconds" part puzzles me quite a bit. Looking at the run_rebalance_domain(), I'd say that it's normally overwritten by the following code

	if (time_after(next_balance, sd->last_balance + interval))
		next_balance = sd->last_balance + interval;

the "interval" seems to be *normally* shorter than "60*HZ" (according to the default params in topology.h).. moreover, in case of the CFS

	if (interval > HZ*NR_CPUS/10)
		interval = HZ*NR_CPUS/10;

so it can't be > 0.2 HZ in your case (== once in 200 ms at max with HZ=1000).. am I missing something? TIA

No, I did. But it's all academic as my synchronization theory is now dead -- see separate e-mail.

Peter -- Peter Williams [EMAIL PROTECTED] "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
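The clamp Dmitry quotes can be evaluated for the box under discussion. This fragment only restates the quoted two-line clamp with concrete values (HZ=1000 and NR_CPUS=2 are assumptions for this particular machine; the real kernel scales the interval by per-domain busy factors as well):

```c
#include <assert.h>

#define HZ      1000	/* assumed tick rate for this box */
#define NR_CPUS 2	/* single HT package: 2 logical CPUs */

/* The quoted CFS cap on the per-domain rebalance interval, in ticks:
 * HZ*NR_CPUS/10 = 200 ticks = 200 ms here, so a 60-second interval is
 * impossible, which is Dmitry's point. */
static unsigned long clamp_interval(unsigned long interval)
{
	if (interval > HZ * NR_CPUS / 10)
		interval = HZ * NR_CPUS / 10;
	return interval;
}
```

So even a hypothetical 60*HZ = 60000-tick interval collapses to 200 ticks, i.e. balancing at most every 200 ms rather than every 60 seconds.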
Re: [patch] CFS scheduler, -v12
Peter Williams wrote: Peter Williams wrote: Dmitry Adamushko wrote: On 18/05/07, Peter Williams <[EMAIL PROTECTED]> wrote: [...]

One thing that might work is to jitter the load balancing interval a bit. The reason I say this is that one of the characteristics of top and gkrellm is that they run at a more or less constant interval (and, in this case, X would also be following this pattern as it's doing screen updates for top and gkrellm) and this means that it's possible for the load balancing interval to synchronize with their intervals which in turn causes the observed problem.

Hum.. I guess, a 0/4 scenario wouldn't fit well in this explanation..

No, and I haven't seen one.

all 4 spinners "tend" to be on CPU0 (and as I understand each gets ~25% approx.?), so there must be plenty of moments for *idle_balance()* to be called on CPU1 - as gkrellm, top and X consume together just a few % of CPU. Hence, we should not be that dependent on the load balancing interval here..

The split that I see is 3/1 and neither CPU seems to be favoured with respect to getting the majority. However, top, gkrellm and X seem to be always on the CPU with the single spinner. The CPU% reported by top is approx. 33%, 33%, 33% and 100% for the spinners. If I renice the spinners to -10 (so that their load weights dominate the run queue load calculations) the problem goes away and the spinner to CPU allocation is 2/2 and top reports them all getting approx. 50% each.

For no good reason other than curiosity, I tried a variation of this experiment where I reniced the spinners to 10 instead of -10 and, to my surprise, they were allocated 2/2 to the CPUs on average. I say on average because the allocations were a little more volatile and occasionally 0/4 splits would occur but these would last for less than one top cycle before the 2/2 was re-established. The quickness of these recoveries would indicate that it was most likely the idle balance mechanism that restored the balance. 
This may point the finger at the tick based load balance mechanism being too conservative in when it decides whether tasks need to be moved. In the case where the spinners are at nice == 0, the idle balance mechanism never comes into play as the 0/4 split is never seen so only the tick based mechanism is in force in this case and this is where the anomalies are seen. This tick rebalance mechanism only situation is also true for the nice == -10 case but in this case the high load weights of the spinners overcome the tick based load balancing mechanism's conservatism e.g. the difference in queue loads for a 1/3 split in this case is the equivalent to the difference that would be generated by an imbalance of about 18 nice == 0 spinners i.e. too big to be ignored. The evidence seems to indicate that IF a rebalance operation gets initiated then the right amount of load will get moved. This new evidence weakens (but does not totally destroy) my synchronization (a.k.a. conspiracy) theory.

The relevant code, find_busiest_group() and find_busiest_queue(), has a lot of code that is ifdefed by CONFIG_SCHED_MC and CONFIG_SCHED_SMT and, as these macros were defined in the kernels I was testing with, I built a kernel with these macros undefined and reran my tests. The problems/anomalies were not present in 10 consecutive tests on this new kernel. Even better on the few occasions that a 3/1 split did occur it was quickly corrected to 2/2 and top was reporting approx 49% of CPU for all spinners throughout each of the ten tests. So all that is required now is an analysis of the code inside the ifdefs to see why it is causing a problem.

My synchronization theory is now dead.

Peter -- Peter Williams [EMAIL PROTECTED] "Learning, n. The kind of ignorance distinguishing the studious." 
-- Ambrose Bierce - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
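The "about 18 nice == 0 spinners" figure above can be checked back-of-envelope from the scheduler's load weight table of that era (nice 0 maps to 1024 and nice -10 to 9548 in prio_to_weight[]; treat these exact values as an assumption of this sketch). A 3/1 split of four equal spinners leaves the two queues differing by two spinner weights:

```c
#include <assert.h>

/* Assumed weights from the prio_to_weight[] table of the period. */
#define NICE_0_WEIGHT   1024UL
#define NICE_M10_WEIGHT 9548UL

/* For a 3/1 split the queue loads differ by 2 * weight(nice -10).
 * Express that imbalance in units of nice-0 tasks. */
static unsigned long imbalance_in_nice0_tasks(void)
{
	return 2 * NICE_M10_WEIGHT / NICE_0_WEIGHT;	/* 19096/1024 ~ 18 */
}
```

The integer result is 18 (18.6 before truncation), matching the message's claim that such an imbalance is far too big for the tick-based balancer to ignore.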
Re: [patch] CFS scheduler, -v12
Peter Williams wrote: Dmitry Adamushko wrote: On 18/05/07, Peter Williams <[EMAIL PROTECTED]> wrote:

[...] One thing that might work is to jitter the load balancing interval a bit. The reason I say this is that one of the characteristics of top and gkrellm is that they run at a more or less constant interval (and, in this case, X would also be following this pattern as it's doing screen updates for top and gkrellm) and this means that it's possible for the load balancing interval to synchronize with their intervals, which in turn causes the observed problem.

Hum.. I guess a 0/4 scenario wouldn't fit well in this explanation..

No, and I haven't seen one.

all 4 spinners "tend" to be on CPU0 (and as I understand each gets ~25% approx.?), so there must be plenty of moments for *idle_balance()* to be called on CPU1 - as gkrellm, top and X consume together just a few % of CPU. Hence, we should not be that dependent on the load balancing interval here..

The split that I see is 3/1 and neither CPU seems to be favoured with respect to getting the majority. However, top, gkrellm and X seem to be always on the CPU with the single spinner. The CPU% reported by top is approx. 33%, 33%, 33% and 100% for the spinners. If I renice the spinners to -10 (so that their load weights dominate the run queue load calculations) the problem goes away: the spinner to CPU allocation is 2/2 and top reports them all getting approx. 50% each.

For no good reason other than curiosity, I tried a variation of this experiment where I reniced the spinners to 10 instead of -10 and, to my surprise, they were allocated 2/2 to the CPUs on average. I say on average because the allocations were a little more volatile and occasionally 0/4 splits would occur, but these would last for less than one top cycle before the 2/2 was re-established. The quickness of these recoveries would indicate that it was most likely the idle balance mechanism that restored the balance.

This may point the finger at the tick based load balance mechanism being too conservative in when it decides whether tasks need to be moved. In the case where the spinners are at nice == 0, the idle balance mechanism never comes into play as the 0/4 split is never seen, so only the tick based mechanism is in force in this case and this is where the anomalies are seen. This tick-rebalance-only situation is also true for the nice == -10 case, but there the high load weights of the spinners overcome the tick based load balancing mechanism's conservatism; e.g. the difference in queue loads for a 1/3 split in this case is equivalent to the difference that would be generated by an imbalance of about 18 nice == 0 spinners, i.e. too big to be ignored. The evidence seems to indicate that IF a rebalance operation gets initiated then the right amount of load will get moved. This new evidence weakens (but does not totally destroy) my synchronization (a.k.a. conspiracy) theory.

Peter

PS As the total load weight for 4 nice == 10 tasks is only about 40% of the load weight of a single nice == 0 task, the occasional 0/4 split in the spinners at nice == 10 case is not unexpected, as it would be the desirable allocation if there were exactly one other running task at nice == 0.

--
Peter Williams [EMAIL PROTECTED] "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce
Re: [patch] CFS scheduler, -v12
Dmitry Adamushko wrote: On 18/05/07, Peter Williams <[EMAIL PROTECTED]> wrote:

[...] One thing that might work is to jitter the load balancing interval a bit. The reason I say this is that one of the characteristics of top and gkrellm is that they run at a more or less constant interval (and, in this case, X would also be following this pattern as it's doing screen updates for top and gkrellm) and this means that it's possible for the load balancing interval to synchronize with their intervals, which in turn causes the observed problem.

Hum.. I guess a 0/4 scenario wouldn't fit well in this explanation..

No, and I haven't seen one.

all 4 spinners "tend" to be on CPU0 (and as I understand each gets ~25% approx.?), so there must be plenty of moments for *idle_balance()* to be called on CPU1 - as gkrellm, top and X consume together just a few % of CPU. Hence, we should not be that dependent on the load balancing interval here..

The split that I see is 3/1 and neither CPU seems to be favoured with respect to getting the majority. However, top, gkrellm and X seem to be always on the CPU with the single spinner. The CPU% reported by top is approx. 33%, 33%, 33% and 100% for the spinners. If I renice the spinners to -10 (so that their load weights dominate the run queue load calculations) the problem goes away: the spinner to CPU allocation is 2/2 and top reports them all getting approx. 50% each.

It's also worth noting that I've had tests where the allocation started out 2/2 and the system changed it to 3/1, where it stabilized. So it's not just a case of bad luck with the initial CPU allocation when the tasks start and the load balancing failing to fix it (which was one of my earlier theories).

(unlikely conspiracy theory)

It's not a conspiracy. It's just dumb luck. :-)

- idle_balance() and load_balance() (the latter is dependent on the load balancing interval, which can be in sync with top/gkrellm activities as you suggest) always move either top or gkrellm between themselves.. esp. if X is reniced (so it gets additional "weight") and happens to be active (on CPU1) when load_balance() (kicked from scheduler_tick()) runs..

p.s. it's mainly theoretical speculations.. I recently started looking at the load-balancing code (unfortunately, I don't have an SMP machine which I can upgrade to the recent kernel) and so far for me it's mainly about making sure I see things sanely.

I'm playing with some jitter experiments at the moment. The amount of jitter needs to be small (a few tenths of a second) as the synchronization (if it's happening) is happening at the seconds level, since the intervals for top and gkrellm will be in the 1 to 5 second range (I guess -- I haven't checked) and the load balancing is every 60 seconds.

Peter
--
Peter Williams [EMAIL PROTECTED] "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce
Re: [patch] CFS scheduler, -v12
Dmitry Adamushko wrote: On 18/05/07, Peter Williams <[EMAIL PROTECTED]> wrote:

[...] One thing that might work is to jitter the load balancing interval a bit. The reason I say this is that one of the characteristics of top and gkrellm is that they run at a more or less constant interval (and, in this case, X would also be following this pattern as it's doing screen updates for top and gkrellm) and this means that it's possible for the load balancing interval to synchronize with their intervals, which in turn causes the observed problem. A jittered load balancing interval should break the synchronization. This would certainly be simpler than trying to change the move_task() logic for selecting which tasks to move.

Just another (quick) idea. Say, the load balancer would consider not only p->load_weight but also something like Tw(task) = (time_spent_on_runqueue / total_task's_runtime) * some_scale_constant as an additional "load" component (OTOH, when a task starts, it takes some time for this parameter to become meaningful). I guess it could address the scenarios you have described (but maybe break some others as well :) ... Any hints on why it's stupid?

Well, that is the kind of thing I was hoping to avoid for reasons of complexity. I think that the actual implementation would be more complex than it sounds and possibly require multiple runs down the list of moveable tasks, which would be bad for overhead. Basically, I don't think that the problem is serious enough to warrant a complex solution. But I may be wrong about how complex the implementation would be.

Peter
--
Peter Williams [EMAIL PROTECTED] "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce
Re: [patch] CFS scheduler, -v12
Peter Williams wrote: Ingo Molnar wrote: * Peter Williams <[EMAIL PROTECTED]> wrote:

I've now done this test on a number of kernels: 2.6.21 and 2.6.22-rc1, with and without CFS; and the problem is always present. It's not "nice" related as all four tasks are run at nice == 0.

could you try -v13 and did this behavior get better in any way?

It's still there but I've got a theory about what the problem is that is supported by some other tests I've done. What I'd forgotten is that I had gkrellm running as well as top (to observe which CPU tasks were on) at the same time as the spinners were running. This meant that between them top, gkrellm and X were using about 2% of the CPU -- not much, but enough to make it possible that at least one of them was running when the load balancer was trying to do its thing. This raises two possibilities: 1. the system looked balanced and 2. the system didn't look balanced but one of top, gkrellm or X was moved instead of one of the spinners.

If it's 1 then there's not much we can do about it except say that it only happens in these strange circumstances. If it's 2 then we may have to modify the way move_tasks() selects which tasks to move (if we think that the circumstances warrant it -- I'm not sure that this is the case).

To examine these possibilities I tried two variations of the test.

a. run the spinners at nice == -10 instead of nice == 0. When I did this the load balancing was perfect on 10 consecutive runs, which according to my calculations makes it 99.997% certain that this didn't happen by chance. This supports theory 2 above.

b. run the tests without gkrellm running but use nice == 0 for the spinners. When I did this the load balancing was mostly perfect but was quite volatile (switching between a 2/2 and 1/3 allocation of spinners to CPUs), but the %CPU allocation was quite good with the spinners all getting approximately 49% of a CPU each. This also supports theory 2 above and gives weak support to theory 1 above.

This leaves the question of what to do about it. Given that most CPU intensive tasks on a real system probably only run for a few tens of milliseconds, it probably won't matter much on a real system except that a malicious user could exploit it to disrupt a system. So my opinion is that we probably do need to do something about it but that it's not urgent.

One thing that might work is to jitter the load balancing interval a bit. The reason I say this is that one of the characteristics of top and gkrellm is that they run at a more or less constant interval (and, in this case, X would also be following this pattern as it's doing screen updates for top and gkrellm) and this means that it's possible for the load balancing interval to synchronize with their intervals, which in turn causes the observed problem. A jittered load balancing interval should break the synchronization. This would certainly be simpler than trying to change the move_task() logic for selecting which tasks to move.

I should have added that the reason I think this mooted synchronization is the cause of the problem is that I can think of no other way that tasks with such low activity (2% between the 3 of them) could cause the imbalance of the spinner to CPU allocation to be so persistent.

What do you think?

Peter
--
Peter Williams [EMAIL PROTECTED] "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce
Re: [patch] CFS scheduler, -v12
Ingo Molnar wrote: * Peter Williams <[EMAIL PROTECTED]> wrote:

I've now done this test on a number of kernels: 2.6.21 and 2.6.22-rc1, with and without CFS; and the problem is always present. It's not "nice" related as all four tasks are run at nice == 0.

could you try -v13 and did this behavior get better in any way?

It's still there but I've got a theory about what the problem is that is supported by some other tests I've done. What I'd forgotten is that I had gkrellm running as well as top (to observe which CPU tasks were on) at the same time as the spinners were running. This meant that between them top, gkrellm and X were using about 2% of the CPU -- not much, but enough to make it possible that at least one of them was running when the load balancer was trying to do its thing. This raises two possibilities: 1. the system looked balanced and 2. the system didn't look balanced but one of top, gkrellm or X was moved instead of one of the spinners.

If it's 1 then there's not much we can do about it except say that it only happens in these strange circumstances. If it's 2 then we may have to modify the way move_tasks() selects which tasks to move (if we think that the circumstances warrant it -- I'm not sure that this is the case).

To examine these possibilities I tried two variations of the test.

a. run the spinners at nice == -10 instead of nice == 0. When I did this the load balancing was perfect on 10 consecutive runs, which according to my calculations makes it 99.997% certain that this didn't happen by chance. This supports theory 2 above.

b. run the tests without gkrellm running but use nice == 0 for the spinners. When I did this the load balancing was mostly perfect but was quite volatile (switching between a 2/2 and 1/3 allocation of spinners to CPUs), but the %CPU allocation was quite good with the spinners all getting approximately 49% of a CPU each. This also supports theory 2 above and gives weak support to theory 1 above.

This leaves the question of what to do about it. Given that most CPU intensive tasks on a real system probably only run for a few tens of milliseconds, it probably won't matter much on a real system except that a malicious user could exploit it to disrupt a system. So my opinion is that we probably do need to do something about it but that it's not urgent.

One thing that might work is to jitter the load balancing interval a bit. The reason I say this is that one of the characteristics of top and gkrellm is that they run at a more or less constant interval (and, in this case, X would also be following this pattern as it's doing screen updates for top and gkrellm) and this means that it's possible for the load balancing interval to synchronize with their intervals, which in turn causes the observed problem. A jittered load balancing interval should break the synchronization. This would certainly be simpler than trying to change the move_task() logic for selecting which tasks to move.

What do you think?

Peter
--
Peter Williams [EMAIL PROTECTED] "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce
Re: [patch] CFS scheduler, -v12
Ingo Molnar wrote: * Peter Williams <[EMAIL PROTECTED]> wrote:

Load balancing appears to be badly broken in this version. When I started 4 hard spinners on my 2 CPU machine, one ended up on one CPU and the other 3 on the other CPU, and they stayed there.

could you try to debug this a bit more?

I've now done this test on a number of kernels: 2.6.21 and 2.6.22-rc1, with and without CFS; and the problem is always present. It's not "nice" related as all four tasks are run at nice == 0.

It's possible that this problem has been in the kernel for a while without being noticed as, even with totally random allocation of tasks to CPUs (without any attempt to balance), there's a quite high probability of the desirable 2/2 split occurring. So one needs to repeat the test several times to have reasonable assurance that the problem is not present. I.e. this has the characteristics of an intermittent bug, with all the debugging problems that introduces. The probabilities for the 3 split possibilities for random allocation are: 2/2 (the desired outcome) is 3/8 likely, 1/3 is 4/8 likely, and 0/4 is 1/8 likely.

I'm pretty sure that this problem wasn't present when smpnice went into the kernel, which is the last time I did a lot of load balance testing.

Peter
--
Peter Williams [EMAIL PROTECTED] "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce
Re: [patch] CFS scheduler, -v12
Ingo Molnar wrote: * Peter Williams <[EMAIL PROTECTED]> wrote: As usual, any sort of feedback, bugreport, fix and suggestion is more than welcome, Load balancing appears to be badly broken in this version. When I started 4 hard spinners on my 2 CPU machine one ended up on one CPU and the other 3 on the other CPU and they stayed there. hm, i cannot reproduce this on 4 different SMP boxen, trying various combinations of SCHED_SMT/MC

You may need to try more than once. Testing load balancing can be a pain as there's always a possibility you'll get a good result just by chance. I.e. you need a bunch of good results to say it's OK but only one bad result to say it's broken.

and other .config options that might make a difference to balancing. Could you send me your .config?

Sent separately.

Peter
Re: [patch] CFS scheduler, -v12
Ingo Molnar wrote: i'm pleased to announce release -v12 of the CFS scheduler patchset. The CFS patch against v2.6.22-rc1, v2.6.21.1 or v2.6.20.10 can be downloaded from the usual place: http://people.redhat.com/mingo/cfs-scheduler/

-v12 fixes the '3D bug' that caused trivial latencies in 3D games: it turns out that the problem was not resulting out of any core quality of CFS, it was caused by 3D userspace growing dependent on the current inefficiency of the vanilla scheduler's sys_sched_yield() implementation, and CFS's "make yield work well" changes broke it. Even a simple 3D app like glxgears does a sys_sched_yield() for every frame it generates (!) on certain 3D cards, which in essence punishes any scheduler that implements sys_sched_yield() in a sane manner. This interaction of CFS's yield implementation with this user-space bug could be the main reason why some testers reported SD to be handling 3D games better than CFS. (SD uses a yield implementation similar to the vanilla scheduler.) So i've added a yield workaround to -v12, which makes it work similar to how the vanilla scheduler and SD does it. (Xorg has been notified and this bug should be fixed there too. This took some time to debug because the 3D driver i'm using for testing does not use sys_sched_yield().) The workaround is activated by default so -v12 should work 'out of the box'.

Mike Galbraith has fixed a bug related to nice levels - the fix should make negative nice levels more potent again.

Changes since -v10:
- nice level calculation fixes (Mike Galbraith)
- load-balancing improvements (this should fix the SMP performance problem reported by Michael Gerdau)
- remove the sched_sleep_history_max tunable.
- more debugging fields.
- various cleanups, fixlets and code reorganization

As usual, any sort of feedback, bugreport, fix and suggestion is more than welcome,

Load balancing appears to be badly broken in this version.
When I started 4 hard spinners on my 2 CPU machine one ended up on one CPU and the other 3 on the other CPU and they stayed there.

Peter
Re: [patch] CFS scheduler, -v8
Esben Nielsen wrote: On Tue, 8 May 2007, Peter Williams wrote: Esben Nielsen wrote: On Sun, 6 May 2007, Linus Torvalds wrote: On Sun, 6 May 2007, Ingo Molnar wrote: * Linus Torvalds [EMAIL PROTECTED] wrote: So the _only_ valid way to handle timers is to - either not allow wrapping at all (in which case "unsigned" is better, since it is bigger) - or use wrapping explicitly, and use unsigned arithmetic (which is well-defined in C) and do something like "(long)(a-b) > 0".

hm, there is a corner-case in CFS where a fix like this is necessary. CFS uses 64-bit values for almost everything, and the majority of values are of 'relative' nature with no danger of overflow. (They are signed because they are relative values that center around zero and can be negative or positive.)

Well, I'd like to just worry about that for a while. You say there is "no danger of overflow", and I mostly agree that once we're talking about 64-bit values, the overflow issue simply doesn't exist, and furthermore the difference between 63 and 64 bits is not really relevant, so there's no major reason to actively avoid signed entries. So in that sense, it all sounds perfectly sane. And I'm definitely not sure your "292 years after bootup" worry is really worth even considering.

I would hate to tell mission control for Mankind's first mission to another star to reboot every 200 years because "there is no need to worry about it." As a matter of principle an OS should never need a reboot (with exception for upgrading). If you say you have to reboot every 200 years, why not every 100? Every 50? Every 45 days (you know what I am referring to :-) ?

There's always going to be an upper limit on the representation of time. At least until we figure out how to implement infinity properly.

Well you need infinite memory for that :-) But that is my point: Why go into the problem of storing absolute time when you can use relative time?
I'd reverse that and say "Why go to the bother of using relative time when you can use absolute time?". Absolute time being time since boot, of course.

When we're really so well off that we expect the hardware and software stack to be stable over a hundred years, I'd start to think about issues like that, in the meantime, to me worrying about those kinds of issues just means that you're worrying about the wrong things. BUT. There's a fundamental reason relative timestamps are difficult and almost always have overflow issues: the "long long in the future" case as an approximation of "infinite timeout" is almost always relevant. So rather than worry about the system staying up 292 years, I'd worry about whether people pass in big numbers (like some MAX_S64 approximation) as an approximation for "infinite", and once you have things like that, the "64 bits never overflows" argument is totally bogus. There's a damn good reason for using only *absolute* time. The whole "signed values of relative time" may _sound_ good, but it really sucks in subtle and horrible ways!

I think you are wrong here. The only place you need absolute time is for the clock (CLOCK_REALTIME). You waste CPU using a 64 bit representation when you could have used a 32 bit. With a 32 bit implementation you are forced to handle the corner cases with wrap around and too big arguments up front. With a 64 bit you hide those problems.

As does the other method. A 32 bit signed offset with a 32 bit base is just a crude version of 64 bit absolute time. 64 bit is also relative - just over a much longer period.

Yes, relative to boot. 32 bit signed offset is relative - and you know it. But with 64 people think it is absolute and put in large values as Linus said above.

What people? Who gets to feed times into the scheduler? Isn't it just using the time as determined by the system?

With 32 bit future developers will know it is relative and code for it.
And they will get their corner cases tested, because the code soon will run into those corners. I think CFS would be best off using a 32 bit timer counting in micro seconds. That would wrap around in 72 minutes. But as the timers are relative you will never be able to specify a timer larger than 36 minutes in the future. But 36 minutes is ridiculously long for a scheduler and a simple test limiting time values to that value would not break anything.

Except if you're measuring sleep times. I think that you'll find lots of tasks sleep for more than 72 minutes.

I don't think those large values will be relevant. You can easily cut off sleep times at 30 min or even 1 min.

The aim is to make the code as simple as possible, not add this kind of rubbish, and 1 minute would be far too low.

But you need to detect that the task has indeed been sleeping 2^32+1 usec and not 1 usec. You can't do
Re: [patch] CFS scheduler, -v8
Esben Nielsen wrote: On Sun, 6 May 2007, Linus Torvalds wrote: On Sun, 6 May 2007, Ingo Molnar wrote: * Linus Torvalds <[EMAIL PROTECTED]> wrote: So the _only_ valid way to handle timers is to - either not allow wrapping at all (in which case "unsigned" is better, since it is bigger) - or use wrapping explicitly, and use unsigned arithmetic (which is well-defined in C) and do something like "(long)(a-b) > 0".

hm, there is a corner-case in CFS where a fix like this is necessary. CFS uses 64-bit values for almost everything, and the majority of values are of 'relative' nature with no danger of overflow. (They are signed because they are relative values that center around zero and can be negative or positive.)

Well, I'd like to just worry about that for a while. You say there is "no danger of overflow", and I mostly agree that once we're talking about 64-bit values, the overflow issue simply doesn't exist, and furthermore the difference between 63 and 64 bits is not really relevant, so there's no major reason to actively avoid signed entries. So in that sense, it all sounds perfectly sane. And I'm definitely not sure your "292 years after bootup" worry is really worth even considering.

I would hate to tell mission control for Mankind's first mission to another star to reboot every 200 years because "there is no need to worry about it." As a matter of principle an OS should never need a reboot (with exception for upgrading). If you say you have to reboot every 200 years, why not every 100? Every 50? Every 45 days (you know what I am referring to :-) ?

There's always going to be an upper limit on the representation of time. At least until we figure out how to implement infinity properly.

When we're really so well off that we expect the hardware and software stack to be stable over a hundred years, I'd start to think about issues like that, in the meantime, to me worrying about those kinds of issues just means that you're worrying about the wrong things. BUT.
There's a fundamental reason relative timestamps are difficult and almost always have overflow issues: the "long long in the future" case as an approximation of "infinite timeout" is almost always relevant. So rather than worry about the system staying up 292 years, I'd worry about whether people pass in big numbers (like some MAX_S64 approximation) as an approximation for "infinite", and once you have things like that, the "64 bits never overflows" argument is totally bogus. There's a damn good reason for using only *absolute* time. The whole "signed values of relative time" may _sound_ good, but it really sucks in subtle and horrible ways!

I think you are wrong here. The only place you need absolute time is for the clock (CLOCK_REALTIME). You waste CPU using a 64 bit representation when you could have used a 32 bit. With a 32 bit implementation you are forced to handle the corner cases with wrap around and too big arguments up front. With a 64 bit you hide those problems.

As does the other method. A 32 bit signed offset with a 32 bit base is just a crude version of 64 bit absolute time.

I think CFS would be best off using a 32 bit timer counting in micro seconds. That would wrap around in 72 minutes. But as the timers are relative you will never be able to specify a timer larger than 36 minutes in the future. But 36 minutes is ridiculously long for a scheduler and a simple test limiting time values to that value would not break anything.

Except if you're measuring sleep times. I think that you'll find lots of tasks sleep for more than 72 minutes.

Peter
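[Editor's sketch: the "(long)(a-b) > 0" idiom discussed above, in standalone 32-bit form. The kernel expresses the same thing with the time_after()/time_before() macros in include/linux/jiffies.h; the helper name here is illustrative only.]

```c
#include <stdint.h>

/* The "(long)(a-b) > 0" idiom in 32-bit form: unsigned subtraction is
 * well-defined across wraparound, and interpreting the difference as
 * signed gives the right ordering provided the two timestamps are less
 * than half the counter range (2^31 ticks) apart. */
static int time_after32(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) > 0;
}
```

So a wrapping counter (e.g. the 32-bit microsecond timer mooted above, which wraps every ~72 minutes) still compares correctly as long as compared values are within half the range of each other — which is exactly where the 36-minute limit in the argument above comes from.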
[ANNOUNCE][RFC] PlugSched-6.5.1 for 2.6.21
The main change in this version is to fix bugs introduced to SPA schedulers during modifications to handle a recent change to the scheduler driver interface to take account of recent changes to the load balancing code. This patch also includes a patch to sis900 code to enable it to boot on my system for testing (patch supplied by Neil Horman).

A patch for 2.6.21 is available at: <http://downloads.sourceforge.net/cpuse/plugsched-6.5.1-for-2.6.21.patch> and a quilt/gquilt patch series is available at: <http://downloads.sourceforge.net/cpuse/plugsched-6.5.1-for-2.6.21.patch-series.tar.gz>

Very Brief Documentation: You can select a default scheduler at kernel build time. If you wish to boot with a scheduler other than the default it can be selected at boot time by adding: cpusched=<scheduler> to the boot command line where <scheduler> is one of: ingosched, ingo_ll, nicksched, staircase, spa_no_frills, spa_ws, spa_svr, spa_ebs or zaphod. If you don't change the default when you build the kernel the default scheduler will be ingosched (which is the normal scheduler).

The scheduler in force on a running system can be determined by the contents of: /proc/scheduler

Control parameters for the scheduler can be read/set via files in: /sys/cpusched/<scheduler>/

Peter
Re: Linux-2.6.21 hangs during post boot initialization phase
Neil Horman wrote: On Sat, Apr 28, 2007 at 12:28:28AM +1000, Peter Williams wrote: Neil Horman wrote: On Fri, Apr 27, 2007 at 04:05:11PM +1000, Peter Williams wrote: Damn, This is what happens when I try to do things too quickly. I missed one spot in my last patch where I replaced skb with rx_skb. Its not critical, but it should improve sis900 performance by quite a bit. This applies on top of the last two patches. Sorry about that. Thanks & Regards Neil Signed-off-by: Neil Horman <[EMAIL PROTECTED]>

 sis900.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/sis900.c b/drivers/net/sis900.c
index 7e44939..db59dce 100644
--- a/drivers/net/sis900.c
+++ b/drivers/net/sis900.c
@@ -1790,7 +1790,7 @@ static int sis900_rx(struct net_device *net_dev)
 	/* give the socket buffer to upper layers */
 	rx_skb = sis_priv->rx_skbuff[entry];
 	skb_put(rx_skb, rx_size);
-	skb->protocol = eth_type_trans(rx_skb, net_dev);
+	rx_skb->protocol = eth_type_trans(rx_skb, net_dev);
 	netif_rx(rx_skb);
 	/* some network statistics */

My system also boots OK after I add this patch. Can't tell whether it's improved the performance or not.

Peter
Re: Linux-2.6.21 hangs during post boot initialization phase
Neil Horman wrote: On Fri, Apr 27, 2007 at 04:05:11PM +1000, Peter Williams wrote: Linus Torvalds wrote: On Fri, 27 Apr 2007, Peter Williams wrote: The 2.6.21 kernel is hanging during the post boot phase where various daemons are being started (not always the same daemon unfortunately). This problem was not present in 2.6.21-rc7 and there is no oops or other unusual output in the system log at the time the hang occurs. Can you use "git bisect" to narrow it down a bit more? It's only 125 commits, so bisecting even just three or four kernels will narrow it down to a handful.

As the changes became smaller, the builds became quicker :-) and after 7 iterations we have:

commit b748d9e3b80dc7e6ce6bf7399f57964b99a4104c
author Neil Horman <[EMAIL PROTECTED]> Fri, 20 Apr 2007 13:54:58 +0000 (09:54 -0400)
committer Jeff Garzik <[EMAIL PROTECTED]> Tue, 24 Apr 2007 16:43:07 +0000 (12:43 -0400)
tree 887909e1f735bb444ef0e3e370f34401fa6eee02
parent d91c088b39e3c66d309938de858775bb90fd1ead

sis900: Allocate rx replacement buffer before rx operation

The sis900 driver appears to have a bug in which the receive routine passes the skbuff holding the received frame to the network stack before refilling the buffer in the rx ring. If a new skbuff cannot be allocated, the driver simply leaves a hole in the rx ring, which causes the driver to stop receiving frames and become non-recoverable without an rmmod/insmod according to reporters. This patch reverses that order, attempting to allocate a replacement buffer first, and receiving the new frame only if one can be allocated. If no skbuff can be allocated, the current skbuf in the rx ring is recycled, dropping the current frame, but keeping the NIC operational.

Signed-off-by: Neil Horman <[EMAIL PROTECTED]>
Signed-off-by: Jeff Garzik <[EMAIL PROTECTED]>

Peter
--
Peter Williams [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce This was reported to me last night, and I've posted a patch to fix it; it's available here: http://marc.info/?l=linux-netdev&m=117761259222165&w=2 It applies on top of the previous patch, and should fix your problem. Here's a copy of the patch. Thanks & Regards Neil diff --git a/drivers/net/sis900.c b/drivers/net/sis900.c index a6a0f09..7e44939 100644 --- a/drivers/net/sis900.c +++ b/drivers/net/sis900.c @@ -1754,6 +1754,7 @@ static int sis900_rx(struct net_device *net_dev) sis_priv->rx_ring[entry].cmdsts = RX_BUF_SIZE; } else { struct sk_buff * skb; + struct sk_buff * rx_skb; pci_unmap_single(sis_priv->pci_dev, sis_priv->rx_ring[entry].bufptr, RX_BUF_SIZE, @@ -1787,10 +1788,10 @@ static int sis900_rx(struct net_device *net_dev) } /* give the socket buffer to upper layers */ - skb = sis_priv->rx_skbuff[entry]; - skb_put(skb, rx_size); - skb->protocol = eth_type_trans(skb, net_dev); - netif_rx(skb); + rx_skb = sis_priv->rx_skbuff[entry]; + skb_put(rx_skb, rx_size); + skb->protocol = eth_type_trans(rx_skb, net_dev); + netif_rx(rx_skb); /* some network statistics */ if ((rx_status & BCAST) == MCAST) This patch fixes the problem for me. Peter -- Peter Williams [EMAIL PROTECTED] "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce
Re: Linux-2.6.21 hangs during post boot initialization phase
Linus Torvalds wrote: On Fri, 27 Apr 2007, Peter Williams wrote: The 2.6.21 kernel is hanging during the post boot phase where various daemons are being started (not always the same daemon unfortunately). This problem was not present in 2.6.21-rc7 and there is no oops or other unusual output in the system log at the time the hang occurs. Can you use "git bisect" to narrow it down a bit more? It's only 125 commits, so bisecting even just three or four kernels will narrow it down to a handful. As the changes became smaller, the builds became quicker :-) and after 7 iterations we have: author Neil Horman <[EMAIL PROTECTED]> Fri, 20 Apr 2007 13:54:58 +0000 (09:54 -0400) committer Jeff Garzik <[EMAIL PROTECTED]> Tue, 24 Apr 2007 16:43:07 +0000 (12:43 -0400) commit b748d9e3b80dc7e6ce6bf7399f57964b99a4104c tree 887909e1f735bb444ef0e3e370f34401fa6eee02 parent d91c088b39e3c66d309938de858775bb90fd1ead sis900: Allocate rx replacement buffer before rx operation The sis900 driver appears to have a bug in which the receive routine passes the skbuff holding the received frame to the network stack before refilling the buffer in the rx ring. If a new skbuff cannot be allocated, the driver simply leaves a hole in the rx ring, which causes the driver to stop receiving frames and become non-recoverable without an rmmod/insmod according to reporters. This patch reverses that order, attempting to allocate a replacement buffer first, and receiving the new frame only if one can be allocated. If no skbuff can be allocated, the current skbuff in the rx ring is recycled, dropping the current frame, but keeping the NIC operational. Signed-off-by: Neil Horman <[EMAIL PROTECTED]> Signed-off-by: Jeff Garzik <[EMAIL PROTECTED]> Peter -- Peter Williams [EMAIL PROTECTED] "Learning, n. The kind of ignorance distinguishing the studious." 
-- Ambrose Bierce
Re: Linux-2.6.21 hangs during post boot initialization phase
Linus Torvalds wrote: On Fri, 27 Apr 2007, Peter Williams wrote: The 2.6.21 kernel is hanging during the post boot phase where various daemons are being started (not always the same daemon unfortunately). This problem was not present in 2.6.21-rc7 and there is no oops or other unusual output in the system log at the time the hang occurs. Can you use "git bisect" to narrow it down a bit more? It's only 125 commits, so bisecting even just three or four kernels will narrow it down to a handful. Yes. I'm just in the process of reading up on how to do the bisecting now. Should have an answer in a few hours, I guess. Peter -- Peter Williams [EMAIL PROTECTED] "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce
Re: [REPORT] cfs-v4 vs sd-0.44
Rogan Dawes wrote: Chris Friesen wrote: Rogan Dawes wrote: I guess my point was if we somehow get to an odd number of nanoseconds, we'd end up with rounding errors. I'm not sure if your algorithm will ever allow that. And Ingo's point was that when it takes thousands of nanoseconds for a single context switch, an error of half a nanosecond is down in the noise. Chris My concern was that since Ingo said that this is a closed economy, with a fixed sum/total, if we lose a nanosecond here and there, eventually we'll lose them all. Some folks have uptimes of multiple years. Of course, I could (very likely!) be full of it! ;-) And they won't be using any new scheduler on those computers anyhow, as that would involve bringing the system down to install the new kernel. :-) Peter -- Peter Williams [EMAIL PROTECTED] "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce
Re: [REPORT] cfs-v4 vs sd-0.44
Arjan van de Ven wrote: Within reason, it's not the number of clients that X has that causes its CPU bandwidth use to sky rocket and cause problems. It's more to do with what type of clients they are. Most GUIs (even ones that are constantly updating visual data (e.g. gkrellm -- I can open quite a large number of these without increasing X's CPU usage very much)) cause very little load on the X server. The exceptions to this are the ... There are actually 2 and not just 1 "X server", and they are VERY VERY different in behavior. Case 1: Accelerated driver If X talks to a decent enough card that it supports well with acceleration, it will be very rare for X itself to spend any kind of significant amount of CPU time; all the really heavy stuff is done in hardware, and asynchronously at that. A bit of batching will greatly improve system performance in this case. Case 2: Unaccelerated VESA Some drivers in X, especially the VESA and NV drivers (which are quite common, vesa is used on all hardware without a special driver nowadays), have no or not enough acceleration to matter for modern desktops. This means the CPU is doing all the heavy lifting, in the X program. In this case even a simple "move the window a bit" becomes quite a bit of a CPU hog already. Mine's a: SiS 661/741/760 PCI/AGP or 662/761Gx PCIE VGA Display adapter according to X's display settings tool. Which category does that fall into? It's not a special adapter and is just the one that came with the motherboard. It doesn't use much CPU unless I grab a window and wiggle it all over the screen or do something like "ls -lR /" in an xterm. The cases are fundamentally different in behavior, because in the first case, X hardly consumes the time it would get in any scheme, while in the second case X really is CPU bound and will happily consume any CPU time it can get. Which still doesn't justify an elaborate "points" sharing scheme. 
Whichever way you look at it, that's just another way of giving X more CPU bandwidth, and there are simpler ways to give X more CPU if it needs it. However, I think there's something seriously wrong if it needs the -19 nice that I've heard mentioned. You might as well just run it as a real time process. Peter -- Peter Williams [EMAIL PROTECTED] "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce
Re: [REPORT] cfs-v4 vs sd-0.44
Linus Torvalds wrote: On Mon, 23 Apr 2007, Ingo Molnar wrote: The "give scheduler money" transaction can be both an "implicit transaction" (for example when writing to UNIX domain sockets or blocking on a pipe, etc.), or it could be an "explicit transaction": sched_yield_to(). This latter i've already implemented for CFS, but it's much less useful than the really significant implicit ones, the ones which will help X. Yes. It would be wonderful to get it working automatically, so please say something about the implementation.. The "perfect" situation would be that when somebody goes to sleep, any extra points it had could be given to whoever it woke up last. Note that for something like X, it means that the points are 100% ephemeral: it gets points when a client sends it a request, but it would *lose* the points again when it sends the reply! So it would only accumulate "scheduling points" while multiple clients are actively waiting for it, which actually sounds like exactly the right thing. However, I don't really see how to do it well, especially since the kernel cannot actually match up the client that gave some scheduling points to the reply that X sends back. There are subtle semantics with these kinds of things: especially if the scheduling points are only awarded when a process goes to sleep, if X is busy and continues to use the CPU (for another client), it wouldn't give any scheduling points back to clients and they really do accumulate with the server. Which again sounds like it would be exactly the right thing (both in the sense that the server that runs more gets more points, but also in the sense that we *only* give points at actual scheduling events). But how do you actually *give/track* points? A simple "last woken up by this process" thing that triggers when it goes to sleep? It might work, but on the other hand, especially with more complex things (and networking tends to be pretty complex) the actual wakeup may be done by a software irq. 
Do we just say "it ran within the context of X, so we assume X was the one that caused it?" It probably would work, but we've generally tried very hard to avoid accessing "current" from interrupt context, including bh's. Within reason, it's not the number of clients that X has that causes its CPU bandwidth use to sky rocket and cause problems. It's more to do with what type of clients they are. Most GUIs (even ones that are constantly updating visual data (e.g. gkrellm -- I can open quite a large number of these without increasing X's CPU usage very much)) cause very little load on the X server. The exceptions to this are the various terminal emulators (e.g. xterm, gnome-terminal, etc.) when being used to run output intensive command line programs, e.g. try "ls -lR /" in an xterm. The other way (that I've noticed) that X's CPU bandwidth usage sky rockets is when you grab a large window and wiggle it about a lot, and hopefully this doesn't happen a lot, so the problem that needs to be addressed is the one caused by text output on xterm and its ilk. So I think that an elaborate scheme for distributing "points" between X and its clients would be overkill. A good scheduler will make sure other tasks such as audio streamers get CPU when they need it with good responsiveness even when X takes off, by giving them higher priority because their CPU bandwidth use is low. The one problem that might still be apparent in these cases is the mouse becoming jerky while X is working like crazy to spew out text too fast for anyone to read. But the only way to fix that is to give X more bandwidth, and if it's already running at about 95% of a CPU that's unlikely to help. To fix this you would probably need to modify X so that it knows re-rendering the cursor is more important than rendering text in an xterm. 
In normal circumstances, the re-rendering of the mouse happens quickly enough for the user to experience good responsiveness because X's normal CPU use is low enough for it to be given high priority. Just because the O(1) scheduler tried this model and failed doesn't mean that the model is bad. O(1) was a flawed implementation of a good model. Peter PS Doing a kernel build in an xterm isn't an example of high enough output to cause a problem as (on my system) it only raises X's consumption from 0-2% to 2-5%. The type of output that causes the problem is usually flying past too fast to read. -- Peter Williams [EMAIL PROTECTED] "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce