Re: [git] CFS-devel, group scheduler, fixes
On Mon, 24 Sep 2007, Peter Zijlstra wrote:

> On Mon, 24 Sep 2007 13:22:14 +0200 Mike Galbraith <[EMAIL PROTECTED]> wrote:
> > On Mon, 2007-09-24 at 12:42 +0200, Mike Galbraith wrote:
> > > On Mon, 2007-09-24 at 12:24 +0200, Peter Zijlstra wrote:
> > > >
> > > > how about something like:
> > > >
> > > > 	s64 delta = (s64)(vruntime - min_vruntime);
> > > > 	if (delta > 0)
> > > > 		min_vruntime += delta;
> > > >
> > > > That would rid us of most of the funny conditionals there.
> > >
> > > That still left me with negative min_vruntimes.  The pinned hogs didn't
> > > lock my box up, but I quickly got the below, so hastily killed it.
> >
> > Shouldn't the max() in place_entity() be the max_vruntime() that my
> > lysdexia told me it was when I looked at it earlier? ;-)
>
> probably, my tree doesn't have that max anymore so I'm not sure.

Last time I looked, I thought the max() in place_entity() was fine and the
problem seemed to be in set_task_cpu() that was causing the negative
vruntimes:

	if (likely(new_rq->cfs.min_vruntime))
		p->se.vruntime -= old_rq->cfs.min_vruntime -
				  new_rq->cfs.min_vruntime;

I think it's fine we get rid of sync_vruntime(). I need to think more about
how to make global fairness work based on this--it seems to be more
complicated than if we had sync_vruntime().

tong

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
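The set_task_cpu() adjustment quoted above rebases a migrating task's vruntime by the difference between the two queues' min_vruntime values, keeping its relative position on the timeline constant. A minimal userspace sketch of that idea (the helper name is invented for illustration; like the kernel's, the arithmetic is modular u64):

```c
#include <stdint.h>

/* Sketch of the cross-queue rebase under discussion: keep the task's
 * offset from min_vruntime constant while moving it from the old
 * runqueue's timeline to the new one's.  This mirrors
 * p->se.vruntime -= old_min - new_min; it is not the kernel code. */
static inline uint64_t rebase_vruntime(uint64_t vruntime,
                                       uint64_t old_min, uint64_t new_min)
{
	return vruntime - (old_min - new_min);
}
```

Because everything is modular, the relative offset `vruntime - min_vruntime` is preserved even when one of the clocks has wrapped.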
Re: [git] CFS-devel, group scheduler, fixes
On Mon, 24 Sep 2007 13:22:14 +0200 Mike Galbraith <[EMAIL PROTECTED]> wrote:

> On Mon, 2007-09-24 at 12:42 +0200, Mike Galbraith wrote:
> > On Mon, 2007-09-24 at 12:24 +0200, Peter Zijlstra wrote:
> > >
> > > how about something like:
> > >
> > > 	s64 delta = (s64)(vruntime - min_vruntime);
> > > 	if (delta > 0)
> > > 		min_vruntime += delta;
> > >
> > > That would rid us of most of the funny conditionals there.
> >
> > That still left me with negative min_vruntimes.  The pinned hogs didn't
> > lock my box up, but I quickly got the below, so hastily killed it.
>
> Shouldn't the max() in place_entity() be the max_vruntime() that my
> lysdexia told me it was when I looked at it earlier? ;-)

probably, my tree doesn't have that max anymore so I'm not sure.
Re: [git] CFS-devel, group scheduler, fixes
On Mon, 2007-09-24 at 13:08 +0200, Peter Zijlstra wrote:

> It's perfectly valid for min_vruntime to exist in 1ULL << 63.

But wrap backward timewarp is what's killing my box.

	-Mike
Re: [git] CFS-devel, group scheduler, fixes
On Mon, 2007-09-24 at 12:42 +0200, Mike Galbraith wrote:

> On Mon, 2007-09-24 at 12:24 +0200, Peter Zijlstra wrote:
> >
> > how about something like:
> >
> > 	s64 delta = (s64)(vruntime - min_vruntime);
> > 	if (delta > 0)
> > 		min_vruntime += delta;
> >
> > That would rid us of most of the funny conditionals there.
>
> That still left me with negative min_vruntimes.  The pinned hogs didn't
> lock my box up, but I quickly got the below, so hastily killed it.

Shouldn't the max() in place_entity() be the max_vruntime() that my
lysdexia told me it was when I looked at it earlier? ;-)

	-Mike
Re: [git] CFS-devel, group scheduler, fixes
On Mon, 24 Sep 2007 12:42:15 +0200 Mike Galbraith <[EMAIL PROTECTED]> wrote:

> On Mon, 2007-09-24 at 12:24 +0200, Peter Zijlstra wrote:
> >
> > how about something like:
> >
> > 	s64 delta = (s64)(vruntime - min_vruntime);
> > 	if (delta > 0)
> > 		min_vruntime += delta;
> >
> > That would rid us of most of the funny conditionals there.
>
> That still left me with negative min_vruntimes.  The pinned hogs didn't
> lock my box up, but I quickly got the below, so hastily killed it.
>
>  se.wait_max              :             7.846949
>  se.wait_max              :           301.951601
>  se.wait_max              :             7.071359

Odd. The idea (which I think is clear) is that min_vruntime can wrap
around the u64 spectrum, and by using min_vruntime as the offset to base
the key around, we get a signed but limited-range key-space (because we
update min_vruntime to track the leftmost task, in a monotonic fashion).

So I'm having trouble with these patches; that is, both your wrap-around
condition:

	if (likely(new_rq->cfs.min_vruntime))

as well as the last patchlet:

	if (((s64)vruntime > (s64)min_vruntime) ||

in that neither of these changes makes sense in what it's trying to do.
It's perfectly valid for min_vruntime to exist in 1ULL << 63.
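Peter's point about the wrapping keyspace can be demonstrated outside the kernel: compare u64 clock values through the sign of their difference, and ordering survives even when the values sit near, or wrap past, 1ULL << 63. A minimal sketch (the helper name is invented for illustration):

```c
#include <stdint.h>

/* Wrap-safe ordering for u64 virtual clocks: a is "after" b iff the
 * signed difference is positive.  Correct as long as the two values are
 * within 2^63 of each other, which the monotonic min_vruntime update
 * is meant to guarantee. */
static inline int vruntime_after(uint64_t a, uint64_t b)
{
	return (int64_t)(a - b) > 0;
}
```

This is why a raw `vruntime > min_vruntime` or a check like `if (likely(new_rq->cfs.min_vruntime))` is suspect: both treat the absolute u64 value as meaningful, when only the signed distance between two clocks is.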
Re: [git] CFS-devel, group scheduler, fixes
On Mon, 2007-09-24 at 12:24 +0200, Peter Zijlstra wrote:

> how about something like:
>
> 	s64 delta = (s64)(vruntime - min_vruntime);
> 	if (delta > 0)
> 		min_vruntime += delta;
>
> That would rid us of most of the funny conditionals there.

That still left me with negative min_vruntimes.  The pinned hogs didn't
lock my box up, but I quickly got the below, so hastily killed it.

 se.wait_max              :             7.846949
 se.wait_max              :           301.951601
 se.wait_max              :             7.071359

	-Mike
Re: [git] CFS-devel, group scheduler, fixes
On Mon, 24 Sep 2007 12:10:09 +0200 Mike Galbraith <[EMAIL PROTECTED]> wrote:

> @@ -117,7 +117,7 @@ static inline struct task_struct *task_o
>  static inline u64
>  max_vruntime(u64 min_vruntime, u64 vruntime)
>  {
> -	if ((vruntime > min_vruntime) ||
> +	if (((s64)vruntime > (s64)min_vruntime) ||
>  	    (min_vruntime > (1ULL << 61) && vruntime < (1ULL << 50)))
>  		min_vruntime = vruntime;
>

how about something like:

	s64 delta = (s64)(vruntime - min_vruntime);
	if (delta > 0)
		min_vruntime += delta;

That would rid us of most of the funny conditionals there.
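Peter's suggestion is effectively a wrap-safe, monotonic max: min_vruntime only ever moves forward, by the signed distance to vruntime, and only when that distance is positive. A standalone sketch (the function name is invented for illustration):

```c
#include <stdint.h>

/* Advance min_vruntime to vruntime iff vruntime is ahead of it in
 * wrap-safe signed terms; otherwise leave it alone.  Equivalent to
 * min_vruntime += max(0, (s64)(vruntime - min_vruntime)). */
static inline uint64_t advance_min_vruntime(uint64_t min_vruntime,
                                            uint64_t vruntime)
{
	int64_t delta = (int64_t)(vruntime - min_vruntime);

	if (delta > 0)
		min_vruntime += delta;
	return min_vruntime;
}
```

Note the monotonicity holds across a u64 wrap: from near UINT64_MAX, a small positive delta carries min_vruntime forward through zero rather than backward.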
Re: [git] CFS-devel, group scheduler, fixes
On Sun, 2007-09-23 at 23:21 -0700, Tong Li wrote:

> On Sun, 23 Sep 2007, Mike Galbraith wrote:
> > On Sat, 2007-09-22 at 12:01 +0200, Mike Galbraith wrote:
> > > On Fri, 2007-09-21 at 20:27 -0700, Tong Li wrote:
> > > > Mike,
> > > >
> > > > Could you try this patch to see if it solves the latency problem?
> > >
> > > No, but it helps some when running two un-pinned busy loops, one at nice
> > > 0, and the other at nice 19.  Yesterday I hit latencies of up to 1.2
> > > _seconds_ doing this, and logging sched_debug and /proc/`pidof
> > > Xorg`/sched from SCHED_RR shells.
> >
> > Looking at a log (snippet attached) from this morning with the last hunk
> > of your patch still intact, it looks like min_vruntime is being modified
> > after a task is queued.  If you look at the snippet, you'll see the nice
> > 19 bash busy loop on CPU1 with a vruntime of 3010385.345325, and one
> > second later on CPU1 with its vruntime at 2814952.425082, but
> > min_vruntime is 3061874.838356.
>
> I think this could be what was happening: between the two seconds, CPU 0
> becomes idle and it pulls the nice 19 task over via pull_task(), which
> calls set_task_cpu(), which changes the task's vruntime to the current
> min_vruntime of CPU 0 (in my patch).  Then, after set_task_cpu(), CPU 0
> calls activate_task(), which calls enqueue_task() and in turn
> update_curr().  Now, nr_running on CPU 0 is 0, so sync_vruntime() gets
> called and CPU 0's min_vruntime gets synced to the system max.  Thus, the
> nice 19 task now has a vruntime less than CPU 0's min_vruntime.  I think
> this can be fixed by adding the following code in set_task_cpu() before we
> adjust p->vruntime:
>
> 	if (!new_rq->cfs.nr_running)
> 		sync_vruntime(new_rq);

Hmm.  I can imagine Mondo-Boxen-R-Us folks getting upset with that.
Better would be, as Ingo said, to see if we can toss sync_vruntime(), and
I've been playing with that...  I found something this morning, and as
usual, the darn thing turned out to be dirt simple.

With sync_vruntime() disabled, I found queues with negative min_vruntime
right from boot, and went hunting.  Adding some instrumentation to
set_task_cpu() (annoying consequences), I found the below:

Legend:
  vrun: task's vruntime
  old:  old queue's min_vruntime
  new:  new queue's min_vruntime
  res:  what's gonna happen

[   60.214508] kseriod   vrun:      1427596999 old:  15070988657 new:   4065818654 res:  -9577573004
[  218.274661] konqueror vrun:    342076210254 old: 658982966338 new: 219203403598 res: -97703352486
[  218.284657] init      vrun:    411638725179 old: 659187735208 new: 219203492472 res: -28345517557
[...]

A task which hasn't run in long enough for queues to have digressed
further than its vruntime is going to end up with a negative vruntime.
Looking at place_entity(), it looks like it's supposed to fix that up,
but due to the order of arguments passed to max_vruntime(), and the
unsigned comparison therein, it won't.

Running with the patchlet below, my box _so far_ has not become terminally
unhappy despite spread0.  I'm writing this with the pinned hogs test
running right now, and all is well, so I _think_ it might be ok to just
remove sync_vruntime() after all.

diff -uprNX /root/dontdiff git/linux-2.6.sched-devel/kernel/sched_fair.c linux-2.6.23-rc7.d/kernel/sched_fair.c
--- git/linux-2.6.sched-devel/kernel/sched_fair.c	2007-09-23 14:48:18.000000000 +0200
+++ linux-2.6.23-rc7.d/kernel/sched_fair.c	2007-09-24 11:02:05.000000000 +0200
@@ -117,7 +117,7 @@ static inline struct task_struct *task_o
 static inline u64
 max_vruntime(u64 min_vruntime, u64 vruntime)
 {
-	if ((vruntime > min_vruntime) ||
+	if (((s64)vruntime > (s64)min_vruntime) ||
 	    (min_vruntime > (1ULL << 61) && vruntime < (1ULL << 50)))
 		min_vruntime = vruntime;
 
@@ -310,7 +310,7 @@ static void update_curr(struct cfs_rq *c
 	unsigned long delta_exec;
 
 	if (unlikely(!cfs_rq->nr_running))
-		return sync_vruntime(cfs_rq);
+		return ;//sync_vruntime(cfs_rq);
 	if (unlikely(!curr))
 		return;
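The order-of-arguments and unsigned-comparison failure Mike describes is easy to reproduce in isolation: a "negative" vruntime stored in a u64 looks enormous to an unsigned compare, so a plain max wrongly keeps it instead of clamping it up to min_vruntime. A sketch of both variants (the function names are invented; the second mirrors the patchlet's signed compare):

```c
#include <stdint.h>

/* Plain unsigned max: a negative vruntime, stored as u64 two's
 * complement, compares as huge and wrongly "wins". */
static inline uint64_t max_vruntime_unsigned(uint64_t min_vruntime,
                                             uint64_t vruntime)
{
	if (vruntime > min_vruntime)
		min_vruntime = vruntime;
	return min_vruntime;
}

/* Signed compare, as in the patchlet: the negative vruntime loses,
 * so the entity gets clamped up to min_vruntime as place_entity()
 * intends. */
static inline uint64_t max_vruntime_signed(uint64_t min_vruntime,
                                           uint64_t vruntime)
{
	if ((int64_t)vruntime > (int64_t)min_vruntime)
		min_vruntime = vruntime;
	return min_vruntime;
}
```

Feeding in the kseriod numbers from the log above shows the difference: the unsigned version returns the huge "negative" value, the signed one clamps to the sane clock.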
Re: [git] CFS-devel, group scheduler, fixes
On Sun, 23 Sep 2007, Mike Galbraith wrote:

> On Sat, 2007-09-22 at 12:01 +0200, Mike Galbraith wrote:
> > On Fri, 2007-09-21 at 20:27 -0700, Tong Li wrote:
> > > Mike,
> > >
> > > Could you try this patch to see if it solves the latency problem?
> >
> > No, but it helps some when running two un-pinned busy loops, one at nice
> > 0, and the other at nice 19.  Yesterday I hit latencies of up to 1.2
> > _seconds_ doing this, and logging sched_debug and /proc/`pidof
> > Xorg`/sched from SCHED_RR shells.
>
> Looking at a log (snippet attached) from this morning with the last hunk
> of your patch still intact, it looks like min_vruntime is being modified
> after a task is queued.  If you look at the snippet, you'll see the nice
> 19 bash busy loop on CPU1 with a vruntime of 3010385.345325, and one
> second later on CPU1 with its vruntime at 2814952.425082, but
> min_vruntime is 3061874.838356.

I think this could be what was happening: between the two seconds, CPU 0
becomes idle and it pulls the nice 19 task over via pull_task(), which
calls set_task_cpu(), which changes the task's vruntime to the current
min_vruntime of CPU 0 (in my patch).  Then, after set_task_cpu(), CPU 0
calls activate_task(), which calls enqueue_task() and in turn
update_curr().  Now, nr_running on CPU 0 is 0, so sync_vruntime() gets
called and CPU 0's min_vruntime gets synced to the system max.  Thus, the
nice 19 task now has a vruntime less than CPU 0's min_vruntime.

I think this can be fixed by adding the following code in set_task_cpu()
before we adjust p->vruntime:

	if (!new_rq->cfs.nr_running)
		sync_vruntime(new_rq);

 static void sync_vruntime(struct cfs_rq *cfs_rq)
 {
 	struct rq *rq;
-	int cpu;
+	int cpu, wrote = 0;
 
 	for_each_online_cpu(cpu) {
 		rq = &per_cpu(runqueues, cpu);
+		if (spin_is_locked(&rq->lock))
+			continue;
+		smp_wmb();
 		cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime,
 						    rq->cfs.min_vruntime);
+		wrote++;
 	}
-	schedstat_inc(cfs_rq, nr_sync_min_vruntime);
+	if (wrote)
+		schedstat_inc(cfs_rq, nr_sync_min_vruntime);
 }

I think this rq->lock check works because it prevents the above scenario
(CPU 0 is in pull_task() so it must hold the rq lock).  But my concern is
that it may be too conservative, since sync_vruntime() is called by
update_curr(), which mostly gets called in enqueue_task() and
dequeue_task(), both of which are often invoked with the rq lock being
held.  Thus, if we don't allow sync_vruntime() when the rq lock is held,
the sync will occur much less frequently and thus weaken cross-CPU
fairness.

tong
Re: [git] CFS-devel, group scheduler, fixes
Hi,

I haven't chased down the exact scenario, but using a min_vruntime which
is about to change definitely seems to be what's causing my latency woes.

Does the below cure your fairness woes as well?

(first bit is just some debug format changes for your convenience if you
try it)

diff -uprNX /root/dontdiff git/linux-2.6.sched-devel/kernel/sched_debug.c linux-2.6.23-rc7.d/kernel/sched_debug.c
--- git/linux-2.6.sched-devel/kernel/sched_debug.c	2007-09-22 13:37:32.000000000 +0200
+++ linux-2.6.23-rc7.d/kernel/sched_debug.c	2007-09-21 12:12:27.000000000 +0200
@@ -67,7 +67,7 @@ print_task(struct seq_file *m, struct rq
 		(long long)(p->nvcsw + p->nivcsw),
 		p->prio);
 #ifdef CONFIG_SCHEDSTATS
-	SEQ_printf(m, "%15Ld.%06ld %15Ld.%06ld %15Ld.%06ld\n",
+	SEQ_printf(m, "%9Ld.%06ld %9Ld.%06ld %9Ld.%06ld\n",
 		SPLIT_NS(p->se.vruntime),
 		SPLIT_NS(p->se.sum_exec_runtime),
 		SPLIT_NS(p->se.sum_sleep_runtime));
@@ -83,10 +83,10 @@ static void print_rq(struct seq_file *m,
 	SEQ_printf(m,
 	"\nrunnable tasks:\n"
-	"            task   PID         tree-key  switches  prio"
-	"     exec-runtime         sum-exec        sum-sleep\n"
+	"           task   PID        tree-key  switches  prio"
+	"    exec-runtime        sum-exec       sum-sleep\n"
 	"------------------------------------------------------------------"
-	"------------------------------------------------");
+	"------------------------------------------------\n");
 
 	read_lock_irq(&tasklist_lock);
 
@@ -134,7 +134,7 @@ void print_cfs_rq(struct seq_file *m, in
 	spread0 = min_vruntime - rq0_min_vruntime;
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "spread0",
 			SPLIT_NS(spread0));
-	SEQ_printf(m, "  .%-30s: %ld\n", "spread0",
+	SEQ_printf(m, "  .%-30s: %ld\n", "nr_sync_min_vruntime",
 			cfs_rq->nr_sync_min_vruntime);
 }

diff -uprNX /root/dontdiff git/linux-2.6.sched-devel/kernel/sched_fair.c linux-2.6.23-rc7.d/kernel/sched_fair.c
--- git/linux-2.6.sched-devel/kernel/sched_fair.c	2007-09-22 13:37:32.000000000 +0200
+++ linux-2.6.23-rc7.d/kernel/sched_fair.c	2007-09-23 12:53:11.000000000 +0200
@@ -290,14 +290,19 @@ __update_curr(struct cfs_rq *cfs_rq, str
 static void sync_vruntime(struct cfs_rq *cfs_rq)
 {
 	struct rq *rq;
-	int cpu;
+	int cpu, wrote = 0;
 
 	for_each_online_cpu(cpu) {
 		rq = &per_cpu(runqueues, cpu);
+		if (!spin_trylock(&rq->lock))
+			continue;
 		cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime,
 						    rq->cfs.min_vruntime);
+		spin_unlock(&rq->lock);
+		wrote++;
 	}
-	schedstat_inc(cfs_rq, nr_sync_min_vruntime);
+	if (wrote)
+		schedstat_inc(cfs_rq, nr_sync_min_vruntime);
 }
 
 static void update_curr(struct cfs_rq *cfs_rq)
@@ -306,8 +311,10 @@ static void update_curr(struct cfs_rq *c
 	u64 now = rq_of(cfs_rq)->clock;
 	unsigned long delta_exec;
 
-	if (unlikely(!curr))
+	if (unlikely(!cfs_rq->nr_running))
 		return sync_vruntime(cfs_rq);
+	if (unlikely(!curr))
+		return;
 
 	/*
 	 * Get the amount of time the current task was running
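The change from spin_is_locked() to spin_trylock() in the hunk above turns a racy peek into a real critical section: instead of observing that a peer's lock happens to be free and then reading its clock unlocked, the sampler either acquires the lock and reads consistently, or skips that CPU this round. A userspace sketch of the same pattern with POSIX mutexes (all names invented; this illustrates the locking idiom, not the scheduler itself):

```c
#include <pthread.h>
#include <stdint.h>

#define NPEERS 4

/* Each "peer" guards its clock with a mutex, like a per-runqueue lock. */
struct peer {
	pthread_mutex_t lock;
	uint64_t clock;
};

/* Sample the max of all peer clocks, skipping any peer whose lock is
 * contended rather than blocking on it (the spin_trylock() pattern).
 * Returns how many peers were actually read. */
static int sync_max_clock(struct peer peers[], int n, uint64_t *max_out)
{
	int i, read = 0;

	for (i = 0; i < n; i++) {
		if (pthread_mutex_trylock(&peers[i].lock) != 0)
			continue;	/* busy: skip this round */
		if (peers[i].clock > *max_out)
			*max_out = peers[i].clock;
		pthread_mutex_unlock(&peers[i].lock);
		read++;
	}
	return read;
}

/* Convenience for demonstration: build uncontended peers with the
 * given clocks, sample them, and return the max seen. */
static uint64_t demo_sample(const uint64_t clocks[], int n)
{
	struct peer peers[NPEERS];
	uint64_t max = 0;
	int i;

	for (i = 0; i < n; i++) {
		pthread_mutex_init(&peers[i].lock, NULL);
		peers[i].clock = clocks[i];
	}
	sync_max_clock(peers, n, &max);
	return max;
}
```

The trade-off Tong raises applies here too: skipping busy peers means a sample may be based on fewer clocks, so the result is a lower bound on the true max rather than an exact one.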
Re: [git] CFS-devel, group scheduler, fixes
On Sat, 2007-09-22 at 12:01 +0200, Mike Galbraith wrote:

> On Fri, 2007-09-21 at 20:27 -0700, Tong Li wrote:
> > Mike,
> >
> > Could you try this patch to see if it solves the latency problem?
>
> No, but it helps some when running two un-pinned busy loops, one at nice
> 0, and the other at nice 19.  Yesterday I hit latencies of up to 1.2
> _seconds_ doing this, and logging sched_debug and /proc/`pidof
> Xorg`/sched from SCHED_RR shells.

Looking at a log (snippet attached) from this morning with the last hunk
of your patch still intact, it looks like min_vruntime is being modified
after a task is queued.  If you look at the snippet, you'll see the nice
19 bash busy loop on CPU1 with a vruntime of 3010385.345325, and one
second later on CPU1 with its vruntime at 2814952.425082, but
min_vruntime is 3061874.838356.

I took a hammer to it, and my latencies running this test went away.

diff -uprNX /root/dontdiff git/linux-2.6.sched-devel/kernel/sched_fair.c linux-2.6.23-rc7.d/kernel/sched_fair.c
--- git/linux-2.6.sched-devel/kernel/sched_fair.c	2007-09-22 13:37:32.000000000 +0200
+++ linux-2.6.23-rc7.d/kernel/sched_fair.c	2007-09-23 08:29:38.000000000 +0200
@@ -290,14 +290,19 @@ __update_curr(struct cfs_rq *cfs_rq, str
 static void sync_vruntime(struct cfs_rq *cfs_rq)
 {
 	struct rq *rq;
-	int cpu;
+	int cpu, wrote = 0;
 
 	for_each_online_cpu(cpu) {
 		rq = &per_cpu(runqueues, cpu);
+		if (spin_is_locked(&rq->lock))
+			continue;
+		smp_wmb();
 		cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime,
 						    rq->cfs.min_vruntime);
+		wrote++;
 	}
-	schedstat_inc(cfs_rq, nr_sync_min_vruntime);
+	if (wrote)
+		schedstat_inc(cfs_rq, nr_sync_min_vruntime);
 }
 
 static void update_curr(struct cfs_rq *cfs_rq)
@@ -306,8 +311,10 @@ static void update_curr(struct cfs_rq *c
 	u64 now = rq_of(cfs_rq)->clock;
 	unsigned long delta_exec;
 
-	if (unlikely(!curr))
+	if (unlikely(!cfs_rq->nr_running))
 		return sync_vruntime(cfs_rq);
+	if (unlikely(!curr))
+		return;
 
 	/*
 	 * Get the amount of time the current task was running

Sched Debug Version: v0.05-v20, 2.6.23-rc7-smp-d #57
now at 682203.468012 msecs

cpu#0, 2992.611 MHz
  .nr_running                    : 1
  .load                          : 1024
  .nr_switches                   : 196616
  .nr_load_updates               : 118674
  .nr_uninterruptible            : 4294966769
  .jiffies                       : 382203
  .next_balance                  : 0.382295
  .curr->pid                     : 6322
  .clock                         : 118656.313628
  .idle_clock                    : 0.00
  .prev_clock_raw                : 705696.529916
  .clock_warps                   : 0
  .clock_overflows               : 171665
  .clock_deep_idle_events        : 0
  .clock_max_delta               : 0.999848
  .cpu_load[0]                   : 178561
  .cpu_load[1]                   : 134173
  .cpu_load[2]                   : 78694
  .cpu_load[3]                   : 42634
  .cpu_load[4]                   : 22523

cfs_rq
  .exec_clock                    : 71557.235360
  .MIN_vruntime                  : 0.01
  .min_vruntime                  : 2798832.982741
  .max_vruntime                  : 0.01
  .spread                        : 0.00
  .spread0                       : 0.00
  .nr_sync_min_vruntime          : 76120

runnable tasks:
           task   PID        tree-key  switches  prio    exec-runtime        sum-exec       sum-sleep
-----------------------------------------------------------------------------------------------------
R          bash  6322  2798832.982741       603   120  2798832.982741     4423.956624    18419.152603

cpu#1, 2992.611 MHz
  .nr_running                    : 2
  .load                          : 177537
  .nr_switches                   : 154505
  .nr_load_updates               : 121679
  .nr_uninterruptible            : 527
  .jiffies                       : 382203
  .next_balance                  : 0.382234
  .curr->pid                     : 6633
  .clock                         : 121660.572179
  .idle_clock                    : 0.00
  .prev_clock_raw                : 705696.537576
  .clock_warps                   : 0
  .clock_overflows               : 127876
  .clock_deep_idle_events        : 0
  .clock_max_delta               : 0.999848
  .cpu_load[0]                   : 177537
  .cpu_load[1]                   : 155347
  .cpu_load[2]                   : 102646
  .cpu_load[3]                   : 58613
  .cpu_load[4]                   : 31265

cfs_rq
  .exec_clock                    : 32989.995528
  .MIN_vruntime                  : 3010385.345325
  .min_vruntime                  : 3010385.345325
  .max_vruntime                  : 3010385.345325
  .spread
Re: [git] CFS-devel, group scheduler, fixes
Hi,

I haven't chased down the exact scenario, but using a min_vruntime which
is about to change definitely seems to be what's causing my latency
woes.  Does the below cure your fairness woes as well?

(first bit is just some debug format changes for your convenience if you
try it)

diff -uprNX /root/dontdiff git/linux-2.6.sched-devel/kernel/sched_debug.c linux-2.6.23-rc7.d/kernel/sched_debug.c
--- git/linux-2.6.sched-devel/kernel/sched_debug.c	2007-09-22 13:37:32.0 +0200
+++ linux-2.6.23-rc7.d/kernel/sched_debug.c	2007-09-21 12:12:27.0 +0200
@@ -67,7 +67,7 @@ print_task(struct seq_file *m, struct rq
 		(long long)(p->nvcsw + p->nivcsw),
 		p->prio);
 #ifdef CONFIG_SCHEDSTATS
-	SEQ_printf(m, "%15Ld.%06ld %15Ld.%06ld %15Ld.%06ld\n",
+	SEQ_printf(m, "%9Ld.%06ld %9Ld.%06ld %9Ld.%06ld\n",
 		SPLIT_NS(p->se.vruntime),
 		SPLIT_NS(p->se.sum_exec_runtime),
 		SPLIT_NS(p->se.sum_sleep_runtime));
@@ -83,10 +83,10 @@ static void print_rq(struct seq_file *m,
 	SEQ_printf(m,
 	"\nrunnable tasks:\n"
-	"            task   PID        tree-key  switches  prio"
-	"     exec-runtime        sum-exec       sum-sleep\n"
+	"           task   PID       tree-key  switches  prio"
+	"    exec-runtime       sum-exec      sum-sleep\n"
 	"------------------------------------------------------"
-	"--------------------------------");
+	"----------------------------------------------\n");
 
 	read_lock_irq(&tasklist_lock);
@@ -134,7 +134,7 @@ void print_cfs_rq(struct seq_file *m, in
 	spread0 = min_vruntime - rq0_min_vruntime;
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "spread0",
 			SPLIT_NS(spread0));
-	SEQ_printf(m, "  .%-30s: %ld\n", "spread0",
+	SEQ_printf(m, "  .%-30s: %ld\n", "nr_sync_min_vruntime",
 			cfs_rq->nr_sync_min_vruntime);
 }
diff -uprNX /root/dontdiff git/linux-2.6.sched-devel/kernel/sched_fair.c linux-2.6.23-rc7.d/kernel/sched_fair.c
--- git/linux-2.6.sched-devel/kernel/sched_fair.c	2007-09-22 13:37:32.0 +0200
+++ linux-2.6.23-rc7.d/kernel/sched_fair.c	2007-09-23 12:53:11.0 +0200
@@ -290,14 +290,19 @@ __update_curr(struct cfs_rq *cfs_rq, str
 static void sync_vruntime(struct cfs_rq *cfs_rq)
 {
 	struct rq *rq;
-	int cpu;
+	int cpu, wrote = 0;
 
 	for_each_online_cpu(cpu) {
 		rq = &per_cpu(runqueues, cpu);
+		if (!spin_trylock(&rq->lock))
+			continue;
 		cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime,
 						rq->cfs.min_vruntime);
+		spin_unlock(&rq->lock);
+		wrote++;
 	}
-	schedstat_inc(cfs_rq, nr_sync_min_vruntime);
+	if (wrote)
+		schedstat_inc(cfs_rq, nr_sync_min_vruntime);
 }
 
 static void update_curr(struct cfs_rq *cfs_rq)
@@ -306,8 +311,10 @@ static void update_curr(struct cfs_rq *c
 	u64 now = rq_of(cfs_rq)->clock;
 	unsigned long delta_exec;
 
-	if (unlikely(!curr))
+	if (unlikely(!cfs_rq->nr_running))
 		return sync_vruntime(cfs_rq);
+	if (unlikely(!curr))
+		return;
 
 	/*
	 * Get the amount of time the current task was running
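The trylock variant above can be sketched in user space, with pthread mutexes standing in for rq->lock; the structure and function names here are mine, and this only illustrates the pattern of skipping runqueues whose min_vruntime is mid-update:

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

#define NR_CPUS 4

struct fake_rq {
	pthread_mutex_t lock;
	uint64_t min_vruntime;
};

static struct fake_rq rqs[NR_CPUS];

/* Give cpu N a min_vruntime of (N + 1) * 10 for the demonstration. */
static void init_rqs(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		pthread_mutex_init(&rqs[cpu].lock, NULL);
		rqs[cpu].min_vruntime = (uint64_t)(cpu + 1) * 10;
	}
}

static uint64_t max_vruntime(uint64_t a, uint64_t b)
{
	return a > b ? a : b;
}

/*
 * Only sample min_vruntime from runqueues whose lock we can take, so we
 * never fold in a value that is in the middle of being updated.
 * Returns how many runqueues were actually sampled.
 */
static int sync_vruntime(uint64_t *min_vruntime)
{
	int cpu, wrote = 0;

	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		struct fake_rq *rq = &rqs[cpu];

		if (pthread_mutex_trylock(&rq->lock) != 0)
			continue;	/* busy: someone else holds it */
		*min_vruntime = max_vruntime(*min_vruntime, rq->min_vruntime);
		pthread_mutex_unlock(&rq->lock);
		wrote++;
	}
	return wrote;
}
```

Locking one runqueue before calling sync_vruntime() makes it contribute nothing, just as a busy rq->lock does in the patch.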
Re: [git] CFS-devel, group scheduler, fixes
On Fri, 2007-09-21 at 20:27 -0700, Tong Li wrote:
> Mike,
>
> Could you try this patch to see if it solves the latency problem?

No, but it helps some when running two un-pinned busy loops, one at nice
0, and the other at nice 19.  Yesterday I hit latencies of up to 1.2
_seconds_ doing this, and logging sched_debug and /proc/`pidof
Xorg`/sched from SCHED_RR shells.

se.wait_max : 164.242748
se.wait_max : 121.996128
se.wait_max : 194.464773
se.wait_max : 517.425411
se.wait_max : 131.453214
se.wait_max : 122.984190
se.wait_max : 111.729274
se.wait_max : 119.567580
se.wait_max : 126.980696
se.wait_max : 177.241452
se.wait_max : 129.604329
se.wait_max : 119.631657

It doesn't help at all with this oddity while running same:

[EMAIL PROTECTED]: now is the time fr aall good men to come to the aid of their country

That was a nice 0 shell window.  I'm not a great typist, but I ain't
_that_ bad :)

	-Mike
Re: [git] CFS-devel, group scheduler, fixes
Mike,

Could you try this patch to see if it solves the latency problem?

	tong

Changes:

1. Modified vruntime adjustment logic in set_task_cpu(). See comments
   in code. This fixed the negative vruntime problem.

2. This code in update_curr() seems to be wrong:

	if (unlikely(!curr))
		return sync_vruntime(cfs_rq);

   cfs_rq->curr can be NULL even if cfs_rq->nr_running is non-zero
   (e.g., when an RT task is running). We only want to call
   sync_vruntime when cfs_rq->nr_running is 0. This fixed the large
   latency problem (at least in my tests).

Signed-off-by: Tong Li <[EMAIL PROTECTED]>
---
diff -uprN linux-2.6-sched-devel-orig/kernel/sched.c linux-2.6-sched-devel/kernel/sched.c
--- linux-2.6-sched-devel-orig/kernel/sched.c	2007-09-20 12:15:41.0 -0700
+++ linux-2.6-sched-devel/kernel/sched.c	2007-09-21 19:40:08.0 -0700
@@ -1033,9 +1033,20 @@ void set_task_cpu(struct task_struct *p,
 	if (p->se.block_start)
 		p->se.block_start -= clock_offset;
 #endif
-	if (likely(new_rq->cfs.min_vruntime))
-		p->se.vruntime -= old_rq->cfs.min_vruntime -
-						new_rq->cfs.min_vruntime;
+	/*
+	 * Reset p's vruntime if it moves to new_cpu whose min_vruntime is
+	 * 100,000,000 (equivalent to 100ms for nice-0 tasks) larger or
+	 * smaller than p's vruntime. This improves interactivity when
+	 * pinned and unpinned tasks co-exist. For example, pinning a few
+	 * tasks to a CPU can cause its min_vruntime to be much smaller
+	 * than the other CPUs'. If a task moves to this CPU, its vruntime
+	 * can be so large it won't be scheduled until the locally pinned
+	 * tasks' vruntimes catch up, causing large delays.
+	 */
+	if (unlikely(old_cpu != new_cpu && p->se.vruntime &&
+		     (p->se.vruntime > new_rq->cfs.min_vruntime + 100000000 ||
+		      p->se.vruntime + 100000000 < new_rq->cfs.min_vruntime)))
+		p->se.vruntime = new_rq->cfs.min_vruntime;
 
 	__set_task_cpu(p, new_cpu);
 }
@@ -1599,6 +1610,7 @@ static void __sched_fork(struct task_str
 	p->se.exec_start		= 0;
 	p->se.sum_exec_runtime		= 0;
 	p->se.prev_sum_exec_runtime	= 0;
+	p->se.vruntime			= 0;
 
 #ifdef CONFIG_SCHEDSTATS
 	p->se.wait_start		= 0;
diff -uprN linux-2.6-sched-devel-orig/kernel/sched_fair.c linux-2.6-sched-devel/kernel/sched_fair.c
--- linux-2.6-sched-devel-orig/kernel/sched_fair.c	2007-09-20 12:15:41.0 -0700
+++ linux-2.6-sched-devel/kernel/sched_fair.c	2007-09-21 17:23:09.0 -0700
@@ -306,8 +306,10 @@ static void update_curr(struct cfs_rq *c
 	u64 now = rq_of(cfs_rq)->clock;
 	unsigned long delta_exec;
 
-	if (unlikely(!curr))
+	if (unlikely(!cfs_rq->nr_running))
 		return sync_vruntime(cfs_rq);
+	if (unlikely(!curr))
+		return;
 
 	/*
	 * Get the amount of time the current task was running
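The clamp in the set_task_cpu() hunk boils down to the following helper (a sketch; the function name is mine, and the threshold is the 100,000,000 ns value named in the patch comment):

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t u64;

/* Threshold from the patch comment: 100,000,000 ns = 100ms at nice 0. */
#define VRUNTIME_RESET_THRESHOLD 100000000ULL

/*
 * On migration, if the task's vruntime is more than the threshold away
 * from the destination runqueue's min_vruntime in either direction,
 * reset it to that min_vruntime: a far-ahead task would otherwise wait
 * for the destination CPU to catch up, and a far-behind task would
 * starve everyone else there.
 */
static u64 migrate_vruntime(u64 task_vruntime, u64 dst_min_vruntime)
{
	if (task_vruntime > dst_min_vruntime + VRUNTIME_RESET_THRESHOLD ||
	    task_vruntime + VRUNTIME_RESET_THRESHOLD < dst_min_vruntime)
		return dst_min_vruntime;
	return task_vruntime;
}
```

Within the 100ms window the task keeps its relative position; outside it, in either direction, it restarts at the destination's front edge.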
Re: [git] CFS-devel, group scheduler, fixes
On Fri, Sep 21, 2007 at 04:40:55AM +0200, Mike Galbraith wrote:
> On Thu, 2007-09-20 at 21:48 +0200, Willy Tarreau wrote:
>
> > I don't know if this is relevant, but 4294966399 in nr_uninterruptible
> > for cpu#0 equals -897, exactly the negation of cpu1.nr_uninterruptible.
> > I don't know if this rings a bell for someone or if it's a completely
> > useless comment, but just in case...
>
> A task can block on one cpu, and wake up on another, which isn't
> tracked, hence the fishy looking numbers.  The true nr_uninterruptible
> is the sum of all.

OK, thanks Mike for this clarification.

Willy
Re: [git] CFS-devel, group scheduler, fixes
On Thu, 2007-09-20 at 21:48 +0200, Willy Tarreau wrote:
> I don't know if this is relevant, but 4294966399 in nr_uninterruptible
> for cpu#0 equals -897, exactly the negation of cpu1.nr_uninterruptible.
> I don't know if this rings a bell for someone or if it's a completely
> useless comment, but just in case...

A task can block on one cpu, and wake up on another, which isn't
tracked, hence the fishy looking numbers.  The true nr_uninterruptible
is the sum of all.

	-Mike
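The per-cpu counters cancel in the sum. Using 32-bit counters for illustration (the kernel's field is an unsigned long; the arithmetic is the same), cpu#0's 4294966399 plus cpu#1's 897 wraps back to the true total:

```c
#include <assert.h>
#include <stdint.h>

/*
 * A task that blocked on cpu0 but woke on cpu1 decrements cpu0's
 * counter below zero, which wraps to a huge unsigned value like
 * 4294966399. Summing over all CPUs cancels the wraparound, so only
 * the total is meaningful.
 */
static uint32_t total_uninterruptible(const uint32_t *per_cpu, int ncpus)
{
	uint32_t sum = 0;

	for (int i = 0; i < ncpus; i++)
		sum += per_cpu[i];	/* modular arithmetic: wrap cancels */
	return sum;
}
```

With the two values from the dump, 4294966399 reinterpreted as signed is exactly -897, and the two-CPU total is 0.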
Re: [git] CFS-devel, group scheduler, fixes
On Thu, Sep 20, 2007 at 09:15:07AM +0200, Mike Galbraith wrote:
> But, I did just manage to trigger some horrid behavior, and log it.  I
> modified the kernel to print task's actual tree key instead of their
> current vruntime, and was watching that while make -j2 was running (and
> not seeing anything very interesting), when on a lark, I restarted
> SuSE's system updater thingy.  That beast chews 100% CPU for so long at
> startup that I long ago got annoyed, and changed it to run at nice 19.
> Anyway, when it started, interactivity went to hell in the proverbial
> hand-basket, and the sched_debug log shows some interesting results..
> like spread0 hitting -13659412644, and cc1 being keyed at -3867063305.
>
> cpu#0, 2992.608 MHz
>   .nr_running         : 4
>   .load               : 4096
>   .nr_switches        : 1105882
>   .nr_load_updates    : 735146
>   .nr_uninterruptible : 4294966399
(...)
> cpu#1, 2992.608 MHz
>   .nr_running         : 5
>   .load               : 6208
>   .nr_switches        : 1012995
>   .nr_load_updates    : 747540
>   .nr_uninterruptible : 897

I don't know if this is relevant, but 4294966399 in nr_uninterruptible
for cpu#0 equals -897, exactly the negation of cpu1.nr_uninterruptible.
I don't know if this rings a bell for someone or if it's a completely
useless comment, but just in case...

HTH,
Willy
Re: [git] CFS-devel, group scheduler, fixes
On Thu, 2007-09-20 at 09:51 +0200, Ingo Molnar wrote:
> * Mike Galbraith <[EMAIL PROTECTED]> wrote:
>
> > [...] I modified the kernel to print task's actual tree key instead
> > of their current vruntime, [...]
>
> btw., that looks like a debug printout bug in sched-devel.git - could
> you send me your fix? I've pushed out the latest sched-devel (ontop of
> -rc7) to the usual place:

Well, I temporarily added a key field, which is only used to store the
key for debug...

>    git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched.git
>
> (note that there are some debug printout updates which might clash with
> your fix)

I'll pull/build this and test my migration problem there.  All I have to
do is to add a nice 19 chew-max to my make -j2, and all hell breaks
loose.  Always suspecting myself first in the search for dirty rotten
SOB who broke local scheduler :)

I nuked the migration fix.  Looks like I'm not the SOB this time.. it
got much worse, with max Xorg latency of >500ms.

	-Mike
Re: [git] CFS-devel, group scheduler, fixes
* Mike Galbraith <[EMAIL PROTECTED]> wrote:

> [...] I modified the kernel to print task's actual tree key instead
> of their current vruntime, [...]

btw., that looks like a debug printout bug in sched-devel.git - could
you send me your fix? I've pushed out the latest sched-devel (ontop of
-rc7) to the usual place:

   git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched.git

(note that there are some debug printout updates which might clash with
your fix)

	Ingo
Re: [git] CFS-devel, group scheduler, fixes
On Thu, 2007-09-20 at 06:55 +0200, Mike Galbraith wrote:
> On Wed, 2007-09-19 at 10:06 -0700, Tong Li wrote:
>
> > Were the experiments run on a 2-CPU system?
>
> Yes.
>
> > When Xorg experiences large wait time, is it on the same CPU that has
> > the two pinned tasks? If this is the case, then the problem could be X
> > somehow advanced faster and got a larger vruntime than the two pinned
> > tasks, so it had to wait for the pinned ones to catch up before it got
> > a chance to be scheduled.
>
> Good question.  I've not yet been able to capture any event where
> vruntimes are that far apart in sched_debug.

But, I did just manage to trigger some horrid behavior, and log it.  I
modified the kernel to print task's actual tree key instead of their
current vruntime, and was watching that while make -j2 was running (and
not seeing anything very interesting), when on a lark, I restarted
SuSE's system updater thingy.  That beast chews 100% CPU for so long at
startup that I long ago got annoyed, and changed it to run at nice 19.
Anyway, when it started, interactivity went to hell in the proverbial
hand-basket, and the sched_debug log shows some interesting results..
like spread0 hitting -13659412644, and cc1 being keyed at -3867063305.
cpu#0, 2992.608 MHz
  .nr_running             : 4
  .load                   : 4096
  .nr_switches            : 1105882
  .nr_load_updates        : 735146
  .nr_uninterruptible     : 4294966399
  .jiffies                : 1936201
  .next_balance           : 1936276
  .curr->pid              : 27004
  .clock                  : 735034802877
  .idle_clock             : 0
  .prev_clock_raw         : 2259213083886
  .clock_warps            : 0
  .clock_overflows        : 677631
  .clock_deep_idle_events : 0
  .clock_max_delta        : 999848
  .cpu_load[0]            : 4096
  .cpu_load[1]            : 3960
  .cpu_load[2]            : 3814
  .cpu_load[3]            : 3539
  .cpu_load[4]            : 3153
cfs_rq
  .exec_clock             : 623175870707
  .MIN_vruntime           : 3791173262297
  .min_vruntime           : 3791173259723
  .max_vruntime           : 3791173264579
  .spread                 : 2282
  .spread0                : 0
  .nr_sync_min_vruntime   : 296756

runnable tasks:
  task   PID    tree-key  switches  prio   exec-runtime  sum-exec/sum-sleep
  -------------------------------------------------------------------------
  bash   6338       3212       554   120  3791173262935  104271360645224098563
R cc1   27004   -7509103         6   120  3791165750620  1539602913300466
  make  27006       4856         6   120  3791173264579  9540879 6458208
  make  27010       2574         3   120  3791173262297  8897680 0

cpu#1, 2992.608 MHz
  .nr_running             : 5
  .load                   : 6208
  .nr_switches            : 1012995
  .nr_load_updates        : 747540
  .nr_uninterruptible     : 897
  .jiffies                : 1936201
  .next_balance           : 1936303
  .curr->pid              : 27012
  .clock                  : 747426624818
  .idle_clock             : 0
  .prev_clock_raw         : 2259213140042
  .clock_warps            : 0
  .clock_overflows        : 582091
  .clock_deep_idle_events : 0
  .clock_max_delta        : 999848
  .cpu_load[0]            : 6208
  .cpu_load[1]            : 4545
  .cpu_load[2]            : 3810
  .cpu_load[3]            : 3065
  .cpu_load[4]            : 2397
cfs_rq
  .exec_clock             : 581940255731
  .MIN_vruntime           : 3791835890838
  .min_vruntime           : 3791854778322
  .max_vruntime           : 3791902652145
  .spread                 : 66761307
  .spread0                : 681518599
  .nr_sync_min_vruntime   : 256743

runnable tasks:
  task            PID    tree-key  switches  prio   exec-runtime  sum-exec/sum-sleep
  ----------------------------------------------------------------------------------
  Xorg           5740   -18887484    157025   115  3791835890838  46597617638567698716
  update-status 26093    48184396      1537   139  3791902652145  5709334592 2622161813
  sh            27011    -6808461         2   120  3791847969861  4162855 1457101
R cat           27012   -19965525
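The negative keys in the snippet are consistent with the printed tree-key being the task's vruntime relative to its runqueue's min_vruntime (a sketch of what CFS's entity_key() computed at the time; the helper name and argument order here are mine). A negative key just means the task sits left of min_vruntime in the rbtree:

```c
#include <assert.h>
#include <stdint.h>

typedef int64_t s64;
typedef uint64_t u64;

/*
 * Tree key = vruntime - min_vruntime, computed in u64 and interpreted
 * as signed, so tasks behind min_vruntime get negative keys.
 */
static s64 entity_key(u64 min_vruntime, u64 vruntime)
{
	return (s64)(vruntime - min_vruntime);
}
```

Checking against the dump above: cc1's runtime column minus cpu#0's min_vruntime (3791165750620 − 3791173259723) is exactly its key of −7509103, and the same arithmetic reproduces Xorg's −18887484 and bash's 3212.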
Re: [git] CFS-devel, group scheduler, fixes
On Wed, 2007-09-19 at 10:06 -0700, Tong Li wrote:
> Were the experiments run on a 2-CPU system?

Yes.

> When Xorg experiences large wait time, is it on the same CPU that has
> the two pinned tasks? If this is the case, then the problem could be X
> somehow advanced faster and got a larger vruntime than the two pinned
> tasks, so it had to wait for the pinned ones to catch up before it got
> a chance to be scheduled.

Good question.  I've not yet been able to capture any event where
vruntimes are that far apart in sched_debug.

	-Mike
Re: [git] CFS-devel, group scheduler, fixes
On Wed, 19 Sep 2007, Siddha, Suresh B wrote:
> On Tue, Sep 18, 2007 at 11:03:59PM -0700, Tong Li wrote:
> > This patch attempts to improve CFS's SMP global fairness based on the
> > new virtual time design.
> >
> > Removed vruntime adjustment in set_task_cpu() as it skews global
> > fairness.
> >
> > Modified small_imbalance logic in find_busiest_group(). If there's
> > small imbalance, move tasks from busiest to local sched_group only if
> > the local group contains a CPU whose min_vruntime is the maximum among
> > all CPUs in the same sched_domain. This prevents any CPU from advancing
> > too far ahead in virtual time and avoids tasks thrashing between two
> > CPUs without utilizing other CPUs in the system. For example, for 10
> > tasks on 8 CPUs, since the load is not evenly divisible by the number
> > of CPUs, we want the extra load to have a fair use of every CPU in the
> > system.
> >
> > Tested with a microbenchmark running 10 nice-0 tasks on 8 CPUs. Each task
>
> Just as an experiment, can you run 82 tasks on 8 CPUs. Current
> imbalance_pct logic will not detect and fix the global fairness issue
> even with this patch.

Right. For 82 tasks on 8 CPUs, the max error is 7.07% and min is
-14.12%. It could be fixed by removing the imbalance_pct check (i.e.,
making imbalance_pct effectively 1). The cost would be more migrations,
so I wasn't sure if that was the approach people would agree on.

> > 	if (*imbalance < busiest_load_per_task) {
> > -		unsigned long tmp, pwr_now, pwr_move;
> > -		unsigned int imbn;
> > -
> >  small_imbalance:
> > -		pwr_move = pwr_now = 0;
> > -		imbn = 2;
> > -		if (this_nr_running) {
> > -			this_load_per_task /= this_nr_running;
> > -			if (busiest_load_per_task > this_load_per_task)
> > -				imbn = 1;
> > -		} else
> > -			this_load_per_task = SCHED_LOAD_SCALE;
> > -
> > -		if (max_load - this_load + SCHED_LOAD_SCALE_FUZZ >=
> > -					busiest_load_per_task * imbn) {
> > -			*imbalance = busiest_load_per_task;
> > -			return busiest;
> > -		}
>
> This patch removes quite a few lines and all this logic is not for
> fairness :( Especially the above portion handles some of the HT/MC
> optimizations.

What are the optimizations? I was trying to simplify the code to use
only vruntime to control balancing when there's small imbalance, but if
it breaks non-fairness related optimizations, we can certainly add the
code back.

	tong
Re: [git] CFS-devel, group scheduler, fixes
On Tue, Sep 18, 2007 at 11:03:59PM -0700, Tong Li wrote:
> This patch attempts to improve CFS's SMP global fairness based on the new
> virtual time design.
>
> Removed vruntime adjustment in set_task_cpu() as it skews global fairness.
>
> Modified small_imbalance logic in find_busiest_group(). If there's small
> imbalance, move tasks from busiest to local sched_group only if the local
> group contains a CPU whose min_vruntime is the maximum among all CPUs in
> the same sched_domain. This prevents any CPU from advancing too far ahead
> in virtual time and avoids tasks thrashing between two CPUs without
> utilizing other CPUs in the system. For example, for 10 tasks on 8 CPUs,
> since the load is not evenly divisible by the number of CPUs, we want the
> extra load to have a fair use of every CPU in the system.
>
> Tested with a microbenchmark running 10 nice-0 tasks on 8 CPUs. Each task

Just as an experiment, can you run 82 tasks on 8 CPUs. Current imbalance_pct logic will not detect and fix the global fairness issue even with this patch.

>         if (*imbalance < busiest_load_per_task) {
> -               unsigned long tmp, pwr_now, pwr_move;
> -               unsigned int imbn;
> -
> small_imbalance:
> -               pwr_move = pwr_now = 0;
> -               imbn = 2;
> -               if (this_nr_running) {
> -                       this_load_per_task /= this_nr_running;
> -                       if (busiest_load_per_task > this_load_per_task)
> -                               imbn = 1;
> -               } else
> -                       this_load_per_task = SCHED_LOAD_SCALE;
> -
> -               if (max_load - this_load + SCHED_LOAD_SCALE_FUZZ >=
> -                               busiest_load_per_task * imbn) {
> -                       *imbalance = busiest_load_per_task;
> -                       return busiest;
> -               }

This patch removes quite a few lines, and all this logic is not for fairness :( Especially the above portion handles some of the HT/MC optimizations.

> -
> -               /*
> -                * OK, we don't have enough imbalance to justify moving tasks,
> -                * however we may be able to increase total CPU power used by
> -                * moving them.
> +               /*
> +                * When there's small imbalance, move tasks only if this
> +                * sched_group contains a CPU whose min_vruntime is the
> +                * maximum among all CPUs in the same domain.
>                  */
> -
> -               pwr_now += busiest->__cpu_power *
> -                               min(busiest_load_per_task, max_load);
> -               pwr_now += this->__cpu_power *
> -                               min(this_load_per_task, this_load);
> -               pwr_now /= SCHED_LOAD_SCALE;
> -
> -               /* Amount of load we'd subtract */
> -               tmp = sg_div_cpu_power(busiest,
> -                               busiest_load_per_task * SCHED_LOAD_SCALE);
> -               if (max_load > tmp)
> -                       pwr_move += busiest->__cpu_power *
> -                               min(busiest_load_per_task, max_load - tmp);
> -
> -               /* Amount of load we'd add */
> -               if (max_load * busiest->__cpu_power <
> -                               busiest_load_per_task * SCHED_LOAD_SCALE)
> -                       tmp = sg_div_cpu_power(this,
> -                                       max_load * busiest->__cpu_power);
> -               else
> -                       tmp = sg_div_cpu_power(this,
> -                                       busiest_load_per_task * SCHED_LOAD_SCALE);
> -               pwr_move += this->__cpu_power *
> -                               min(this_load_per_task, this_load + tmp);
> -               pwr_move /= SCHED_LOAD_SCALE;
> -
> -               /* Move if we gain throughput */
> -               if (pwr_move > pwr_now)
> +               if (max_vruntime_group == this)
>                         *imbalance = busiest_load_per_task;
> +               else
> +                       *imbalance = 0;

Not sure how this all interacts when some of the cpus are idle. I have to look more closely.

thanks,
suresh
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [git] CFS-devel, group scheduler, fixes
On Wed, 19 Sep 2007, Mike Galbraith wrote:

> On Wed, 2007-09-19 at 09:51 +0200, Mike Galbraith wrote:
> > The scenario which was previously cured was this:
> > taskset -c 1 nice -n 0 ./massive_intr 2
> > taskset -c 1 nice -n 5 ./massive_intr 2
> > click link
> > (http://pages.cs.wisc.edu/~shubu/talks/cachescrub-prdc2004.ppt) to bring
> > up browser and OpenOffice Impress.
> >
> > Xorg (at nice -5 + above scenario) latency samples:
> > se.wait_max : 57985337
> > se.wait_max : 25163510
> > se.wait_max : 37005538
> > se.wait_max : 66986511
> > se.wait_max : 53990868
> > se.wait_max : 80976761
> > se.wait_max : 96967501
> > se.wait_max : 80989254
> > se.wait_max : 53990897
> > se.wait_max : 181963905
> > se.wait_max : 85985181
>
> To be doubly sure of the effect on the pinned tasks + migrating Xorg
> scenario, I just ran the above test 10 times with virgin devel source.
> Maximum Xorg latency was 20ms.
>
> -Mike

Yeah, the patch was a first attempt at getting better global fairness for unpinned tasks. I expected there'd be latency problems when unpinned and pinned tasks co-exist. The original code for vruntime adjustment in set_task_cpu() helped alleviate this problem, so removing it in my patch would definitely bring the problem back. On the other hand, leaving it there broke global fairness in my fairness benchmark. So we'd need a better compromise.

Were the experiments run on a 2-CPU system? When Xorg experiences a large wait time, is it on the same CPU that has the two pinned tasks? If so, the problem could be that X somehow advanced faster and got a larger vruntime than the two pinned tasks, so it had to wait for the pinned ones to catch up before it got a chance to be scheduled.

tong
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [git] CFS-devel, group scheduler, fixes
On Wed, 2007-09-19 at 09:51 +0200, Mike Galbraith wrote:
> The scenario which was previously cured was this:
> taskset -c 1 nice -n 0 ./massive_intr 2
> taskset -c 1 nice -n 5 ./massive_intr 2
> click link
> (http://pages.cs.wisc.edu/~shubu/talks/cachescrub-prdc2004.ppt) to bring
> up browser and OpenOffice Impress.
>
> Xorg (at nice -5 + above scenario) latency samples:
> se.wait_max : 57985337
> se.wait_max : 25163510
> se.wait_max : 37005538
> se.wait_max : 66986511
> se.wait_max : 53990868
> se.wait_max : 80976761
> se.wait_max : 96967501
> se.wait_max : 80989254
> se.wait_max : 53990897
> se.wait_max : 181963905
> se.wait_max : 85985181

To be doubly sure of the effect on the pinned tasks + migrating Xorg scenario, I just ran the above test 10 times with virgin devel source. Maximum Xorg latency was 20ms.

-Mike
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [git] CFS-devel, group scheduler, fixes
On Wed, 2007-09-19 at 08:28 +0200, Mike Galbraith wrote:
> Greetings,
>
> On Tue, 2007-09-18 at 23:03 -0700, Tong Li wrote:
> > This patch attempts to improve CFS's SMP global fairness based on the new
> > virtual time design.
> >
> > Removed vruntime adjustment in set_task_cpu() as it skews global fairness.
>
> Since I'm (still) encountering Xorg latency issues (which go away if
> load is hefty instead of light) even with that migration adjustment and
> synchronization, and am having difficulty nailing it down to a specific
> event, I'll test this immediately.

(had to apply manually to freshly pulled tree)

Drat. This didn't cure the latency hits with Xorg at nice -5 running with a make -j2 at nice 0, but seems to have reinstated a latency issue which was previously cured.

Xorg 1 sec. max latency samples (trimmed to only show >20ms latencies):
se.wait_max : 23343582
se.wait_max : 20119460
se.wait_max : 20771573
se.wait_max : 21084567
se.wait_max : 31338500
se.wait_max : 35368148
se.wait_max : 39199642
se.wait_max : 22889062
se.wait_max : 40285501
se.wait_max : 21002720
se.wait_max : 21002266
se.wait_max : 21680578
se.wait_max : 22012913
se.wait_max : 94646331
se.wait_max : 29003693
se.wait_max : 20812613

(boot with maxcpus=1 or nail X+make to one cpu and these latencies are gone, so it does seem to be the migration logic - why I was so interested in testing your patch)

The scenario which was previously cured was this:
taskset -c 1 nice -n 0 ./massive_intr 2
taskset -c 1 nice -n 5 ./massive_intr 2
click link (http://pages.cs.wisc.edu/~shubu/talks/cachescrub-prdc2004.ppt) to bring up browser and OpenOffice Impress.

Xorg (at nice -5 + above scenario) latency samples:
se.wait_max : 57985337
se.wait_max : 25163510
se.wait_max : 37005538
se.wait_max : 66986511
se.wait_max : 53990868
se.wait_max : 80976761
se.wait_max : 96967501
se.wait_max : 80989254
se.wait_max : 53990897
se.wait_max : 181963905
se.wait_max : 85985181

-Mike
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [git] CFS-devel, group scheduler, fixes
Greetings, On Tue, 2007-09-18 at 23:03 -0700, Tong Li wrote: > This patch attempts to improve CFS's SMP global fairness based on the new > virtual time design. > > Removed vruntime adjustment in set_task_cpu() as it skews global fairness. Since I'm (still) encountering Xorg latency issues (which go away if load is hefty instead of light) even with that migration adjustment and synchronization, and am having difficulty nailing it down to a specific event, I'll test this immediately. -Mike - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [git] CFS-devel, group scheduler, fixes
This patch attempts to improve CFS's SMP global fairness based on the new virtual time design.

Removed vruntime adjustment in set_task_cpu() as it skews global fairness.

Modified small_imbalance logic in find_busiest_group(). If there's a small imbalance, move tasks from busiest to local sched_group only if the local group contains a CPU whose min_vruntime is the maximum among all CPUs in the same sched_domain. This prevents any CPU from advancing too far ahead in virtual time and avoids tasks thrashing between two CPUs without utilizing other CPUs in the system. For example, for 10 tasks on 8 CPUs, since the load is not evenly divisible by the number of CPUs, we want the extra load to make fair use of every CPU in the system.

Tested with a microbenchmark running 10 nice-0 tasks on 8 CPUs. Each task runs a trivial while (1) loop. The benchmark runs for 300 seconds and, at every T seconds, it samples for each task the following:

1. Actual CPU time the task received during the past 60 seconds.
2. Ideal CPU time it would receive under a perfect fair scheduler.
3. Lag = ideal time - actual time, where a positive lag means the task received less CPU time than its fair share and negative means it received more.
4. Error = lag / ideal time

The following shows the max and min errors among all samples for all tasks before and after applying the patch:

Before:
Sampling interval: 30 s
Max error: 100.00%
Min error: -25.00%
Sampling interval: 10 s
Max error: 27.62%
Min error: -25.00%

After:
Sampling interval: 30 s
Max error: 1.33%
Min error: -1.29%
Sampling interval: 10 s
Max error: 7.38%
Min error: -6.25%

The errors for the 10 s sampling interval are still not as small as I had hoped, but it looks like there is some improvement.
tong

Signed-off-by: Tong Li <[EMAIL PROTECTED]>
---
--- linux-2.6-sched-devel-orig/kernel/sched.c	2007-09-15 22:00:48.0 -0700
+++ linux-2.6-sched-devel/kernel/sched.c	2007-09-18 22:10:52.0 -0700
@@ -1033,9 +1033,6 @@ void set_task_cpu(struct task_struct *p,
 	if (p->se.block_start)
 		p->se.block_start -= clock_offset;
 #endif
-	if (likely(new_rq->cfs.min_vruntime))
-		p->se.vruntime -= old_rq->cfs.min_vruntime -
-				new_rq->cfs.min_vruntime;
 	__set_task_cpu(p, new_cpu);
 }
@@ -1599,6 +1596,7 @@ static void __sched_fork(struct task_str
 	p->se.exec_start = 0;
 	p->se.sum_exec_runtime = 0;
 	p->se.prev_sum_exec_runtime = 0;
+	p->se.vruntime = 0;
 #ifdef CONFIG_SCHEDSTATS
 	p->se.wait_start = 0;
@@ -2277,6 +2275,8 @@ find_busiest_group(struct sched_domain *
 	int *sd_idle, cpumask_t *cpus, int *balance)
 {
 	struct sched_group *busiest = NULL, *this = NULL, *group = sd->groups;
+	struct sched_group *max_vruntime_group = NULL;
+	u64 max_vruntime = 0;
 	unsigned long max_load, avg_load, total_load, this_load, total_pwr;
 	unsigned long max_pull;
 	unsigned long busiest_load_per_task, busiest_nr_running;
@@ -2322,6 +2322,11 @@ find_busiest_group(struct sched_domain *
 	rq = cpu_rq(i);
+	if (rq->cfs.min_vruntime > max_vruntime) {
+		max_vruntime = rq->cfs.min_vruntime;
+		max_vruntime_group = group;
+	}
+
 	if (*sd_idle && rq->nr_running)
 		*sd_idle = 0;
@@ -2483,59 +2488,16 @@ group_next:
 	 * moved
 	 */
 	if (*imbalance < busiest_load_per_task) {
-		unsigned long tmp, pwr_now, pwr_move;
-		unsigned int imbn;
-
 small_imbalance:
-		pwr_move = pwr_now = 0;
-		imbn = 2;
-		if (this_nr_running) {
-			this_load_per_task /= this_nr_running;
-			if (busiest_load_per_task > this_load_per_task)
-				imbn = 1;
-		} else
-			this_load_per_task = SCHED_LOAD_SCALE;
-
-		if (max_load - this_load + SCHED_LOAD_SCALE_FUZZ >=
-				busiest_load_per_task * imbn) {
-			*imbalance = busiest_load_per_task;
-			return busiest;
-		}
-
-		/*
-		 * OK, we don't have enough imbalance to justify moving tasks,
-		 * however we may be able to increase total CPU power used by
-		 * moving them.
+		/*
+		 * When there's small imbalance, move tasks only if this
+		 * sched_group contains a CPU whose min_vruntime is the
+		 * maximum among all CPUs in the same domain.
 		 */
-
-		pwr_now += busiest->__cpu_power *
-				min(busiest_load_per_task, max_load);
-
Re: [git] CFS-devel, group scheduler, fixes
On Wed, 2007-09-19 at 10:06 -0700, Tong Li wrote:
> Were the experiments run on a 2-CPU system?

Yes.

> When Xorg experiences large wait time, is it on the same CPU that has
> the two pinned tasks? If this is the case, then the problem could be X
> somehow advanced faster and got a larger vruntime than the two pinned
> tasks, so it had to wait for the pinned ones to catch up before it got
> a chance to be scheduled.

Good question. I've not yet been able to capture any event where vruntimes are that far apart in sched_debug.

-Mike
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [git] CFS-devel, group scheduler, fixes
On Tue, Sep 18, 2007 at 10:22:43PM +0200, Ingo Molnar wrote: > (I have not tested the group scheduling bits but perhaps Srivatsa would > like to do that?) Ingo, I plan to test it today and send you any updates that may be required. -- Regards, vatsa - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [git] CFS-devel, group scheduler, fixes
* Dmitry Adamushko <[EMAIL PROTECTED]> wrote: > (3) > > rework enqueue/dequeue_entity() to get rid of > sched_class::set_curr_task(). This simplifies sched_setscheduler(), > rt_mutex_setprio() and sched_move_tasks(). ah. This makes your ready-queue patch a code size win. Applied. Ingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [git] CFS-devel, group scheduler, fixes
* dimm <[EMAIL PROTECTED]> wrote: > (1) > > due to the fact that we no longer keep the 'current' within the tree, > dequeue/enqueue_entity() is useless for the 'current' in > task_new_fair(). We are about to reschedule and > sched_class->put_prev_task() will put the 'current' back into the > tree, based on its new key. > > Signed-off-by: Dmitry Adamushko <[EMAIL PROTECTED]> thanks, applied. Ingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [git] CFS-devel, group scheduler, fixes
* Dmitry Adamushko <[EMAIL PROTECTED]> wrote:

> (2)
>
> the 'p' (task_struct) parameter in the sched_class::yield_task() is
> redundant as the caller is always the 'current'. Get rid of it.
>
> Signed-off-by: Dmitry Adamushko <[EMAIL PROTECTED]>

ah - good one! I completely forgot about that parameter and it's indeed
unneeded. Applied.

	Ingo
Re: [git] CFS-devel, group scheduler, fixes
* dimm <[EMAIL PROTECTED]> wrote:

> [ well, don't expect to find here anything like RDCFS (no, 'D' does
>   not stand for 'dumb'!). I was focused on more prosaic things in the
>   mean time so just didn't have time for writing it.. ]
>
> here is a few cleanup/simplification/optimization(s) based on the
> recent modifications in the sched-dev tree.
>
> (1) optimize task_new_fair()
> (2) simplify yield_task()
> (3) rework enqueue/dequeue_entity() to get rid of
>     sched_class::set_curr_task()

the queue with your enhancements and simplifications applied looks good
here, and it booted fine on two testboxes. I've updated the
sched-devel.git tree:

  git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-devel.git

(below is the full shortlog over current upstream.)

(I have not tested the group scheduling bits but perhaps Srivatsa would
like to do that?)

	Ingo

-->
Dmitry Adamushko (9):
      sched: clean up struct load_stat
      sched: clean up schedstat block in dequeue_entity()
      sched: sched_setscheduler() fix
      sched: add set_curr_task() calls
      sched: do not keep current in the tree and get rid of sched_entity::fair_key
      sched: yield-workaround update
      sched: optimize task_new_fair()
      sched: simplify sched_class::yield_task()
      sched: rework enqueue/dequeue_entity() to get rid of set_curr_task()

Ingo Molnar (29):
      sched: fix new-task method
      sched: resched task in task_new_fair()
      sched: small sched_debug cleanup
      sched: debug: track maximum 'slice'
      sched: uniform tunings
      sched: use constants if !CONFIG_SCHED_DEBUG
      sched: remove stat_gran
      sched: remove precise CPU load
      sched: remove precise CPU load calculations #2
      sched: track cfs_rq->curr on !group-scheduling too
      sched: cleanup: simplify cfs_rq_curr() methods
      sched: uninline __enqueue_entity()/__dequeue_entity()
      sched: speed up update_load_add/_sub()
      sched: clean up calc_weighted()
      sched: introduce se->vruntime
      sched: move sched_feat() definitions
      sched: optimize vruntime based scheduling
      sched: simplify check_preempt() methods
      sched: wakeup granularity fix
      sched: add se->vruntime debugging
      sched: sync up ->min_vruntime when going idle
      sched: add more vruntime statistics
      sched: debug: update exec_clock only when SCHED_DEBUG
      sched: remove wait_runtime limit
      sched: remove wait_runtime fields and features
      sched: x86: allow single-depth wchan output
      sched: yield workaround
      sched: fix delay accounting performance regression
      sched: disable START_DEBIT

Matthias Kaehlcke (1):
      sched: use list_for_each_entry_safe() in __wake_up_common()

Mike Galbraith (1):
      sched: fix SMP migration latencies

Peter Zijlstra (7):
      sched: simplify SCHED_FEAT_* code
      sched: new task placement for vruntime
      sched: simplify adaptive latency
      sched: clean up new task placement
      sched: add tree based averages
      sched: handle vruntime overflow
      sched: better min_vruntime tracking

Srivatsa Vaddagiri (1):
      sched: group-scheduler core

 arch/i386/Kconfig     |   11
 include/linux/sched.h |   24 -
 init/Kconfig          |    9
 kernel/sched.c        |  544
 kernel/sched_debug.c  |  100 ++
 kernel/sched_fair.c   |  752 --
 kernel/sched_rt.c     |    4
 kernel/sched_stats.h  |    4
 kernel/sysctl.c       |   35 +-
 9 files changed, 763 insertions(+), 720 deletions(-)
Re: [git] CFS-devel, group scheduler, fixes
(3)

rework enqueue/dequeue_entity() to get rid of
sched_class::set_curr_task(). This simplifies sched_setscheduler(),
rt_mutex_setprio() and sched_move_tasks().

Signed-off-by: Dmitry Adamushko <[EMAIL PROTECTED]>
Signed-off-by: Srivatsa Vaddagiri <[EMAIL PROTECTED]>

---

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 3728cd6..1094804 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -870,7 +870,6 @@ struct sched_class {
 			struct sched_domain *sd, enum cpu_idle_type idle,
 			int *all_pinned, int *this_best_prio);
 
-	void (*set_curr_task) (struct rq *rq);
 	void (*task_tick) (struct rq *rq, struct task_struct *p);
 	void (*task_new) (struct rq *rq, struct task_struct *p);
 };
diff --git a/kernel/sched.c b/kernel/sched.c
index 361fad8..bc1a625 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3910,8 +3910,8 @@ EXPORT_SYMBOL(sleep_on_timeout);
  */
 void rt_mutex_setprio(struct task_struct *p, int prio)
 {
-	int oldprio, on_rq, running;
 	unsigned long flags;
+	int oldprio, on_rq;
 	struct rq *rq;
 
 	BUG_ON(prio < 0 || prio > MAX_PRIO);
@@ -3921,12 +3921,8 @@ void rt_mutex_setprio(struct task_struct *p, int prio)
 
 	oldprio = p->prio;
 	on_rq = p->se.on_rq;
-	running = task_running(rq, p);
-	if (on_rq) {
+	if (on_rq)
 		dequeue_task(rq, p, 0);
-		if (running)
-			p->sched_class->put_prev_task(rq, p);
-	}
 
 	if (rt_prio(prio))
 		p->sched_class = &rt_sched_class;
@@ -3936,15 +3932,13 @@ void rt_mutex_setprio(struct task_struct *p, int prio)
 	p->prio = prio;
 
 	if (on_rq) {
-		if (running)
-			p->sched_class->set_curr_task(rq);
 		enqueue_task(rq, p, 0);
 		/*
 		 * Reschedule if we are currently running on this runqueue and
 		 * our priority decreased, or if we are not currently running on
 		 * this runqueue and our priority is higher than the current's
 		 */
-		if (running) {
+		if (task_running(rq, p)) {
 			if (p->prio > oldprio)
 				resched_task(rq->curr);
 		} else {
@@ -4150,7 +4144,7 @@ __setscheduler(struct rq *rq, struct task_struct *p, int policy, int prio)
 int sched_setscheduler(struct task_struct *p, int policy,
 		       struct sched_param *param)
 {
-	int retval, oldprio, oldpolicy = -1, on_rq, running;
+	int retval, oldprio, oldpolicy = -1, on_rq;
 	unsigned long flags;
 	struct rq *rq;
 
@@ -4232,24 +4226,20 @@ recheck:
 	}
 	update_rq_clock(rq);
 	on_rq = p->se.on_rq;
-	running = task_running(rq, p);
-	if (on_rq) {
+	if (on_rq)
 		deactivate_task(rq, p, 0);
-		if (running)
-			p->sched_class->put_prev_task(rq, p);
-	}
+
 	oldprio = p->prio;
 	__setscheduler(rq, p, policy, param->sched_priority);
+
 	if (on_rq) {
-		if (running)
-			p->sched_class->set_curr_task(rq);
 		activate_task(rq, p, 0);
 		/*
 		 * Reschedule if we are currently running on this runqueue and
 		 * our priority decreased, or if we are not currently running on
 		 * this runqueue and our priority is higher than the current's
 		 */
-		if (running) {
+		if (task_running(rq, p)) {
 			if (p->prio > oldprio)
 				resched_task(rq->curr);
 		} else {
@@ -6853,19 +6843,13 @@ static void sched_move_task(struct container_subsys *ss, struct container *cont,
 	running = task_running(rq, tsk);
 	on_rq = tsk->se.on_rq;
 
-	if (on_rq) {
+	if (on_rq)
 		dequeue_task(rq, tsk, 0);
-		if (unlikely(running))
-			tsk->sched_class->put_prev_task(rq, tsk);
-	}
 
 	set_task_cfs_rq(tsk);
 
-	if (on_rq) {
-		if (unlikely(running))
-			tsk->sched_class->set_curr_task(rq);
+	if (on_rq)
 		enqueue_task(rq, tsk, 0);
-	}
 
 done:
 	task_rq_unlock(rq, &flags);
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index e65af8c..6539377 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -475,9 +475,20 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
 }
 
 static void
-enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int wakeup)
+enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
+	       int wakeup, int set_curr)
 {
 	/*
+	 * In case of the 'current'.
+	 */
+	if (unlikely(set_curr)) {
+		update_stats_curr_start(cfs_rq, se);
[git] CFS-devel, group scheduler, fixes
(2)

the 'p' (task_struct) parameter in the sched_class::yield_task() is
redundant as the caller is always the 'current'. Get rid of it.

Signed-off-by: Dmitry Adamushko <[EMAIL PROTECTED]>

---

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9fd936f..3728cd6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -857,7 +857,7 @@ struct sched_class {
 	void (*enqueue_task) (struct rq *rq, struct task_struct *p, int wakeup);
 	void (*dequeue_task) (struct rq *rq, struct task_struct *p, int sleep);
-	void (*yield_task) (struct rq *rq, struct task_struct *p);
+	void (*yield_task) (struct rq *rq);
 
 	void (*check_preempt_curr) (struct rq *rq, struct task_struct *p);
diff --git a/kernel/sched.c b/kernel/sched.c
index 046dae1..361fad8 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -4535,7 +4535,7 @@ asmlinkage long sys_sched_yield(void)
 	if (unlikely(rq->nr_running == 1))
 		schedstat_inc(rq, yld_act_empty);
 	else
-		current->sched_class->yield_task(rq, current);
+		current->sched_class->yield_task(rq);
 
 	/*
 	 * Since we are going to call schedule() anyway, there's
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 5a244e2..9b982ef 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -723,10 +723,12 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int sleep)
 /*
  * sched_yield() support is very simple - we dequeue and enqueue
  */
-static void yield_task_fair(struct rq *rq, struct task_struct *p)
+static void yield_task_fair(struct rq *rq)
 {
+	struct task_struct *curr = rq->curr;
+	struct cfs_rq *cfs_rq = task_cfs_rq(curr);
+
 	if (!sysctl_sched_yield_bug_workaround) {
-		struct cfs_rq *cfs_rq = task_cfs_rq(p);
 		__update_rq_clock(rq);
 
 		/*
@@ -737,7 +739,6 @@ static void yield_task_fair(struct rq *rq, struct task_struct *p)
 	}
 
 	if (sysctl_sched_yield_bug_workaround == 1) {
-		struct cfs_rq *cfs_rq = task_cfs_rq(p);
 		struct sched_entity *next;
 
 		/*
@@ -757,7 +758,7 @@ static void yield_task_fair(struct rq *rq, struct task_struct *p)
 		/*
 		 * Minimally necessary key value to be the second in the tree:
 		 */
-		p->se.vruntime = next->vruntime +
+		curr->se.vruntime = next->vruntime +
 			sysctl_sched_yield_granularity;
 
 		/*
@@ -770,7 +771,7 @@ static void yield_task_fair(struct rq *rq, struct task_struct *p)
 	/*
 	 * Just reschedule, do nothing else:
 	 */
-	resched_task(p);
+	resched_task(curr);
 }
 
 /*
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 45b339f..b86944c 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -59,9 +59,9 @@ static void requeue_task_rt(struct rq *rq, struct task_struct *p)
 }
 
 static void
-yield_task_rt(struct rq *rq, struct task_struct *p)
+yield_task_rt(struct rq *rq)
 {
-	requeue_task_rt(rq, p);
+	requeue_task_rt(rq, rq->curr);
 }
 
 /*

---
Dmitry
Re: [git] CFS-devel, group scheduler, fixes
[ well, don't expect to find here anything like RDCFS (no, 'D' does
  not stand for 'dumb'!). I was focused on more prosaic things in the
  mean time so just didn't have time for writing it.. ]

here is a few cleanup/simplification/optimization(s) based on the
recent modifications in the sched-dev tree.

(1) optimize task_new_fair()
(2) simplify yield_task()
(3) rework enqueue/dequeue_entity() to get rid of
    sched_class::set_curr_task()

additionally, the changes somewhat decrease code size:

   text	   data	    bss	    dec	    hex	filename
  43538	   5398	     48	  48984	   bf58	build/kernel/sched.o.before
  43250	   5390	     48	  48688	   be30	build/kernel/sched.o

(SMP + lots of debugging options but, I guess, in this case the diff
should remain visible for any combination).

---

(1)

due to the fact that we no longer keep the 'current' within the tree,
dequeue/enqueue_entity() is useless for the 'current' in
task_new_fair(). We are about to reschedule and
sched_class->put_prev_task() will put the 'current' back into the
tree, based on its new key.

Signed-off-by: Dmitry Adamushko <[EMAIL PROTECTED]>

---

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 6e52d5a..5a244e2 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -969,10 +969,11 @@ static void task_new_fair(struct rq *rq, struct task_struct *p)
 
 	if (sysctl_sched_child_runs_first &&
 	    curr->vruntime < se->vruntime) {
-
-		dequeue_entity(cfs_rq, curr, 0);
+		/*
+		 * Upon rescheduling, sched_class::put_prev_task() will place
+		 * 'current' within the tree based on its new key value.
+		 */
 		swap(curr->vruntime, se->vruntime);
-		enqueue_entity(cfs_rq, curr, 0);
 	}
 
 	update_stats_enqueue(cfs_rq, se);

---
Dmitry
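The effect of patch (1) can be modeled outside the kernel: since the 'current' task is no longer kept in the rbtree, the child-runs-first case only needs to exchange the two keys, with no dequeue/enqueue of 'current'. A minimal user-space sketch; the struct and function names here are illustrative, not the kernel's:

```c
/* Illustrative stand-in for the scheduling entity's ordering key. */
struct sched_entity {
	unsigned long long vruntime;
};

/*
 * Model of the optimized child-runs-first path in task_new_fair():
 * if the parent ('curr') would otherwise run before the child ('se'),
 * swap their keys. In the real scheduler no re-insertion of 'curr' is
 * needed here, because 'curr' is not in the tree; put_prev_task()
 * later re-inserts it using the new key.
 */
static void child_runs_first(struct sched_entity *curr,
			     struct sched_entity *se)
{
	if (curr->vruntime < se->vruntime) {
		unsigned long long tmp = curr->vruntime;

		curr->vruntime = se->vruntime;
		se->vruntime = tmp;
	}
}
```

If the child is already ahead (its vruntime is smaller), the swap does nothing and both tasks keep their keys.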
[git] CFS-devel, group scheduler, fixes
The latest sched-devel.git tree can be pulled from:

  git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-devel.git

people were busy sending patches, so there's lots of updates since the
first announcement of the cfs-devel.git tree four days ago:

 include/linux/sched.h   |    4
 init/Kconfig            |    9 +
 kernel/sched.c          |  374 +++-
 kernel/sched_debug.c    |   22 +-
 kernel/sched_fair.c     |  191
 kernel/sched_idletask.c |    5
 kernel/sched_rt.c       |    5
 kernel/sysctl.c         |   19 ++
 8 files changed, 552 insertions(+), 77 deletions(-)

the most user-noticeable improvement should be the latency/interactivity
fixes from Mike Galbraith and Peter Zijlstra. The biggest (and most
complex) item is the new group scheduler code from Srivatsa Vaddagiri.
There's also speedups from Dmitry Adamushko and various cleanups and
fixes. Also added in the yield workaround that should address the
regression reported by Antoine Martin.

Changes:

- the biggest item is the addition of the group scheduler from Srivatsa
  Vaddagiri - this is not configurable yet, it depends on
  CONFIG_CONTAINERS. It causes no overhead on !CONFIG_CONTAINERS. This
  code clearly seems mature now, hopefully the container bits go
  upstream in 2.6.24 too. Srivatsa did lots of heavy lifting in the
  past few months, and this final bit of code that moves in all the
  infrastructure changes almost nothing in the core scheduler.

- a triplet of nice simplifications from Dmitry Adamushko. Most notably
  Dmitry got rid of se->fair_key, which shaves 8 bytes off task_struct
  and gives further speedup. Thanks Dmitry!

- continued refinements to the SMP ->vruntime code and timeslicing by
  Peter Zijlstra and Mike Galbraith.

As usual, bugreports, fixes and suggestions are welcome and please
holler if some patch went missing in action.

	Ingo

-->
Dmitry Adamushko (3):
      sched: clean up struct load_stat
      sched: clean up schedstat block in dequeue_entity()
      sched: optimize away ->fair_key

Matthias Kaehlcke (1):
      sched: use list_for_each_entry_safe() in __wake_up_common()

Mike Galbraith (1):
      sched: fix SMP migration latencies

Peter Zijlstra (7):
      sched: simplify SCHED_FEAT_* code
      sched: new task placement for vruntime
      sched: simplify adaptive latency
      sched: clean up new task placement
      sched: add tree based averages
      sched: handle vruntime overflow
      sched: better min_vruntime tracking

Srivatsa Vaddagiri (1):
      sched: group-scheduler core

Ingo Molnar (27):
      sched: fix new-task method
      sched: resched task in task_new_fair()
      sched: small sched_debug cleanup
      sched: debug: track maximum 'slice'
      sched: uniform tunings
      sched: use constants if !CONFIG_SCHED_DEBUG
      sched: remove stat_gran
      sched: remove precise CPU load
      sched: remove precise CPU load calculations #2
      sched: track cfs_rq->curr on !group-scheduling too
      sched: cleanup: simplify cfs_rq_curr() methods
      sched: uninline __enqueue_entity()/__dequeue_entity()
      sched: speed up update_load_add/_sub()
      sched: clean up calc_weighted()
      sched: introduce se->vruntime
      sched: move sched_feat() definitions
      sched: optimize vruntime based scheduling
      sched: simplify check_preempt() methods
      sched: wakeup granularity fix
      sched: add se->vruntime debugging
      sched: sync up ->min_vruntime when going idle
      sched: add more vruntime statistics
      sched: debug: update exec_clock only when SCHED_DEBUG
      sched: remove wait_runtime limit
      sched: remove wait_runtime fields and features
      sched: x86: allow single-depth wchan output
      sched: yield workaround

 arch/i386/Kconfig       |   11
 include/linux/sched.h   |   21 -
 init/Kconfig            |    9
 kernel/sched.c          |  558 +
 kernel/sched_debug.c    |  100 +++---
 kernel/sched_fair.c     |  716 +++-
 kernel/sched_idletask.c |    5
 kernel/sched_rt.c       |    5
 kernel/sysctl.c         |   33 +-
 9 files changed, 765 insertions(+), 693 deletions(-)
[git] CFS-devel, group scheduler, fixes
The latest sched-devel.git tree can be pulled from: git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-devel.git people were busy sending patches, so there's lots of updates since the first announcement of the cfs-devel.git tree four days ago: include/linux/sched.h |4 init/Kconfig|9 + kernel/sched.c | 374 +++- kernel/sched_debug.c| 22 +- kernel/sched_fair.c | 191 kernel/sched_idletask.c |5 kernel/sched_rt.c |5 kernel/sysctl.c | 19 ++ 8 files changed, 552 insertions(+), 77 deletions(-) the most user-noticeable improvement should be the latency/interactivity fixes from Mike Galbraith and Peter Zijlstra. The biggest (and most complex) item is the new group scheduler code from Srivatsa Vaddagiri. There's also speedups from Dmitry Adamushko and various cleanups and fixes. Also added in the yield workaround that should address the regression reported by Antoine Martin. Changes: - the biggest item is the addition of the group scheduler from Srivatsa Vaddagiri - this is not configurable yet, it depends on CONFIG_CONTAINERS. It causes no overhead on !CONFIG_CONTAINERS. This code clearly seems mature now, hopefully the container bits go upstream in 2.6.24 too. Srivatsa did lots of heavy lifting in the past few months, and this final bit of code that moves in all the infrastructure changes almost nothing in the core scheduler. - a triplet of nice simplifications from Dmitry Adamushko. Most notably Dmitry got rid of se-fair_key which shaves 8 bytes of task_struct and gives further speedup. Thanks Dmitry! - continued refinements to the SMP -vruntime code and timeslicing by Peter Zijstra and Mike Galbraith. As usual, bugreports, fixes and suggestions are welcome and please holler if some patch went missing in action. 
	Ingo

--
Dmitry Adamushko (3):
      sched: clean up struct load_stat
      sched: clean up schedstat block in dequeue_entity()
      sched: optimize away ->fair_key

Matthias Kaehlcke (1):
      sched: use list_for_each_entry_safe() in __wake_up_common()

Mike Galbraith (1):
      sched: fix SMP migration latencies

Peter Zijlstra (7):
      sched: simplify SCHED_FEAT_* code
      sched: new task placement for vruntime
      sched: simplify adaptive latency
      sched: clean up new task placement
      sched: add tree based averages
      sched: handle vruntime overflow
      sched: better min_vruntime tracking

Srivatsa Vaddagiri (1):
      sched: group-scheduler core

Ingo Molnar (27):
      sched: fix new-task method
      sched: resched task in task_new_fair()
      sched: small sched_debug cleanup
      sched: debug: track maximum 'slice'
      sched: uniform tunings
      sched: use constants if !CONFIG_SCHED_DEBUG
      sched: remove stat_gran
      sched: remove precise CPU load
      sched: remove precise CPU load calculations #2
      sched: track cfs_rq->curr on !group-scheduling too
      sched: cleanup: simplify cfs_rq_curr() methods
      sched: uninline __enqueue_entity()/__dequeue_entity()
      sched: speed up update_load_add/_sub()
      sched: clean up calc_weighted()
      sched: introduce se->vruntime
      sched: move sched_feat() definitions
      sched: optimize vruntime based scheduling
      sched: simplify check_preempt() methods
      sched: wakeup granularity fix
      sched: add se->vruntime debugging
      sched: sync up ->min_vruntime when going idle
      sched: add more vruntime statistics
      sched: debug: update exec_clock only when SCHED_DEBUG
      sched: remove wait_runtime limit
      sched: remove wait_runtime fields and features
      sched: x86: allow single-depth wchan output
      sched: yield workaround

 arch/i386/Kconfig       |   11
 include/linux/sched.h   |   21 -
 init/Kconfig            |    9
 kernel/sched.c          |  558 +
 kernel/sched_debug.c    |  100 +++---
 kernel/sched_fair.c     |  716 +++-
 kernel/sched_idletask.c |    5
 kernel/sched_rt.c       |    5
 kernel/sysctl.c         |   33 +-
 9 files changed, 765 insertions(+), 693 deletions(-)

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the
body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/