Re: Make yield_task_fair more efficient
Jens Axboe wrote: > On Thu, Feb 21 2008, Jens Axboe wrote: >> On Thu, Feb 21 2008, Peter Zijlstra wrote: >>> On Thu, 2008-02-21 at 15:37 +0530, Balbir Singh wrote: >>> You use the empty pointer (missing right child), so why do we need a list. May be I am missing something. >>> A fully threaded tree also has back-pointer to traverse backwards >>> through the ordered elements. >>> >>> That said, overloading the right child pointer might not be the best >>> thing for the linux kernel, as it will impact all the rb-tree lookups >>> which are open-coded and often performance critical (this is the reason >>> the colour isn't bit encoded in either of the child pointers either). >>> >>> But if you only want a uni directional thread, I guess we can stick it >>> in the unsigned long we use for the node colour. >>> >>> Still, perhaps it's worth it to grow rb_node to 4 words and do the fully >>> threaded thing as there are also a lot of rb_prev() users in the kernel. >>> Who knows.. >>> >>> Anyway, I agree that improving rb_next() is worth looking into for the >>> scheduler. >> For the IO scheduler as well, it's used quite extensively! So speeding >> up rb_next() would definitely help, as it's typically invoked for every >> bio queued (attempting to back merge with the next request). CFQ and AS >> additionally does an rb_next() and rb_prev() when trying to decide which >> request to do next. > > One possible course of action to implement this without eating extra > space in the rb_node would be: > > - Add rb_right() and rb_set_right() (plus ditto _left variants) to > rbtree.h > - Convert all in-kernel users to use these. Quite extensive, as the > rbtree code search/insert functions are coded in situ and not in > rbtree.[ch] > - Now we can overload bit 0 of ->rb_right and ->rb_left to indicate > whether this is a node or thread pointer and modify rbtree.c to tag > and add the thread links when appropriate. > Exactly along the lines I was thinking of.and discussing with David. > So we can definitely do this in a compatible fashion. Given that I have > a flight coming up in a few days time, I may give it a got if no one > beats me to it :-) > Feel free to do so, please do keep me on the cc. I am very interested in getting rb threaded trees done, but my bandwidth is a little limited this month. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Make yield_task_fair more efficient
On Thu, Feb 21 2008, Jens Axboe wrote: > On Thu, Feb 21 2008, Peter Zijlstra wrote: > > > > On Thu, 2008-02-21 at 15:37 +0530, Balbir Singh wrote: > > > > > You use the empty pointer (missing right child), so why do we need a > > > list. May > > > be I am missing something. > > > > A fully threaded tree also has back-pointer to traverse backwards > > through the ordered elements. > > > > That said, overloading the right child pointer might not be the best > > thing for the linux kernel, as it will impact all the rb-tree lookups > > which are open-coded and often performance critical (this is the reason > > the colour isn't bit encoded in either of the child pointers either). > > > > But if you only want a uni directional thread, I guess we can stick it > > in the unsigned long we use for the node colour. > > > > Still, perhaps it's worth it to grow rb_node to 4 words and do the fully > > threaded thing as there are also a lot of rb_prev() users in the kernel. > > Who knows.. > > > > Anyway, I agree that improving rb_next() is worth looking into for the > > scheduler. > > For the IO scheduler as well, it's used quite extensively! So speeding > up rb_next() would definitely help, as it's typically invoked for every > bio queued (attempting to back merge with the next request). CFQ and AS > additionally does an rb_next() and rb_prev() when trying to decide which > request to do next. One possible course of action to implement this without eating extra space in the rb_node would be: - Add rb_right() and rb_set_right() (plus ditto _left variants) to rbtree.h - Convert all in-kernel users to use these. Quite extensive, as the rbtree code search/insert functions are coded in situ and not in rbtree.[ch] - Now we can overload bit 0 of ->rb_right and ->rb_left to indicate whether this is a node or thread pointer and modify rbtree.c to tag and add the thread links when appropriate. So we can definitely do this in a compatible fashion. Given that I have a flight coming up in a few days time, I may give it a got if no one beats me to it :-) -- Jens Axboe -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Make yield_task_fair more efficient
On Thu, Feb 21 2008, Peter Zijlstra wrote: > > On Thu, 2008-02-21 at 15:37 +0530, Balbir Singh wrote: > > > You use the empty pointer (missing right child), so why do we need a list. > > May > > be I am missing something. > > A fully threaded tree also has back-pointer to traverse backwards > through the ordered elements. > > That said, overloading the right child pointer might not be the best > thing for the linux kernel, as it will impact all the rb-tree lookups > which are open-coded and often performance critical (this is the reason > the colour isn't bit encoded in either of the child pointers either). > > But if you only want a uni directional thread, I guess we can stick it > in the unsigned long we use for the node colour. > > Still, perhaps it's worth it to grow rb_node to 4 words and do the fully > threaded thing as there are also a lot of rb_prev() users in the kernel. > Who knows.. > > Anyway, I agree that improving rb_next() is worth looking into for the > scheduler. For the IO scheduler as well, it's used quite extensively! So speeding up rb_next() would definitely help, as it's typically invoked for every bio queued (attempting to back merge with the next request). CFQ and AS additionally does an rb_next() and rb_prev() when trying to decide which request to do next. -- Jens Axboe -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Make yield_task_fair more efficient
On Thu, 21 February 2008 12:21:33 +0530, Balbir Singh wrote: > > For a large number of tasks - say 1, we need to walk 14 levels before we 16.7, actually. rbtrees are balanced treed, but they are not balanced binary trees. The average fan-out is sqrt(3) instead of 2. Jörn -- The cost of changing business rules is much more expensive for software than for a secretaty. -- unknown -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Make yield_task_fair more efficient
Mike Galbraith wrote: > Hi, > > On Thu, 2008-02-21 at 15:01 +0530, Balbir Singh wrote: >> Ingo Molnar wrote: >>> * Balbir Singh <[EMAIL PROTECTED]> wrote: >>> If you insist that sched_yield() is bad, I might agree, but how does my patch make things worse. [...] >>> it puts new instructions into the hotpath. >>> [...] In my benchmarks, it has helped the sched_yield case, why is that bad? [...] >>> I had the same cache for the rightmost task in earlier CFS (it's a >>> really obvious thing) but removed it. It wasnt a bad idea, but it hurt >>> the fastpath hence i removed it. Algorithms and implementations are a >>> constant balancing act. >> This is more convincing, was the code ever in git? How did you measure the >> overhead? > > Counting enqueue/dequeue cycles on my 3GHz P4/HT running a 60 seconds > netperf test that does ~85k/s context switches shows: > > sched_cycles: 7198444348 unpatched > vs > sched_cycles: 8574036268 patched Thanks for the numbers! I am very convinced that the patch should stay out until we can find a way to reduce the overhead. I'll try your patch and see what the numbers look like as well. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Make yield_task_fair more efficient
On Thu, 2008-02-21 at 13:02 +0100, Mike Galbraith wrote: > sched_cycles: 7198444348 unpatched > vs > sched_cycles: 8574036268 patched (in case it's not clear, patched means your patch, not my quick/dirty countem patch:) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Make yield_task_fair more efficient
Hi, On Thu, 2008-02-21 at 15:01 +0530, Balbir Singh wrote: > Ingo Molnar wrote: > > * Balbir Singh <[EMAIL PROTECTED]> wrote: > > > >> If you insist that sched_yield() is bad, I might agree, but how does > >> my patch make things worse. [...] > > > > it puts new instructions into the hotpath. > > > >> [...] In my benchmarks, it has helped the sched_yield case, why is > >> that bad? [...] > > > > I had the same cache for the rightmost task in earlier CFS (it's a > > really obvious thing) but removed it. It wasnt a bad idea, but it hurt > > the fastpath hence i removed it. Algorithms and implementations are a > > constant balancing act. > > This is more convincing, was the code ever in git? How did you measure the > overhead? Counting enqueue/dequeue cycles on my 3GHz P4/HT running a 60 seconds netperf test that does ~85k/s context switches shows: sched_cycles: 7198444348 unpatched vs sched_cycles: 8574036268 patched -Mike diff --git a/include/linux/sched.h b/include/linux/sched.h index e217d18..a71edfe 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1262,6 +1262,7 @@ struct task_struct { int latency_record_count; struct latency_record latency_record[LT_SAVECOUNT]; #endif + unsigned long long sched_cycles; }; /* diff --git a/kernel/exit.c b/kernel/exit.c index 506a957..c5c3d5e 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -174,6 +174,8 @@ repeat: } write_unlock_irq(_lock); + if (!strcmp("netperf", p->comm) || !strcmp("netserver", p->comm)) + printk(KERN_WARNING "%s:%d sched_cycles: %Ld\n", p->comm, p->pid, p->sched_cycles); release_thread(p); call_rcu(>rcu, delayed_put_task_struct); diff --git a/kernel/sched.c b/kernel/sched.c index f28f19e..d7c1b0b 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -1301,15 +1301,25 @@ static void set_load_weight(struct task_struct *p) static void enqueue_task(struct rq *rq, struct task_struct *p, int wakeup) { + unsigned long long then, now; + + rdtscll(then); sched_info_queued(p); p->sched_class->enqueue_task(rq, p, wakeup); p->se.on_rq = 1; + rdtscll(now); + p->sched_cycles += now - then; } static void dequeue_task(struct rq *rq, struct task_struct *p, int sleep) { + unsigned long long then, now; + + rdtscll(then); p->sched_class->dequeue_task(rq, p, sleep); p->se.on_rq = 0; + rdtscll(now); + p->sched_cycles += now - then; } /* @@ -2009,6 +2019,7 @@ void wake_up_new_task(struct task_struct *p, unsigned long clone_flags) update_rq_clock(rq); p->prio = effective_prio(p); + p->sched_cycles = 0; if (!p->sched_class->task_new || !current->se.on_rq) { activate_task(rq, p, 0); -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Make yield_task_fair more efficient
Peter Zijlstra wrote: > On Thu, 2008-02-21 at 15:37 +0530, Balbir Singh wrote: > >> You use the empty pointer (missing right child), so why do we need a list. >> May >> be I am missing something. > > A fully threaded tree also has back-pointer to traverse backwards > through the ordered elements. > The back pointer is implemented using missing left children. I am referring to Don Knuth's volume 1: The art of computer programming, section 2.3.1, page 322. Please see the threaded representation of the tree, it is compared with the unthreaded representation. > That said, overloading the right child pointer might not be the best > thing for the linux kernel, as it will impact all the rb-tree lookups > which are open-coded and often performance critical (this is the reason > the colour isn't bit encoded in either of the child pointers either). > We always look at the child, irrespective of whether it is to the right or left to terminate our walk. Overloading them and setting a bit stating it is threaded should not be that bad. > But if you only want a uni directional thread, I guess we can stick it > in the unsigned long we use for the node colour. > > Still, perhaps it's worth it to grow rb_node to 4 words and do the fully > threaded thing as there are also a lot of rb_prev() users in the kernel. > Who knows.. > Why does the rb_node need to grow? We can encode the bits in the children > Anyway, I agree that improving rb_next() is worth looking into for the > scheduler. Sure, will experiment and see if we can bump up performance. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Make yield_task_fair more efficient
On Thu, 2008-02-21 at 15:37 +0530, Balbir Singh wrote: > You use the empty pointer (missing right child), so why do we need a list. May > be I am missing something. A fully threaded tree also has back-pointer to traverse backwards through the ordered elements. That said, overloading the right child pointer might not be the best thing for the linux kernel, as it will impact all the rb-tree lookups which are open-coded and often performance critical (this is the reason the colour isn't bit encoded in either of the child pointers either). But if you only want a uni directional thread, I guess we can stick it in the unsigned long we use for the node colour. Still, perhaps it's worth it to grow rb_node to 4 words and do the fully threaded thing as there are also a lot of rb_prev() users in the kernel. Who knows.. Anyway, I agree that improving rb_next() is worth looking into for the scheduler. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Make yield_task_fair more efficient
Balbir Singh wrote: > Ingo Molnar wrote: >> * Balbir Singh <[EMAIL PROTECTED]> wrote: >> >>> If you insist that sched_yield() is bad, I might agree, but how does >>> my patch make things worse. [...] >> it puts new instructions into the hotpath. >> >>> [...] In my benchmarks, it has helped the sched_yield case, why is >>> that bad? [...] >> I had the same cache for the rightmost task in earlier CFS (it's a >> really obvious thing) but removed it. It wasnt a bad idea, but it hurt >> the fastpath hence i removed it. Algorithms and implementations are a >> constant balancing act. > > This is more convincing, was the code ever in git? How did you measure the > overhead? What are your plans for reports with regressions where > kernel.compat_sched_yield is set to 1? > Ingo, I was looking through nptl/sysdeps/unix/sysv/linux/pthread_yield.c in the glibc sources and it looks like pthread_yield() calls sched_yield(). Applications using compat_sched_yield and pthread_yield() are also going to be impacted. I searched for pthread_yield and sched_yield using google codesearch to look at the applications that use these routines, the search list is too big for me to mark applications as candidates for potential improvement. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Make yield_task_fair more efficient
Peter Zijlstra wrote: > On Thu, 2008-02-21 at 15:12 +0530, Balbir Singh wrote: >> Peter Zijlstra wrote: >>> On Thu, 2008-02-21 at 15:01 +0530, Balbir Singh wrote: >>> I have an alternate approach in mind (that I need to find time for), threaded-rbtrees. Walking the tree is really efficient, specially finding successor of a node. >>> Threading the rbtrees would be even more expensive, it would require a >>> list_head in each node and a full list operation for every tree >>> operation. >>> >> Peter, when I say threaded, I don't mean threaded as in tasks. Please see >> http://www.nist.gov/dads/HTML/threadedtree.html and >> http://datastructures.itgo.com/trees/tbt.htm > > I'm quite aware of the meaning of threaded in the context of data > structures. Oh! I did not mean to assume anything. Just wanted to make sure we are talking of the same thing They usually require additional list data and operations to > implement. > >From the definition A Threaded Binary Tree is a binary tree in which every node that does not have a right child has a THREAD (in actual sense, a link) to its INORDER successor. You use the empty pointer (missing right child), so why do we need a list. May be I am missing something. > Look at how the latter link you provided grows the rb_node with one > pointer to provide a single linked list. > Yes, that's a bad example I suppose. Look at http://www.cs.rutgers.edu/~kaplan/503/thread.html -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Make yield_task_fair more efficient
On Thu, 2008-02-21 at 15:12 +0530, Balbir Singh wrote: > Peter Zijlstra wrote: > > On Thu, 2008-02-21 at 15:01 +0530, Balbir Singh wrote: > > > >> I have an alternate approach in mind (that I need to find time for), > >> threaded-rbtrees. Walking the tree is really efficient, specially finding > >> successor of a node. > > > > Threading the rbtrees would be even more expensive, it would require a > > list_head in each node and a full list operation for every tree > > operation. > > > > Peter, when I say threaded, I don't mean threaded as in tasks. Please see > http://www.nist.gov/dads/HTML/threadedtree.html and > http://datastructures.itgo.com/trees/tbt.htm I'm quite aware of the meaning of threaded in the context of data structures. They usually require additional list data and operations to implement. Look at how the latter link you provided grows the rb_node with one pointer to provide a single linked list. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Make yield_task_fair more efficient
Ingo Molnar wrote: > * Balbir Singh <[EMAIL PROTECTED]> wrote: > You did not answer some of my earlier questions. >> I have an alternate approach in mind (that I need to find time for), >> threaded-rbtrees. Walking the tree is really efficient, specially >> finding successor of a node. > > sure, feel free to experiment with those details. > > But if you want to improve Java workloads then you should probably start > by converting them to futexes instead of sched_yield(), that will > probably give far more performance than micro-optimizing the > sys_sched_yield() codepath. (especially when it happens at the expense > of other workloads, which is not acceptable for mainline) You can > rebuild your JVM easily and re-test with its locking fixed, right? No.. I don't want to optimize the JVM or rebuild it. I don't have access to any JVM code either. I wanted to do the threaded rb-trees for the case where we use spend time finding the successor using rb_next() (walking the tree). -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Make yield_task_fair more efficient
Peter Zijlstra wrote: > On Thu, 2008-02-21 at 15:01 +0530, Balbir Singh wrote: > >> I have an alternate approach in mind (that I need to find time for), >> threaded-rbtrees. Walking the tree is really efficient, specially finding >> successor of a node. > > Threading the rbtrees would be even more expensive, it would require a > list_head in each node and a full list operation for every tree > operation. > Peter, when I say threaded, I don't mean threaded as in tasks. Please see http://www.nist.gov/dads/HTML/threadedtree.html and http://datastructures.itgo.com/trees/tbt.htm -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Make yield_task_fair more efficient
On Thu, 2008-02-21 at 15:01 +0530, Balbir Singh wrote: > I have an alternate approach in mind (that I need to find time for), > threaded-rbtrees. Walking the tree is really efficient, specially finding > successor of a node. Threading the rbtrees would be even more expensive, it would require a list_head in each node and a full list operation for every tree operation. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Make yield_task_fair more efficient
* Balbir Singh <[EMAIL PROTECTED]> wrote: > I have an alternate approach in mind (that I need to find time for), > threaded-rbtrees. Walking the tree is really efficient, specially > finding successor of a node. sure, feel free to experiment with those details. But if you want to improve Java workloads then you should probably start by converting them to futexes instead of sched_yield(), that will probably give far more performance than micro-optimizing the sys_sched_yield() codepath. (especially when it happens at the expense of other workloads, which is not acceptable for mainline) You can rebuild your JVM easily and re-test with its locking fixed, right? Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Make yield_task_fair more efficient
Ingo Molnar wrote: > * Balbir Singh <[EMAIL PROTECTED]> wrote: > >> If you insist that sched_yield() is bad, I might agree, but how does >> my patch make things worse. [...] > > it puts new instructions into the hotpath. > >> [...] In my benchmarks, it has helped the sched_yield case, why is >> that bad? [...] > > I had the same cache for the rightmost task in earlier CFS (it's a > really obvious thing) but removed it. It wasnt a bad idea, but it hurt > the fastpath hence i removed it. Algorithms and implementations are a > constant balancing act. This is more convincing, was the code ever in git? How did you measure the overhead? What are your plans for reports with regressions where kernel.compat_sched_yield is set to 1? I have an alternate approach in mind (that I need to find time for), threaded-rbtrees. Walking the tree is really efficient, specially finding successor of a node. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Make yield_task_fair more efficient
* Balbir Singh <[EMAIL PROTECTED]> wrote: > If you insist that sched_yield() is bad, I might agree, but how does > my patch make things worse. [...] it puts new instructions into the hotpath. > [...] In my benchmarks, it has helped the sched_yield case, why is > that bad? [...] I had the same cache for the rightmost task in earlier CFS (it's a really obvious thing) but removed it. It wasnt a bad idea, but it hurt the fastpath hence i removed it. Algorithms and implementations are a constant balancing act. Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Make yield_task_fair more efficient
Peter Zijlstra wrote: > On Thu, 2008-02-21 at 13:09 +0530, Balbir Singh wrote: > >> sched_yield() is supported API > > For SCHED_FIFO/SCHED_RR. > >> and also look at http://lkml.org/lkml/2007/9/19/351. > > Read on (http://lkml.org/lkml/2007/9/19/371) and find: > > The sched_yield() behaviour is actually very well-defined for RT > tasks (now, whether it's a good interface to use or not is still > open to debate, but at least it's a _defined_ interface ;), and > I think we should at least try to approximate that behaviour for > normal tasks, even if they aren't RT. > > sched_yield() just isn't a very useful API except in RT and even there > one can question it. > But it's being used very heavily by existing applications. Can we break/penalize them because we think it's not a good interface or a method to program? >> I am trying to make sched_yield() efficient >> when compat_sched_yield is turned on (which is most likely), since people >> will >> want that behaviour (Hint, please read the man-page for sched_yield). > > I did, so what? read the POSIX spec - its just plain undefined for > SCHED_OTHER. We already do something 'sensible' for those few broken > applications relying on unspecified behaviour. > >> There are >> already several applications using sched_yield(), so they all suffer. > > Who is using it, where is their source so we can show its faster to not > use it? > > Really, hiding behind closed sores doesn't make it good. > > I am not hiding behind closed source. Please refer to the regression reported by Yanmin where he used compat_sched_yield=1 (see http://lkml.org/lkml/2008/2/18/7). The regression might be due to other reasons, but the point is that people are using compat_sched_yield=1 If you insist that sched_yield() is bad, I might agree, but how does my patch make things worse. In my benchmarks, it has helped the sched_yield case, why is that bad? As far as touching the hot-path is concerned, give me a benchmark that I can run on my desktop, that will convince you that the overhead of the patch is not all that high. If there is an overhead that is very visible, I'll stop pushing the patch. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Make yield_task_fair more efficient
On Thu, 2008-02-21 at 13:09 +0530, Balbir Singh wrote: > sched_yield() is supported API For SCHED_FIFO/SCHED_RR. > and also look at http://lkml.org/lkml/2007/9/19/351. Read on (http://lkml.org/lkml/2007/9/19/371) and find: The sched_yield() behaviour is actually very well-defined for RT tasks (now, whether it's a good interface to use or not is still open to debate, but at least it's a _defined_ interface ;), and I think we should at least try to approximate that behaviour for normal tasks, even if they aren't RT. sched_yield() just isn't a very useful API except in RT and even there one can question it. > I am trying to make sched_yield() efficient > when compat_sched_yield is turned on (which is most likely), since people will > want that behaviour (Hint, please read the man-page for sched_yield). I did, so what? read the POSIX spec - its just plain undefined for SCHED_OTHER. We already do something 'sensible' for those few broken applications relying on unspecified behaviour. > There are > already several applications using sched_yield(), so they all suffer. Who is using it, where is their source so we can show its faster to not use it? Really, hiding behind closed sores doesn't make it good. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Make yield_task_fair more efficient
On Thu, 2008-02-21 at 13:09 +0530, Balbir Singh wrote: sched_yield() is supported API For SCHED_FIFO/SCHED_RR. and also look at http://lkml.org/lkml/2007/9/19/351. Read on (http://lkml.org/lkml/2007/9/19/371) and find: The sched_yield() behaviour is actually very well-defined for RT tasks (now, whether it's a good interface to use or not is still open to debate, but at least it's a _defined_ interface ;), and I think we should at least try to approximate that behaviour for normal tasks, even if they aren't RT. sched_yield() just isn't a very useful API except in RT and even there one can question it. I am trying to make sched_yield() efficient when compat_sched_yield is turned on (which is most likely), since people will want that behaviour (Hint, please read the man-page for sched_yield). I did, so what? read the POSIX spec - its just plain undefined for SCHED_OTHER. We already do something 'sensible' for those few broken applications relying on unspecified behaviour. There are already several applications using sched_yield(), so they all suffer. Who is using it, where is their source so we can show its faster to not use it? Really, hiding behind closed sores doesn't make it good. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Make yield_task_fair more efficient
Peter Zijlstra wrote: On Thu, 2008-02-21 at 13:09 +0530, Balbir Singh wrote: sched_yield() is supported API For SCHED_FIFO/SCHED_RR. and also look at http://lkml.org/lkml/2007/9/19/351. Read on (http://lkml.org/lkml/2007/9/19/371) and find: The sched_yield() behaviour is actually very well-defined for RT tasks (now, whether it's a good interface to use or not is still open to debate, but at least it's a _defined_ interface ;), and I think we should at least try to approximate that behaviour for normal tasks, even if they aren't RT. sched_yield() just isn't a very useful API except in RT and even there one can question it. But it's being used very heavily by existing applications. Can we break/penalize them because we think it's not a good interface or a method to program? I am trying to make sched_yield() efficient when compat_sched_yield is turned on (which is most likely), since people will want that behaviour (Hint, please read the man-page for sched_yield). I did, so what? read the POSIX spec - its just plain undefined for SCHED_OTHER. We already do something 'sensible' for those few broken applications relying on unspecified behaviour. There are already several applications using sched_yield(), so they all suffer. Who is using it, where is their source so we can show its faster to not use it? Really, hiding behind closed sores doesn't make it good. I am not hiding behind closed source. Please refer to the regression reported by Yanmin where he used compat_sched_yield=1 (see http://lkml.org/lkml/2008/2/18/7). The regression might be due to other reasons, but the point is that people are using compat_sched_yield=1 If you insist that sched_yield() is bad, I might agree, but how does my patch make things worse. In my benchmarks, it has helped the sched_yield case, why is that bad? As far as touching the hot-path is concerned, give me a benchmark that I can run on my desktop, that will convince you that the overhead of the patch is not all that high. If there is an overhead that is very visible, I'll stop pushing the patch. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Make yield_task_fair more efficient
* Balbir Singh [EMAIL PROTECTED] wrote: If you insist that sched_yield() is bad, I might agree, but how does my patch make things worse. [...] it puts new instructions into the hotpath. [...] In my benchmarks, it has helped the sched_yield case, why is that bad? [...] I had the same cache for the rightmost task in earlier CFS (it's a really obvious thing) but removed it. It wasnt a bad idea, but it hurt the fastpath hence i removed it. Algorithms and implementations are a constant balancing act. Ingo -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Make yield_task_fair more efficient
Ingo Molnar wrote: * Balbir Singh [EMAIL PROTECTED] wrote: If you insist that sched_yield() is bad, I might agree, but how does my patch make things worse. [...] it puts new instructions into the hotpath. [...] In my benchmarks, it has helped the sched_yield case, why is that bad? [...] I had the same cache for the rightmost task in earlier CFS (it's a really obvious thing) but removed it. It wasnt a bad idea, but it hurt the fastpath hence i removed it. Algorithms and implementations are a constant balancing act. This is more convincing, was the code ever in git? How did you measure the overhead? What are your plans for reports with regressions where kernel.compat_sched_yield is set to 1? I have an alternate approach in mind (that I need to find time for), threaded-rbtrees. Walking the tree is really efficient, specially finding successor of a node. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Make yield_task_fair more efficient
* Balbir Singh [EMAIL PROTECTED] wrote: I have an alternate approach in mind (that I need to find time for), threaded-rbtrees. Walking the tree is really efficient, specially finding successor of a node. sure, feel free to experiment with those details. But if you want to improve Java workloads then you should probably start by converting them to futexes instead of sched_yield(), that will probably give far more performance than micro-optimizing the sys_sched_yield() codepath. (especially when it happens at the expense of other workloads, which is not acceptable for mainline) You can rebuild your JVM easily and re-test with its locking fixed, right? Ingo -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Make yield_task_fair more efficient
On Thu, 2008-02-21 at 15:01 +0530, Balbir Singh wrote: I have an alternate approach in mind (that I need to find time for), threaded-rbtrees. Walking the tree is really efficient, specially finding successor of a node. Threading the rbtrees would be even more expensive, it would require a list_head in each node and a full list operation for every tree operation. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Make yield_task_fair more efficient
Ingo Molnar wrote: * Balbir Singh [EMAIL PROTECTED] wrote: You did not answer some of my earlier questions. I have an alternate approach in mind (that I need to find time for), threaded-rbtrees. Walking the tree is really efficient, specially finding successor of a node. sure, feel free to experiment with those details. But if you want to improve Java workloads then you should probably start by converting them to futexes instead of sched_yield(), that will probably give far more performance than micro-optimizing the sys_sched_yield() codepath. (especially when it happens at the expense of other workloads, which is not acceptable for mainline) You can rebuild your JVM easily and re-test with its locking fixed, right? No.. I don't want to optimize the JVM or rebuild it. I don't have access to any JVM code either. I wanted to do the threaded rb-trees for the case where we use spend time finding the successor using rb_next() (walking the tree). -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Make yield_task_fair more efficient
Peter Zijlstra wrote: On Thu, 2008-02-21 at 15:01 +0530, Balbir Singh wrote: I have an alternate approach in mind (that I need to find time for), threaded-rbtrees. Walking the tree is really efficient, specially finding successor of a node. Threading the rbtrees would be even more expensive, it would require a list_head in each node and a full list operation for every tree operation. Peter, when I say threaded, I don't mean threaded as in tasks. Please see http://www.nist.gov/dads/HTML/threadedtree.html and http://datastructures.itgo.com/trees/tbt.htm -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Make yield_task_fair more efficient
On Thu, 2008-02-21 at 15:12 +0530, Balbir Singh wrote: Peter Zijlstra wrote: On Thu, 2008-02-21 at 15:01 +0530, Balbir Singh wrote: I have an alternate approach in mind (that I need to find time for), threaded-rbtrees. Walking the tree is really efficient, specially finding successor of a node. Threading the rbtrees would be even more expensive, it would require a list_head in each node and a full list operation for every tree operation. Peter, when I say threaded, I don't mean threaded as in tasks. Please see http://www.nist.gov/dads/HTML/threadedtree.html and http://datastructures.itgo.com/trees/tbt.htm I'm quite aware of the meaning of threaded in the context of data structures. They usually require additional list data and operations to implement. Look at how the latter link you provided grows the rb_node with one pointer to provide a single linked list. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Make yield_task_fair more efficient
Peter Zijlstra wrote: On Thu, 2008-02-21 at 15:12 +0530, Balbir Singh wrote: Peter Zijlstra wrote: On Thu, 2008-02-21 at 15:01 +0530, Balbir Singh wrote: I have an alternate approach in mind (that I need to find time for), threaded-rbtrees. Walking the tree is really efficient, specially finding successor of a node. Threading the rbtrees would be even more expensive, it would require a list_head in each node and a full list operation for every tree operation. Peter, when I say threaded, I don't mean threaded as in tasks. Please see http://www.nist.gov/dads/HTML/threadedtree.html and http://datastructures.itgo.com/trees/tbt.htm I'm quite aware of the meaning of threaded in the context of data structures. Oh! I did not mean to assume anything. Just wanted to make sure we are talking of the same thing They usually require additional list data and operations to implement. From the definition A Threaded Binary Tree is a binary tree in which every node that does not have a right child has a THREAD (in actual sense, a link) to its INORDER successor. You use the empty pointer (missing right child), so why do we need a list. May be I am missing something. Look at how the latter link you provided grows the rb_node with one pointer to provide a single linked list. Yes, that's a bad example I suppose. Look at http://www.cs.rutgers.edu/~kaplan/503/thread.html -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Make yield_task_fair more efficient
Balbir Singh wrote: Ingo Molnar wrote: * Balbir Singh [EMAIL PROTECTED] wrote: If you insist that sched_yield() is bad, I might agree, but how does my patch make things worse. [...] it puts new instructions into the hotpath. [...] In my benchmarks, it has helped the sched_yield case, why is that bad? [...] I had the same cache for the rightmost task in earlier CFS (it's a really obvious thing) but removed it. It wasnt a bad idea, but it hurt the fastpath hence i removed it. Algorithms and implementations are a constant balancing act. This is more convincing, was the code ever in git? How did you measure the overhead? What are your plans for reports with regressions where kernel.compat_sched_yield is set to 1? Ingo, I was looking through nptl/sysdeps/unix/sysv/linux/pthread_yield.c in the glibc sources and it looks like pthread_yield() calls sched_yield(). Applications using compat_sched_yield and pthread_yield() are also going to be impacted. I searched for pthread_yield and sched_yield using google codesearch to look at the applications that use these routines, the search list is too big for me to mark applications as candidates for potential improvement. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Make yield_task_fair more efficient
On Thu, 2008-02-21 at 15:37 +0530, Balbir Singh wrote: You use the empty pointer (missing right child), so why do we need a list. May be I am missing something. A fully threaded tree also has back-pointer to traverse backwards through the ordered elements. That said, overloading the right child pointer might not be the best thing for the linux kernel, as it will impact all the rb-tree lookups which are open-coded and often performance critical (this is the reason the colour isn't bit encoded in either of the child pointers either). But if you only want a uni directional thread, I guess we can stick it in the unsigned long we use for the node colour. Still, perhaps it's worth it to grow rb_node to 4 words and do the fully threaded thing as there are also a lot of rb_prev() users in the kernel. Who knows.. Anyway, I agree that improving rb_next() is worth looking into for the scheduler. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Make yield_task_fair more efficient
Peter Zijlstra wrote: On Thu, 2008-02-21 at 15:37 +0530, Balbir Singh wrote: You use the empty pointer (missing right child), so why do we need a list. May be I am missing something. A fully threaded tree also has back-pointer to traverse backwards through the ordered elements. The back pointer is implemented using missing left children. I am referring to Don Knuth's volume 1: The art of computer programming, section 2.3.1, page 322. Please see the threaded representation of the tree, it is compared with the unthreaded representation. That said, overloading the right child pointer might not be the best thing for the linux kernel, as it will impact all the rb-tree lookups which are open-coded and often performance critical (this is the reason the colour isn't bit encoded in either of the child pointers either). We always look at the child, irrespective of whether it is to the right or left to terminate our walk. Overloading them and setting a bit stating it is threaded should not be that bad. But if you only want a uni directional thread, I guess we can stick it in the unsigned long we use for the node colour. Still, perhaps it's worth it to grow rb_node to 4 words and do the fully threaded thing as there are also a lot of rb_prev() users in the kernel. Who knows.. Why does the rb_node need to grow? We can encode the bits in the children Anyway, I agree that improving rb_next() is worth looking into for the scheduler. Sure, will experiment and see if we can bump up performance. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Make yield_task_fair more efficient
Hi, On Thu, 2008-02-21 at 15:01 +0530, Balbir Singh wrote: Ingo Molnar wrote: * Balbir Singh [EMAIL PROTECTED] wrote: If you insist that sched_yield() is bad, I might agree, but how does my patch make things worse. [...] it puts new instructions into the hotpath. [...] In my benchmarks, it has helped the sched_yield case, why is that bad? [...] I had the same cache for the rightmost task in earlier CFS (it's a really obvious thing) but removed it. It wasnt a bad idea, but it hurt the fastpath hence i removed it. Algorithms and implementations are a constant balancing act. This is more convincing, was the code ever in git? How did you measure the overhead? Counting enqueue/dequeue cycles on my 3GHz P4/HT running a 60 seconds netperf test that does ~85k/s context switches shows: sched_cycles: 7198444348 unpatched vs sched_cycles: 8574036268 patched -Mike diff --git a/include/linux/sched.h b/include/linux/sched.h index e217d18..a71edfe 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1262,6 +1262,7 @@ struct task_struct { int latency_record_count; struct latency_record latency_record[LT_SAVECOUNT]; #endif + unsigned long long sched_cycles; }; /* diff --git a/kernel/exit.c b/kernel/exit.c index 506a957..c5c3d5e 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -174,6 +174,8 @@ repeat: } write_unlock_irq(tasklist_lock); + if (!strcmp(netperf, p-comm) || !strcmp(netserver, p-comm)) + printk(KERN_WARNING %s:%d sched_cycles: %Ld\n, p-comm, p-pid, p-sched_cycles); release_thread(p); call_rcu(p-rcu, delayed_put_task_struct); diff --git a/kernel/sched.c b/kernel/sched.c index f28f19e..d7c1b0b 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -1301,15 +1301,25 @@ static void set_load_weight(struct task_struct *p) static void enqueue_task(struct rq *rq, struct task_struct *p, int wakeup) { + unsigned long long then, now; + + rdtscll(then); sched_info_queued(p); p-sched_class-enqueue_task(rq, p, wakeup); p-se.on_rq = 1; + rdtscll(now); + p-sched_cycles += now - then; } static void dequeue_task(struct rq *rq, struct task_struct *p, int sleep) { + unsigned long long then, now; + + rdtscll(then); p-sched_class-dequeue_task(rq, p, sleep); p-se.on_rq = 0; + rdtscll(now); + p-sched_cycles += now - then; } /* @@ -2009,6 +2019,7 @@ void wake_up_new_task(struct task_struct *p, unsigned long clone_flags) update_rq_clock(rq); p-prio = effective_prio(p); + p-sched_cycles = 0; if (!p-sched_class-task_new || !current-se.on_rq) { activate_task(rq, p, 0); -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Make yield_task_fair more efficient
On Thu, 2008-02-21 at 13:02 +0100, Mike Galbraith wrote: sched_cycles: 7198444348 unpatched vs sched_cycles: 8574036268 patched (in case it's not clear, patched means your patch, not my quick/dirty countem patch:) -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Make yield_task_fair more efficient
Mike Galbraith wrote: Hi, On Thu, 2008-02-21 at 15:01 +0530, Balbir Singh wrote: Ingo Molnar wrote: * Balbir Singh [EMAIL PROTECTED] wrote: If you insist that sched_yield() is bad, I might agree, but how does my patch make things worse. [...] it puts new instructions into the hotpath. [...] In my benchmarks, it has helped the sched_yield case, why is that bad? [...] I had the same cache for the rightmost task in earlier CFS (it's a really obvious thing) but removed it. It wasnt a bad idea, but it hurt the fastpath hence i removed it. Algorithms and implementations are a constant balancing act. This is more convincing, was the code ever in git? How did you measure the overhead? Counting enqueue/dequeue cycles on my 3GHz P4/HT running a 60 seconds netperf test that does ~85k/s context switches shows: sched_cycles: 7198444348 unpatched vs sched_cycles: 8574036268 patched Thanks for the numbers! I am very convinced that the patch should stay out until we can find a way to reduce the overhead. I'll try your patch and see what the numbers look like as well. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Make yield_task_fair more efficient
On Thu, 21 February 2008 12:21:33 +0530, Balbir Singh wrote: For a large number of tasks - say 1, we need to walk 14 levels before we 16.7, actually. rbtrees are balanced treed, but they are not balanced binary trees. The average fan-out is sqrt(3) instead of 2. Jörn -- The cost of changing business rules is much more expensive for software than for a secretaty. -- unknown -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Make yield_task_fair more efficient
On Thu, Feb 21 2008, Peter Zijlstra wrote: On Thu, 2008-02-21 at 15:37 +0530, Balbir Singh wrote: You use the empty pointer (missing right child), so why do we need a list. May be I am missing something. A fully threaded tree also has back-pointer to traverse backwards through the ordered elements. That said, overloading the right child pointer might not be the best thing for the linux kernel, as it will impact all the rb-tree lookups which are open-coded and often performance critical (this is the reason the colour isn't bit encoded in either of the child pointers either). But if you only want a uni directional thread, I guess we can stick it in the unsigned long we use for the node colour. Still, perhaps it's worth it to grow rb_node to 4 words and do the fully threaded thing as there are also a lot of rb_prev() users in the kernel. Who knows.. Anyway, I agree that improving rb_next() is worth looking into for the scheduler. For the IO scheduler as well, it's used quite extensively! So speeding up rb_next() would definitely help, as it's typically invoked for every bio queued (attempting to back merge with the next request). CFQ and AS additionally does an rb_next() and rb_prev() when trying to decide which request to do next. -- Jens Axboe -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Make yield_task_fair more efficient
On Thu, Feb 21 2008, Jens Axboe wrote: On Thu, Feb 21 2008, Peter Zijlstra wrote: On Thu, 2008-02-21 at 15:37 +0530, Balbir Singh wrote: You use the empty pointer (missing right child), so why do we need a list. May be I am missing something. A fully threaded tree also has back-pointer to traverse backwards through the ordered elements. That said, overloading the right child pointer might not be the best thing for the linux kernel, as it will impact all the rb-tree lookups which are open-coded and often performance critical (this is the reason the colour isn't bit encoded in either of the child pointers either). But if you only want a uni directional thread, I guess we can stick it in the unsigned long we use for the node colour. Still, perhaps it's worth it to grow rb_node to 4 words and do the fully threaded thing as there are also a lot of rb_prev() users in the kernel. Who knows.. Anyway, I agree that improving rb_next() is worth looking into for the scheduler. For the IO scheduler as well, it's used quite extensively! So speeding up rb_next() would definitely help, as it's typically invoked for every bio queued (attempting to back merge with the next request). CFQ and AS additionally does an rb_next() and rb_prev() when trying to decide which request to do next. One possible course of action to implement this without eating extra space in the rb_node would be: - Add rb_right() and rb_set_right() (plus ditto _left variants) to rbtree.h - Convert all in-kernel users to use these. Quite extensive, as the rbtree code search/insert functions are coded in situ and not in rbtree.[ch] - Now we can overload bit 0 of -rb_right and -rb_left to indicate whether this is a node or thread pointer and modify rbtree.c to tag and add the thread links when appropriate. So we can definitely do this in a compatible fashion. Given that I have a flight coming up in a few days time, I may give it a got if no one beats me to it :-) -- Jens Axboe -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Make yield_task_fair more efficient
Jens Axboe wrote: On Thu, Feb 21 2008, Jens Axboe wrote: On Thu, Feb 21 2008, Peter Zijlstra wrote: On Thu, 2008-02-21 at 15:37 +0530, Balbir Singh wrote: You use the empty pointer (missing right child), so why do we need a list. May be I am missing something. A fully threaded tree also has back-pointer to traverse backwards through the ordered elements. That said, overloading the right child pointer might not be the best thing for the linux kernel, as it will impact all the rb-tree lookups which are open-coded and often performance critical (this is the reason the colour isn't bit encoded in either of the child pointers either). But if you only want a uni directional thread, I guess we can stick it in the unsigned long we use for the node colour. Still, perhaps it's worth it to grow rb_node to 4 words and do the fully threaded thing as there are also a lot of rb_prev() users in the kernel. Who knows.. Anyway, I agree that improving rb_next() is worth looking into for the scheduler. For the IO scheduler as well, it's used quite extensively! So speeding up rb_next() would definitely help, as it's typically invoked for every bio queued (attempting to back merge with the next request). CFQ and AS additionally does an rb_next() and rb_prev() when trying to decide which request to do next. One possible course of action to implement this without eating extra space in the rb_node would be: - Add rb_right() and rb_set_right() (plus ditto _left variants) to rbtree.h - Convert all in-kernel users to use these. Quite extensive, as the rbtree code search/insert functions are coded in situ and not in rbtree.[ch] - Now we can overload bit 0 of -rb_right and -rb_left to indicate whether this is a node or thread pointer and modify rbtree.c to tag and add the thread links when appropriate. Exactly along the lines I was thinking of.and discussing with David. So we can definitely do this in a compatible fashion. Given that I have a flight coming up in a few days time, I may give it a got if no one beats me to it :-) Feel free to do so, please do keep me on the cc. I am very interested in getting rb threaded trees done, but my bandwidth is a little limited this month. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Make yield_task_fair more efficient
Ingo Molnar wrote: > * Balbir Singh <[EMAIL PROTECTED]> wrote: > >> I disagree. The cost is only adding a field to cfs_rq [...] > > wrong. The cost is "only" of adding a field to cfs_rq and _updating it_, > in the hottest paths of the scheduler: > > @@ -256,6 +257,7 @@ static void __enqueue_entity(struct cfs_ > */ > if (key < entity_key(cfs_rq, entry)) { > link = >rb_left; > + rightmost = 0; That's an update when we move leftwards. > } else { > link = >rb_right; > leftmost = 0; > @@ -268,6 +270,8 @@ static void __enqueue_entity(struct cfs_ > */ > if (leftmost) > cfs_rq->rb_leftmost = >run_node; > + if (rightmost) > + cfs_rq->rb_rightmost = >run_node; > >run_node is already in the cache, we are assigning cfs_rq->rb_rightmost to it. >> [...] For a large number of tasks - say 1, we need to walk 14 >> levels before we reach the node (each time). [...] > > 10,000 yield-ing tasks is not a common workload we care about. It's not > even a rare workload we care about. _Especially_ we dont care about it > if it slows down every other workload (a tiny bit). > sched_yield() is supported API and also look at http://lkml.org/lkml/2007/9/19/351. I am trying to make sched_yield() efficient when compat_sched_yield is turned on (which is most likely), since people will want that behaviour (Hint, please read the man-page for sched_yield).There are already several applications using sched_yield(), so they all suffer. >> [...] Doesn't matter if the data is cached, we are still spending CPU >> time looking through pointers and walking to the right node. [...] > > have you actually measured how much it takes to walk the tree that deep > on recent hardware? I have. I have measured how much time can be saved by not doing that and it's quite a lot. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Make yield_task_fair more efficient
* Balbir Singh <[EMAIL PROTECTED]> wrote: > I disagree. The cost is only adding a field to cfs_rq [...] wrong. The cost is "only" of adding a field to cfs_rq and _updating it_, in the hottest paths of the scheduler: @@ -256,6 +257,7 @@ static void __enqueue_entity(struct cfs_ */ if (key < entity_key(cfs_rq, entry)) { link = >rb_left; + rightmost = 0; } else { link = >rb_right; leftmost = 0; @@ -268,6 +270,8 @@ static void __enqueue_entity(struct cfs_ */ if (leftmost) cfs_rq->rb_leftmost = >run_node; + if (rightmost) + cfs_rq->rb_rightmost = >run_node; > [...] For a large number of tasks - say 1, we need to walk 14 > levels before we reach the node (each time). [...] 10,000 yield-ing tasks is not a common workload we care about. It's not even a rare workload we care about. _Especially_ we dont care about it if it slows down every other workload (a tiny bit). > [...] Doesn't matter if the data is cached, we are still spending CPU > time looking through pointers and walking to the right node. [...] have you actually measured how much it takes to walk the tree that deep on recent hardware? I have. Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Make yield_task_fair more efficient
Ingo Molnar wrote: > * Balbir Singh <[EMAIL PROTECTED]> wrote: > >> __pick_last_entity() walks the entire tree on O(lgn) time to find the >> rightmost entry. This patch makes the routine more efficient by >> reducing the cost of the lookup > > hm, i'm not sure we want to do this: we'd be slowing down the fastpath > of all the other common scheduler functions, for the sake of a rarely > used (and broken ...) API: yield. And note that an rbtree walk is not > slow at all - if you are yielding frequently it's probably all cached. > > Ingo Ingo, I disagree. The cost is only adding a field to cfs_rq and we already have the logic to track the leftmost node, we just update the rightmost node as well. For a large number of tasks - say 1, we need to walk 14 levels before we reach the node (each time). Doesn't matter if the data is cached, we are still spending CPU time looking through pointers and walking to the right node. If all the threads get a chance to run, you can imagine the effect it has on efficiency and the data cache. I see a boost in sched_yield performance and no big hit on regular performance. Could you please test it to see if your apprehensions are correct. PS: You missed to add me on the cc/to list. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Make yield_task_fair more efficient
* Balbir Singh <[EMAIL PROTECTED]> wrote: > __pick_last_entity() walks the entire tree on O(lgn) time to find the > rightmost entry. This patch makes the routine more efficient by > reducing the cost of the lookup hm, i'm not sure we want to do this: we'd be slowing down the fastpath of all the other common scheduler functions, for the sake of a rarely used (and broken ...) API: yield. And note that an rbtree walk is not slow at all - if you are yielding frequently it's probably all cached. Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Make yield_task_fair more efficient
* Balbir Singh [EMAIL PROTECTED] wrote: __pick_last_entity() walks the entire tree on O(lgn) time to find the rightmost entry. This patch makes the routine more efficient by reducing the cost of the lookup hm, i'm not sure we want to do this: we'd be slowing down the fastpath of all the other common scheduler functions, for the sake of a rarely used (and broken ...) API: yield. And note that an rbtree walk is not slow at all - if you are yielding frequently it's probably all cached. Ingo -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Make yield_task_fair more efficient
Ingo Molnar wrote: * Balbir Singh [EMAIL PROTECTED] wrote: __pick_last_entity() walks the entire tree on O(lgn) time to find the rightmost entry. This patch makes the routine more efficient by reducing the cost of the lookup hm, i'm not sure we want to do this: we'd be slowing down the fastpath of all the other common scheduler functions, for the sake of a rarely used (and broken ...) API: yield. And note that an rbtree walk is not slow at all - if you are yielding frequently it's probably all cached. Ingo Ingo, I disagree. The cost is only adding a field to cfs_rq and we already have the logic to track the leftmost node, we just update the rightmost node as well. For a large number of tasks - say 1, we need to walk 14 levels before we reach the node (each time). Doesn't matter if the data is cached, we are still spending CPU time looking through pointers and walking to the right node. If all the threads get a chance to run, you can imagine the effect it has on efficiency and the data cache. I see a boost in sched_yield performance and no big hit on regular performance. Could you please test it to see if your apprehensions are correct. PS: You missed to add me on the cc/to list. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Make yield_task_fair more efficient
* Balbir Singh [EMAIL PROTECTED] wrote: I disagree. The cost is only adding a field to cfs_rq [...] wrong. The cost is only of adding a field to cfs_rq and _updating it_, in the hottest paths of the scheduler: @@ -256,6 +257,7 @@ static void __enqueue_entity(struct cfs_ */ if (key entity_key(cfs_rq, entry)) { link = parent-rb_left; + rightmost = 0; } else { link = parent-rb_right; leftmost = 0; @@ -268,6 +270,8 @@ static void __enqueue_entity(struct cfs_ */ if (leftmost) cfs_rq-rb_leftmost = se-run_node; + if (rightmost) + cfs_rq-rb_rightmost = se-run_node; [...] For a large number of tasks - say 1, we need to walk 14 levels before we reach the node (each time). [...] 10,000 yield-ing tasks is not a common workload we care about. It's not even a rare workload we care about. _Especially_ we dont care about it if it slows down every other workload (a tiny bit). [...] Doesn't matter if the data is cached, we are still spending CPU time looking through pointers and walking to the right node. [...] have you actually measured how much it takes to walk the tree that deep on recent hardware? I have. Ingo -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Make yield_task_fair more efficient
Ingo Molnar wrote: * Balbir Singh [EMAIL PROTECTED] wrote: I disagree. The cost is only adding a field to cfs_rq [...] wrong. The cost is only of adding a field to cfs_rq and _updating it_, in the hottest paths of the scheduler: @@ -256,6 +257,7 @@ static void __enqueue_entity(struct cfs_ */ if (key entity_key(cfs_rq, entry)) { link = parent-rb_left; + rightmost = 0; That's an update when we move leftwards. } else { link = parent-rb_right; leftmost = 0; @@ -268,6 +270,8 @@ static void __enqueue_entity(struct cfs_ */ if (leftmost) cfs_rq-rb_leftmost = se-run_node; + if (rightmost) + cfs_rq-rb_rightmost = se-run_node; se-run_node is already in the cache, we are assigning cfs_rq-rb_rightmost to it. [...] For a large number of tasks - say 1, we need to walk 14 levels before we reach the node (each time). [...] 10,000 yield-ing tasks is not a common workload we care about. It's not even a rare workload we care about. _Especially_ we dont care about it if it slows down every other workload (a tiny bit). sched_yield() is supported API and also look at http://lkml.org/lkml/2007/9/19/351. I am trying to make sched_yield() efficient when compat_sched_yield is turned on (which is most likely), since people will want that behaviour (Hint, please read the man-page for sched_yield).There are already several applications using sched_yield(), so they all suffer. [...] Doesn't matter if the data is cached, we are still spending CPU time looking through pointers and walking to the right node. [...] have you actually measured how much it takes to walk the tree that deep on recent hardware? I have. I have measured how much time can be saved by not doing that and it's quite a lot. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/