Re: Make yield_task_fair more efficient

2008-02-21 Thread Balbir Singh
Jens Axboe wrote:
> On Thu, Feb 21 2008, Jens Axboe wrote:
>> On Thu, Feb 21 2008, Peter Zijlstra wrote:
>>> On Thu, 2008-02-21 at 15:37 +0530, Balbir Singh wrote:
>>>
>>>> You use the empty pointer (missing right child), so why do we need a
>>>> list? Maybe I am missing something.
>>> A fully threaded tree also has back-pointer to traverse backwards
>>> through the ordered elements.
>>>
>>> That said, overloading the right child pointer might not be the best
>>> thing for the linux kernel, as it will impact all the rb-tree lookups
>>> which are open-coded and often performance critical (this is the reason
>>> the colour isn't bit encoded in either of the child pointers either).
>>>
>>> But if you only want a uni directional thread, I guess we can stick it
>>> in the unsigned long we use for the node colour.
>>>
>>> Still, perhaps it's worth it to grow rb_node to 4 words and do the fully
>>> threaded thing as there are also a lot of rb_prev() users in the kernel.
>>> Who knows..
>>>
>>> Anyway, I agree that improving rb_next() is worth looking into for the
>>> scheduler.
>> For the IO scheduler as well, it's used quite extensively! So speeding
>> up rb_next() would definitely help, as it's typically invoked for every
>> bio queued (attempting to back merge with the next request). CFQ and AS
>> additionally does an rb_next() and rb_prev() when trying to decide which
>> request to do next.
> 
> One possible course of action to implement this without eating extra
> space in the rb_node would be:
> 
> - Add rb_right() and rb_set_right() (plus ditto _left variants) to
>   rbtree.h
> - Convert all in-kernel users to use these. Quite extensive, as the
>   rbtree code search/insert functions are coded in situ and not in
>   rbtree.[ch]
> - Now we can overload bit 0 of ->rb_right and ->rb_left to indicate
>   whether this is a node or thread pointer and modify rbtree.c to tag
>   and add the thread links when appropriate.
> 

Exactly along the lines I was thinking of and discussing with David.

> So we can definitely do this in a compatible fashion. Given that I have
> a flight coming up in a few days' time, I may give it a go if no one
> beats me to it :-)
> 

Feel free to do so; please do keep me on the cc. I am very interested in
getting threaded rb-trees done, but my bandwidth is a little limited this month.


-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Make yield_task_fair more efficient

2008-02-21 Thread Jens Axboe
On Thu, Feb 21 2008, Jens Axboe wrote:
> On Thu, Feb 21 2008, Peter Zijlstra wrote:
> > 
> > On Thu, 2008-02-21 at 15:37 +0530, Balbir Singh wrote:
> > 
> > > You use the empty pointer (missing right child), so why do we need a
> > > list? Maybe I am missing something.
> > 
> > A fully threaded tree also has back-pointer to traverse backwards
> > through the ordered elements.
> > 
> > That said, overloading the right child pointer might not be the best
> > thing for the linux kernel, as it will impact all the rb-tree lookups
> > which are open-coded and often performance critical (this is the reason
> > the colour isn't bit encoded in either of the child pointers either).
> > 
> > But if you only want a uni directional thread, I guess we can stick it
> > in the unsigned long we use for the node colour.
> > 
> > Still, perhaps it's worth it to grow rb_node to 4 words and do the fully
> > threaded thing as there are also a lot of rb_prev() users in the kernel.
> > Who knows..
> > 
> > Anyway, I agree that improving rb_next() is worth looking into for the
> > scheduler.
> 
> For the IO scheduler as well, it's used quite extensively! So speeding
> up rb_next() would definitely help, as it's typically invoked for every
> bio queued (attempting to back merge with the next request). CFQ and AS
> additionally does an rb_next() and rb_prev() when trying to decide which
> request to do next.

One possible course of action to implement this without eating extra
space in the rb_node would be:

- Add rb_right() and rb_set_right() (plus ditto _left variants) to
  rbtree.h
- Convert all in-kernel users to use these. Quite extensive, as the
  rbtree code search/insert functions are coded in situ and not in
  rbtree.[ch]
- Now we can overload bit 0 of ->rb_right and ->rb_left to indicate
  whether this is a node or thread pointer and modify rbtree.c to tag
  and add the thread links when appropriate.
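Steps one and three of the plan above might look roughly like this. This is a sketch only, not kernel code: it assumes rb_node stays pointer-aligned so bit 0 of the child pointers is free, and every name beyond rb_right()/rb_set_right() is hypothetical:

```c
/* Sketch of the proposed accessor scheme -- illustrative, not kernel code.
 * Bit 0 of a child pointer distinguishes a real child (0) from an
 * inorder-thread link (1); rb_node stays pointer-aligned, so bit 0 is free. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct rb_node {
	struct rb_node *rb_left;
	struct rb_node *rb_right;
};

#define RB_THREAD 1UL

/* Real right child, or NULL if ->rb_right is a thread link (or empty). */
static struct rb_node *rb_right(const struct rb_node *n)
{
	uintptr_t p = (uintptr_t)n->rb_right;

	return (p & RB_THREAD) ? NULL : (struct rb_node *)p;
}

static void rb_set_right(struct rb_node *n, struct rb_node *child)
{
	n->rb_right = child;	/* plain child pointer, tag bit clear */
}

/* Install a thread link: "my inorder successor is succ". */
static void rb_set_right_thread(struct rb_node *n, struct rb_node *succ)
{
	n->rb_right = (struct rb_node *)((uintptr_t)succ | RB_THREAD);
}

/* O(1) successor when a thread link is present, NULL otherwise. */
static struct rb_node *rb_thread_right(const struct rb_node *n)
{
	uintptr_t p = (uintptr_t)n->rb_right;

	return (p & RB_THREAD) ? (struct rb_node *)(p & ~RB_THREAD) : NULL;
}
```

With accessors like these in place everywhere, rbtree.c could start tagging thread links without open-coded lookups ever seeing them.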

So we can definitely do this in a compatible fashion. Given that I have
a flight coming up in a few days' time, I may give it a go if no one
beats me to it :-)

-- 
Jens Axboe



Re: Make yield_task_fair more efficient

2008-02-21 Thread Jens Axboe
On Thu, Feb 21 2008, Peter Zijlstra wrote:
> 
> On Thu, 2008-02-21 at 15:37 +0530, Balbir Singh wrote:
> 
> > You use the empty pointer (missing right child), so why do we need a list?
> > Maybe I am missing something.
> 
> A fully threaded tree also has back-pointer to traverse backwards
> through the ordered elements.
> 
> That said, overloading the right child pointer might not be the best
> thing for the linux kernel, as it will impact all the rb-tree lookups
> which are open-coded and often performance critical (this is the reason
> the colour isn't bit encoded in either of the child pointers either).
> 
> But if you only want a uni directional thread, I guess we can stick it
> in the unsigned long we use for the node colour.
> 
> Still, perhaps it's worth it to grow rb_node to 4 words and do the fully
> threaded thing as there are also a lot of rb_prev() users in the kernel.
> Who knows..
> 
> Anyway, I agree that improving rb_next() is worth looking into for the
> scheduler.

For the IO scheduler as well, it's used quite extensively! So speeding
up rb_next() would definitely help, as it's typically invoked for every
bio queued (attempting to back merge with the next request). CFQ and AS
additionally do an rb_next() and rb_prev() when trying to decide which
request to do next.
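For reference, the successor walk that rb_next() has to do today (and that threading would short-circuit) looks roughly like this. This is a simplified sketch with an explicit parent pointer, whereas the kernel actually packs the parent into rb_parent_color:

```c
/* Simplified inorder-successor walk, mirroring the logic of rb_next()
 * in lib/rbtree.c but with an explicit parent pointer for clarity. */
#include <assert.h>
#include <stddef.h>

struct node {
	struct node *parent, *left, *right;
};

static struct node *tree_next(struct node *n)
{
	struct node *parent;

	if (n->right) {			/* leftmost node of right subtree */
		n = n->right;
		while (n->left)
			n = n->left;
		return n;
	}
	/* No right child: climb until we come up from a left child. */
	while ((parent = n->parent) && n == parent->right)
		n = parent;
	return parent;			/* NULL when n was the last node */
}
```

The climb in the second half is the cost a thread link would replace with a single pointer dereference.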

-- 
Jens Axboe



Re: Make yield_task_fair more efficient

2008-02-21 Thread Jörn Engel
On Thu, 21 February 2008 12:21:33 +0530, Balbir Singh wrote:
> 
> For a large number of tasks - say 10,000 - we need to walk 14 levels before we

16.7, actually.  rbtrees are balanced trees, but they are not balanced
binary trees.  The average fan-out is sqrt(3) instead of 2.
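Taking the example to be 10,000 tasks, the 16.7 figure follows from the fan-out estimate:

```latex
h \;\approx\; \log_{\sqrt{3}} n
  \;=\; \frac{\ln n}{\ln \sqrt{3}}
  \;=\; \frac{2 \ln 10^{4}}{\ln 3}
  \;\approx\; \frac{2 \times 9.21}{1.099}
  \;\approx\; 16.8
```

whereas a perfectly balanced binary tree would give \(\log_2 10^{4} \approx 13.3\), i.e. the ~14 levels mentioned in the parent mail.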

Jörn

-- 
The cost of changing business rules is much more expensive for software
than for a secretary.
-- unknown


Re: Make yield_task_fair more efficient

2008-02-21 Thread Balbir Singh
Mike Galbraith wrote:
> Hi,
> 
> On Thu, 2008-02-21 at 15:01 +0530, Balbir Singh wrote:
>> Ingo Molnar wrote:
>>> * Balbir Singh <[EMAIL PROTECTED]> wrote:
>>>
 If you insist that sched_yield() is bad, I might agree, but how does 
 my patch make things worse. [...]
>>> it puts new instructions into the hotpath.
>>>
 [...] In my benchmarks, it has helped the sched_yield case, why is 
 that bad? [...]
>>> I had the same cache for the rightmost task in earlier CFS (it's a 
>>> really obvious thing) but removed it. It wasn't a bad idea, but it hurt 
>>> the fastpath, hence I removed it. Algorithms and implementations are a 
>>> constant balancing act.
>> This is more convincing, was the code ever in git? How did you measure the
>> overhead?
> 
> Counting enqueue/dequeue cycles on my 3GHz P4/HT running a 60 seconds
> netperf test that does ~85k/s context switches  shows:
> 
> sched_cycles: 7198444348 unpatched
> vs
> sched_cycles: 8574036268 patched


Thanks for the numbers! I am very convinced that the patch should stay out until
we can find a way to reduce the overhead. I'll try your patch and see what the
numbers look like as well.

-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: Make yield_task_fair more efficient

2008-02-21 Thread Mike Galbraith

On Thu, 2008-02-21 at 13:02 +0100, Mike Galbraith wrote:
> sched_cycles: 7198444348 unpatched
> vs
> sched_cycles: 8574036268 patched

(in case it's not clear, patched means your patch, not my quick/dirty
countem patch:)



Re: Make yield_task_fair more efficient

2008-02-21 Thread Mike Galbraith
Hi,

On Thu, 2008-02-21 at 15:01 +0530, Balbir Singh wrote:
> Ingo Molnar wrote:
> > * Balbir Singh <[EMAIL PROTECTED]> wrote:
> > 
> >> If you insist that sched_yield() is bad, I might agree, but how does 
> >> my patch make things worse. [...]
> > 
> > it puts new instructions into the hotpath.
> > 
> >> [...] In my benchmarks, it has helped the sched_yield case, why is 
> >> that bad? [...]
> > 
> > I had the same cache for the rightmost task in earlier CFS (it's a 
> > really obvious thing) but removed it. It wasn't a bad idea, but it hurt 
> > the fastpath, hence I removed it. Algorithms and implementations are a 
> > constant balancing act.
> 
> This is more convincing, was the code ever in git? How did you measure the
> overhead?

Counting enqueue/dequeue cycles on my 3GHz P4/HT running a 60 seconds
netperf test that does ~85k/s context switches  shows:

sched_cycles: 7198444348 unpatched
vs
sched_cycles: 8574036268 patched
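For context, that difference works out to roughly a 19% increase in enqueue/dequeue cycle cost:

```latex
\frac{8\,574\,036\,268 - 7\,198\,444\,348}{7\,198\,444\,348}
  \;=\; \frac{1\,375\,591\,920}{7\,198\,444\,348}
  \;\approx\; 0.191 \;\approx\; 19\%
```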

-Mike

diff --git a/include/linux/sched.h b/include/linux/sched.h
index e217d18..a71edfe 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1262,6 +1262,7 @@ struct task_struct {
int latency_record_count;
struct latency_record latency_record[LT_SAVECOUNT];
 #endif
+   unsigned long long sched_cycles;
 };
 
 /*
diff --git a/kernel/exit.c b/kernel/exit.c
index 506a957..c5c3d5e 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -174,6 +174,8 @@ repeat:
}
 
 	write_unlock_irq(&tasklist_lock);
+   if (!strcmp("netperf", p->comm) || !strcmp("netserver", p->comm))
+	if (!strcmp("netperf", p->comm) || !strcmp("netserver", p->comm))
+		printk(KERN_WARNING "%s:%d sched_cycles: %Ld\n", p->comm, p->pid, p->sched_cycles);
release_thread(p);
 	call_rcu(&p->rcu, delayed_put_task_struct);
 
diff --git a/kernel/sched.c b/kernel/sched.c
index f28f19e..d7c1b0b 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1301,15 +1301,25 @@ static void set_load_weight(struct task_struct *p)
 
 static void enqueue_task(struct rq *rq, struct task_struct *p, int wakeup)
 {
+   unsigned long long then, now;
+
+   rdtscll(then);
sched_info_queued(p);
p->sched_class->enqueue_task(rq, p, wakeup);
p->se.on_rq = 1;
+   rdtscll(now);
+   p->sched_cycles += now - then;
 }
 
 static void dequeue_task(struct rq *rq, struct task_struct *p, int sleep)
 {
+   unsigned long long then, now;
+
+   rdtscll(then);
p->sched_class->dequeue_task(rq, p, sleep);
p->se.on_rq = 0;
+   rdtscll(now);
+   p->sched_cycles += now - then;
 }
 
 /*
@@ -2009,6 +2019,7 @@ void wake_up_new_task(struct task_struct *p, unsigned long clone_flags)
update_rq_clock(rq);
 
p->prio = effective_prio(p);
+   p->sched_cycles = 0;
 
if (!p->sched_class->task_new || !current->se.on_rq) {
activate_task(rq, p, 0);




Re: Make yield_task_fair more efficient

2008-02-21 Thread Balbir Singh
Peter Zijlstra wrote:
> On Thu, 2008-02-21 at 15:37 +0530, Balbir Singh wrote:
> 
>> You use the empty pointer (missing right child), so why do we need a list?
>> Maybe I am missing something.
> 
> A fully threaded tree also has back-pointer to traverse backwards
> through the ordered elements.
> 

The back pointer is implemented using missing left children. I am referring to
Donald Knuth's The Art of Computer Programming, Volume 1, section 2.3.1, page
322. Please see the threaded representation of the tree there; it is compared
with the unthreaded representation.

> That said, overloading the right child pointer might not be the best
> thing for the linux kernel, as it will impact all the rb-tree lookups
> which are open-coded and often performance critical (this is the reason
> the colour isn't bit encoded in either of the child pointers either).
> 

We always look at the child pointer, whether right or left, to terminate our
walk, so overloading the child pointers with a bit that marks them as threads
should not be that bad.

> But if you only want a uni directional thread, I guess we can stick it
> in the unsigned long we use for the node colour.
> 
> Still, perhaps it's worth it to grow rb_node to 4 words and do the fully
> threaded thing as there are also a lot of rb_prev() users in the kernel.
> Who knows..
> 

Why does the rb_node need to grow? We can encode the bits in the children.

> Anyway, I agree that improving rb_next() is worth looking into for the
> scheduler.

Sure, will experiment and see if we can bump up performance.

-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: Make yield_task_fair more efficient

2008-02-21 Thread Peter Zijlstra

On Thu, 2008-02-21 at 15:37 +0530, Balbir Singh wrote:

> You use the empty pointer (missing right child), so why do we need a list?
> Maybe I am missing something.

A fully threaded tree also has back-pointer to traverse backwards
through the ordered elements.

That said, overloading the right child pointer might not be the best
thing for the linux kernel, as it will impact all the rb-tree lookups
which are open-coded and often performance critical (this is the reason
the colour isn't bit encoded in either of the child pointers either).

But if you only want a uni directional thread, I guess we can stick it
in the unsigned long we use for the node colour.

Still, perhaps it's worth it to grow rb_node to 4 words and do the fully
threaded thing as there are also a lot of rb_prev() users in the kernel.
Who knows..

Anyway, I agree that improving rb_next() is worth looking into for the
scheduler.




Re: Make yield_task_fair more efficient

2008-02-21 Thread Balbir Singh
Balbir Singh wrote:
> Ingo Molnar wrote:
>> * Balbir Singh <[EMAIL PROTECTED]> wrote:
>>
>>> If you insist that sched_yield() is bad, I might agree, but how does 
>>> my patch make things worse. [...]
>> it puts new instructions into the hotpath.
>>
>>> [...] In my benchmarks, it has helped the sched_yield case, why is 
>>> that bad? [...]
>> I had the same cache for the rightmost task in earlier CFS (it's a 
>> really obvious thing) but removed it. It wasn't a bad idea, but it hurt 
>> the fastpath, hence I removed it. Algorithms and implementations are a 
>> constant balancing act.
> 
> This is more convincing, was the code ever in git? How did you measure the
> overhead? What are your plans for reports with regressions where
> kernel.compat_sched_yield is set to 1?
> 

Ingo,

I was looking through nptl/sysdeps/unix/sysv/linux/pthread_yield.c in the glibc
sources and it looks like pthread_yield() calls sched_yield(). Applications
using compat_sched_yield and pthread_yield() are also going to be impacted.

I searched for pthread_yield and sched_yield using Google Code Search to look at
the applications that use these routines; the result list is too big for me to
mark applications as candidates for potential improvement.

-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: Make yield_task_fair more efficient

2008-02-21 Thread Balbir Singh
Peter Zijlstra wrote:
> On Thu, 2008-02-21 at 15:12 +0530, Balbir Singh wrote:
>> Peter Zijlstra wrote:
>>> On Thu, 2008-02-21 at 15:01 +0530, Balbir Singh wrote:
>>>
 I have an alternate approach in mind (that I need to find time for),
 threaded-rbtrees. Walking the tree is really efficient, specially finding
 successor of a node.
>>> Threading the rbtrees would be even more expensive, it would require a
>>> list_head in each node and a full list operation for every tree
>>> operation.
>>>
>> Peter, when I say threaded, I don't mean threaded as in tasks. Please see
>> http://www.nist.gov/dads/HTML/threadedtree.html and
>> http://datastructures.itgo.com/trees/tbt.htm
> 
> I'm quite aware of the meaning of threaded in the context of data
> structures.

Oh! I did not mean to assume anything. Just wanted to make sure we are talking
about the same thing.

> They usually require additional list data and operations to
> implement.
> 

From the definition:
A Threaded Binary Tree is a binary tree in which every node that does not have a
right child has a THREAD (in effect, a link) to its INORDER successor.

You use the empty pointer (missing right child), so why do we need a list?
Maybe I am missing something.
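As an illustrative sketch of the definition just quoted (plain C, not kernel code; the field names are hypothetical): the inorder successor follows the thread link when the right pointer is one, and otherwise descends to the leftmost node of the right subtree:

```c
/* Inorder successor in a right-threaded binary tree: a node without a
 * real right child stores a "thread" link to its inorder successor. */
#include <assert.h>
#include <stddef.h>

struct tnode {
	struct tnode *left;
	struct tnode *right;
	int right_is_thread;	/* 1: ->right is a successor link */
};

static struct tnode *inorder_next(struct tnode *n)
{
	if (n->right_is_thread)
		return n->right;	/* O(1): follow the thread */
	n = n->right;			/* real right child (NULL at maximum) */
	if (!n)
		return NULL;
	while (n->left)			/* leftmost node of right subtree */
		n = n->left;
	return n;
}
```

The empty right-child slot is reused for the thread, which is why no separate list is needed.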

> Look at how the latter link you provided grows the rb_node with one
> pointer to provide a single linked list.
> 

Yes, that's a bad example I suppose. Look at
http://www.cs.rutgers.edu/~kaplan/503/thread.html

-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: Make yield_task_fair more efficient

2008-02-21 Thread Peter Zijlstra

On Thu, 2008-02-21 at 15:12 +0530, Balbir Singh wrote:
> Peter Zijlstra wrote:
> > On Thu, 2008-02-21 at 15:01 +0530, Balbir Singh wrote:
> > 
> >> I have an alternate approach in mind (that I need to find time for),
> >> threaded-rbtrees. Walking the tree is really efficient, specially finding
> >> successor of a node.
> > 
> > Threading the rbtrees would be even more expensive, it would require a
> > list_head in each node and a full list operation for every tree
> > operation.
> > 
> 
> Peter, when I say threaded, I don't mean threaded as in tasks. Please see
> http://www.nist.gov/dads/HTML/threadedtree.html and
> http://datastructures.itgo.com/trees/tbt.htm

I'm quite aware of the meaning of threaded in the context of data
structures. They usually require additional list data and operations to
implement.

Look at how the latter link you provided grows the rb_node with one
pointer to provide a single linked list.



Re: Make yield_task_fair more efficient

2008-02-21 Thread Balbir Singh
Ingo Molnar wrote:
> * Balbir Singh <[EMAIL PROTECTED]> wrote:
> 

You did not answer some of my earlier questions.

>> I have an alternate approach in mind (that I need to find time for), 
>> threaded-rbtrees. Walking the tree is really efficient, specially 
>> finding successor of a node.
> 
> sure, feel free to experiment with those details.
> 
> But if you want to improve Java workloads then you should probably start 
> by converting them to futexes instead of sched_yield(), that will 
> probably give far more performance than micro-optimizing the 
> sys_sched_yield() codepath. (especially when it happens at the expense 
> of other workloads, which is not acceptable for mainline) You can 
> rebuild your JVM easily and re-test with its locking fixed, right?

No, I don't want to optimize the JVM or rebuild it. I don't have access to any
JVM code either. I wanted to do the threaded rb-trees for the case where we
spend time finding the successor using rb_next() (walking the tree).

-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: Make yield_task_fair more efficient

2008-02-21 Thread Balbir Singh
Peter Zijlstra wrote:
> On Thu, 2008-02-21 at 15:01 +0530, Balbir Singh wrote:
> 
>> I have an alternate approach in mind (that I need to find time for),
>> threaded-rbtrees. Walking the tree is really efficient, specially finding
>> successor of a node.
> 
> Threading the rbtrees would be even more expensive, it would require a
> list_head in each node and a full list operation for every tree
> operation.
> 

Peter, when I say threaded, I don't mean threaded as in tasks. Please see
http://www.nist.gov/dads/HTML/threadedtree.html and
http://datastructures.itgo.com/trees/tbt.htm

-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: Make yield_task_fair more efficient

2008-02-21 Thread Peter Zijlstra

On Thu, 2008-02-21 at 15:01 +0530, Balbir Singh wrote:

> I have an alternate approach in mind (that I need to find time for),
> threaded-rbtrees. Walking the tree is really efficient, specially finding
> successor of a node.

Threading the rbtrees would be even more expensive, it would require a
list_head in each node and a full list operation for every tree
operation.



Re: Make yield_task_fair more efficient

2008-02-21 Thread Ingo Molnar

* Balbir Singh <[EMAIL PROTECTED]> wrote:

> I have an alternate approach in mind (that I need to find time for), 
> threaded-rbtrees. Walking the tree is really efficient, specially 
> finding successor of a node.

sure, feel free to experiment with those details.

But if you want to improve Java workloads then you should probably start 
by converting them to futexes instead of sched_yield(), that will 
probably give far more performance than micro-optimizing the 
sys_sched_yield() codepath. (especially when it happens at the expense 
of other workloads, which is not acceptable for mainline) You can 
rebuild your JVM easily and re-test with its locking fixed, right?

Ingo


Re: Make yield_task_fair more efficient

2008-02-21 Thread Balbir Singh
Ingo Molnar wrote:
> * Balbir Singh <[EMAIL PROTECTED]> wrote:
> 
>> If you insist that sched_yield() is bad, I might agree, but how does 
>> my patch make things worse. [...]
> 
> it puts new instructions into the hotpath.
> 
>> [...] In my benchmarks, it has helped the sched_yield case, why is 
>> that bad? [...]
> 
> I had the same cache for the rightmost task in earlier CFS (it's a 
> really obvious thing) but removed it. It wasn't a bad idea, but it hurt 
> the fastpath, hence I removed it. Algorithms and implementations are a 
> constant balancing act.

This is more convincing, was the code ever in git? How did you measure the
overhead? What are your plans for reports with regressions where
kernel.compat_sched_yield is set to 1?

I have an alternate approach in mind (that I need to find time for),
threaded-rbtrees. Walking the tree is really efficient, specially finding
successor of a node.


-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: Make yield_task_fair more efficient

2008-02-21 Thread Ingo Molnar

* Balbir Singh <[EMAIL PROTECTED]> wrote:

> If you insist that sched_yield() is bad, I might agree, but how does 
> my patch make things worse. [...]

it puts new instructions into the hotpath.

> [...] In my benchmarks, it has helped the sched_yield case, why is 
> that bad? [...]

I had the same cache for the rightmost task in earlier CFS (it's a 
really obvious thing) but removed it. It wasn't a bad idea, but it hurt 
the fastpath, hence I removed it. Algorithms and implementations are a 
constant balancing act.

Ingo



Re: Make yield_task_fair more efficient

2008-02-21 Thread Balbir Singh
Peter Zijlstra wrote:
> On Thu, 2008-02-21 at 13:09 +0530, Balbir Singh wrote:
> 
>> sched_yield() is supported API
> 
> For SCHED_FIFO/SCHED_RR.
> 
>>  and also look at http://lkml.org/lkml/2007/9/19/351.
> 
> Read on (http://lkml.org/lkml/2007/9/19/371) and find:
> 
> The sched_yield() behaviour is actually very well-defined for RT
> tasks (now, whether it's a good interface to use or not is still
> open to debate, but at least it's a _defined_ interface ;), and
> I think we should at least try to approximate that behaviour for
> normal tasks, even if they aren't RT.
> 
> sched_yield() just isn't a very useful API except in RT and even there
> one can question it.
> 

But it's being used very heavily by existing applications. Can we break/penalize
them because we think it's not a good interface or a good way to program?


>> I am trying to make sched_yield() efficient
>> when compat_sched_yield is turned on (which is most likely), since people 
>> will
>> want that behaviour (Hint, please read the man-page for sched_yield).
> 
> I did, so what? Read the POSIX spec - it's just plain undefined for
> SCHED_OTHER. We already do something 'sensible' for those few broken
> applications relying on unspecified behaviour.
> 
>> There are
>> already several applications using sched_yield(), so they all suffer.
> 
> Who is using it, where is their source so we can show it's faster to not
> use it?
> 
> Really, hiding behind closed sores doesn't make it good.
> 
> 

I am not hiding behind closed source. Please refer to the regression reported by
Yanmin where he used compat_sched_yield=1 (see
http://lkml.org/lkml/2008/2/18/7). The regression might be due to other reasons,
but the point is that people are using compat_sched_yield=1.

If you insist that sched_yield() is bad, I might agree, but how does my patch
make things worse? In my benchmarks, it has helped the sched_yield case, so why
is that bad? As far as touching the hot-path is concerned, give me a benchmark
that I can run on my desktop that will convince you that the overhead of the
patch is not all that high. If there is an overhead that is very visible, I'll
stop pushing the patch.

-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: Make yield_task_fair more efficient

2008-02-21 Thread Peter Zijlstra

On Thu, 2008-02-21 at 13:09 +0530, Balbir Singh wrote:

> sched_yield() is supported API

For SCHED_FIFO/SCHED_RR.

>  and also look at http://lkml.org/lkml/2007/9/19/351.

Read on (http://lkml.org/lkml/2007/9/19/371) and find:

The sched_yield() behaviour is actually very well-defined for RT
tasks (now, whether it's a good interface to use or not is still
open to debate, but at least it's a _defined_ interface ;), and
I think we should at least try to approximate that behaviour for
normal tasks, even if they aren't RT.

sched_yield() just isn't a very useful API except in RT and even there
one can question it.

> I am trying to make sched_yield() efficient
> when compat_sched_yield is turned on (which is most likely), since people will
> want that behaviour (Hint, please read the man-page for sched_yield).

I did, so what? Read the POSIX spec - it's just plain undefined for
SCHED_OTHER. We already do something 'sensible' for those few broken
applications relying on unspecified behaviour.

> There are
> already several applications using sched_yield(), so they all suffer.

Who is using it, where is their source so we can show it's faster to not
use it?

Really, hiding behind closed sores doesn't make it good.




Re: Make yield_task_fair more efficient

2008-02-21 Thread Peter Zijlstra

On Thu, 2008-02-21 at 13:09 +0530, Balbir Singh wrote:

 sched_yield() is supported API

For SCHED_FIFO/SCHED_RR.

  and also look at http://lkml.org/lkml/2007/9/19/351.

Read on (http://lkml.org/lkml/2007/9/19/371) and find:

The sched_yield() behaviour is actually very well-defined for RT
tasks (now, whether it's a good interface to use or not is still
open to debate, but at least it's a _defined_ interface ;), and
I think we should at least try to approximate that behaviour for
normal tasks, even if they aren't RT.

sched_yield() just isn't a very useful API except in RT and even there
one can question it.

 I am trying to make sched_yield() efficient
 when compat_sched_yield is turned on (which is most likely), since people will
 want that behaviour (Hint, please read the man-page for sched_yield).

I did, so what? read the POSIX spec - its just plain undefined for
SCHED_OTHER. We already do something 'sensible' for those few broken
applications relying on unspecified behaviour.

 There are
 already several applications using sched_yield(), so they all suffer.

Who is using it, where is their source so we can show its faster to not
use it?

Really, hiding behind closed sores doesn't make it good.




Re: Make yield_task_fair more efficient

2008-02-21 Thread Balbir Singh
Peter Zijlstra wrote:
> On Thu, 2008-02-21 at 13:09 +0530, Balbir Singh wrote:
>
>> sched_yield() is supported API
>
> For SCHED_FIFO/SCHED_RR.
>
>>  and also look at http://lkml.org/lkml/2007/9/19/351.
>
> Read on (http://lkml.org/lkml/2007/9/19/371) and find:
>
>     The sched_yield() behaviour is actually very well-defined for RT
>     tasks (now, whether it's a good interface to use or not is still
>     open to debate, but at least it's a _defined_ interface ;), and
>     I think we should at least try to approximate that behaviour for
>     normal tasks, even if they aren't RT.
>
> sched_yield() just isn't a very useful API except in RT and even there
> one can question it.

But it's being used very heavily by existing applications. Can we break/penalize
them because we think it's not a good interface or a method to program?

>> I am trying to make sched_yield() efficient
>> when compat_sched_yield is turned on (which is most likely), since people
>> will want that behaviour (Hint, please read the man-page for sched_yield).
>
> I did, so what? Read the POSIX spec - it's just plain undefined for
> SCHED_OTHER. We already do something 'sensible' for those few broken
> applications relying on unspecified behaviour.
>
>> There are
>> already several applications using sched_yield(), so they all suffer.
>
> Who is using it, where is their source so we can show it's faster to not
> use it?
>
> Really, hiding behind closed sores doesn't make it good.

I am not hiding behind closed source. Please refer to the regression reported by
Yanmin where he used compat_sched_yield=1 (see
http://lkml.org/lkml/2008/2/18/7). The regression might be due to other reasons,
but the point is that people are using compat_sched_yield=1.

If you insist that sched_yield() is bad, I might agree, but how does my patch
make things worse? In my benchmarks, it has helped the sched_yield case; why is
that bad? As far as touching the hot-path is concerned, give me a benchmark that
I can run on my desktop that will convince you that the overhead of the patch
is not all that high. If there is an overhead that is very visible, I'll stop
pushing the patch.

-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: Make yield_task_fair more efficient

2008-02-21 Thread Ingo Molnar

* Balbir Singh [EMAIL PROTECTED] wrote:

> If you insist that sched_yield() is bad, I might agree, but how does
> my patch make things worse. [...]

it puts new instructions into the hotpath.

> [...] In my benchmarks, it has helped the sched_yield case, why is
> that bad? [...]

I had the same cache for the rightmost task in earlier CFS (it's a 
really obvious thing) but removed it. It wasnt a bad idea, but it hurt 
the fastpath hence i removed it. Algorithms and implementations are a 
constant balancing act.

Ingo



Re: Make yield_task_fair more efficient

2008-02-21 Thread Balbir Singh
Ingo Molnar wrote:
> * Balbir Singh [EMAIL PROTECTED] wrote:
>
>> If you insist that sched_yield() is bad, I might agree, but how does
>> my patch make things worse. [...]
>
> it puts new instructions into the hotpath.
>
>> [...] In my benchmarks, it has helped the sched_yield case, why is
>> that bad? [...]
>
> I had the same cache for the rightmost task in earlier CFS (it's a
> really obvious thing) but removed it. It wasnt a bad idea, but it hurt
> the fastpath hence i removed it. Algorithms and implementations are a
> constant balancing act.

This is more convincing, was the code ever in git? How did you measure the
overhead? What are your plans for reports with regressions where
kernel.compat_sched_yield is set to 1?

I have an alternate approach in mind (that I need to find time for),
threaded-rbtrees. Walking the tree is really efficient, specially finding
successor of a node.


-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: Make yield_task_fair more efficient

2008-02-21 Thread Ingo Molnar

* Balbir Singh [EMAIL PROTECTED] wrote:

> I have an alternate approach in mind (that I need to find time for),
> threaded-rbtrees. Walking the tree is really efficient, specially
> finding successor of a node.

sure, feel free to experiment with those details.

But if you want to improve Java workloads then you should probably start 
by converting them to futexes instead of sched_yield(), that will 
probably give far more performance than micro-optimizing the 
sys_sched_yield() codepath. (especially when it happens at the expense 
of other workloads, which is not acceptable for mainline) You can 
rebuild your JVM easily and re-test with its locking fixed, right?

Ingo


Re: Make yield_task_fair more efficient

2008-02-21 Thread Peter Zijlstra

On Thu, 2008-02-21 at 15:01 +0530, Balbir Singh wrote:

> I have an alternate approach in mind (that I need to find time for),
> threaded-rbtrees. Walking the tree is really efficient, specially finding
> successor of a node.

Threading the rbtrees would be even more expensive, it would require a
list_head in each node and a full list operation for every tree
operation.



Re: Make yield_task_fair more efficient

2008-02-21 Thread Balbir Singh
Ingo Molnar wrote:
> * Balbir Singh [EMAIL PROTECTED] wrote:
>

You did not answer some of my earlier questions.

>> I have an alternate approach in mind (that I need to find time for),
>> threaded-rbtrees. Walking the tree is really efficient, specially
>> finding successor of a node.
>
> sure, feel free to experiment with those details.
>
> But if you want to improve Java workloads then you should probably start
> by converting them to futexes instead of sched_yield(), that will
> probably give far more performance than micro-optimizing the
> sys_sched_yield() codepath. (especially when it happens at the expense
> of other workloads, which is not acceptable for mainline) You can
> rebuild your JVM easily and re-test with its locking fixed, right?

No, I don't want to optimize the JVM or rebuild it. I don't have access to any
JVM code either. I wanted to do the threaded rb-trees for the case where we
spend time finding the successor using rb_next() (walking the tree).

-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: Make yield_task_fair more efficient

2008-02-21 Thread Balbir Singh
Peter Zijlstra wrote:
> On Thu, 2008-02-21 at 15:01 +0530, Balbir Singh wrote:
>
>> I have an alternate approach in mind (that I need to find time for),
>> threaded-rbtrees. Walking the tree is really efficient, specially finding
>> successor of a node.
>
> Threading the rbtrees would be even more expensive, it would require a
> list_head in each node and a full list operation for every tree
> operation.

Peter, when I say threaded, I don't mean threaded as in tasks. Please see
http://www.nist.gov/dads/HTML/threadedtree.html and
http://datastructures.itgo.com/trees/tbt.htm

-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: Make yield_task_fair more efficient

2008-02-21 Thread Peter Zijlstra

On Thu, 2008-02-21 at 15:12 +0530, Balbir Singh wrote:
> Peter Zijlstra wrote:
>> On Thu, 2008-02-21 at 15:01 +0530, Balbir Singh wrote:
>>
>>> I have an alternate approach in mind (that I need to find time for),
>>> threaded-rbtrees. Walking the tree is really efficient, specially finding
>>> successor of a node.
>>
>> Threading the rbtrees would be even more expensive, it would require a
>> list_head in each node and a full list operation for every tree
>> operation.
>
> Peter, when I say threaded, I don't mean threaded as in tasks. Please see
> http://www.nist.gov/dads/HTML/threadedtree.html and
> http://datastructures.itgo.com/trees/tbt.htm

I'm quite aware of the meaning of threaded in the context of data
structures. They usually require additional list data and operations to
implement.

Look at how the latter link you provided grows the rb_node with one
pointer to provide a single linked list.



Re: Make yield_task_fair more efficient

2008-02-21 Thread Balbir Singh
Peter Zijlstra wrote:
> On Thu, 2008-02-21 at 15:12 +0530, Balbir Singh wrote:
>> Peter Zijlstra wrote:
>>> On Thu, 2008-02-21 at 15:01 +0530, Balbir Singh wrote:
>>>
>>>> I have an alternate approach in mind (that I need to find time for),
>>>> threaded-rbtrees. Walking the tree is really efficient, specially finding
>>>> successor of a node.
>>>
>>> Threading the rbtrees would be even more expensive, it would require a
>>> list_head in each node and a full list operation for every tree
>>> operation.
>>
>> Peter, when I say threaded, I don't mean threaded as in tasks. Please see
>> http://www.nist.gov/dads/HTML/threadedtree.html and
>> http://datastructures.itgo.com/trees/tbt.htm
>
> I'm quite aware of the meaning of threaded in the context of data
> structures.

Oh! I did not mean to assume anything. Just wanted to make sure we are talking
of the same thing.

> They usually require additional list data and operations to
> implement.

From the definition:
"A Threaded Binary Tree is a binary tree in which every node that does not have
a right child has a THREAD (in actual sense, a link) to its INORDER successor."

You use the empty pointer (missing right child), so why do we need a list?
Maybe I am missing something.

> Look at how the latter link you provided grows the rb_node with one
> pointer to provide a single linked list.

Yes, that's a bad example I suppose. Look at
http://www.cs.rutgers.edu/~kaplan/503/thread.html

-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: Make yield_task_fair more efficient

2008-02-21 Thread Balbir Singh
Balbir Singh wrote:
> Ingo Molnar wrote:
>> * Balbir Singh [EMAIL PROTECTED] wrote:
>>
>>> If you insist that sched_yield() is bad, I might agree, but how does
>>> my patch make things worse. [...]
>> it puts new instructions into the hotpath.
>>
>>> [...] In my benchmarks, it has helped the sched_yield case, why is
>>> that bad? [...]
>> I had the same cache for the rightmost task in earlier CFS (it's a
>> really obvious thing) but removed it. It wasnt a bad idea, but it hurt
>> the fastpath hence i removed it. Algorithms and implementations are a
>> constant balancing act.
>
> This is more convincing, was the code ever in git? How did you measure the
> overhead? What are your plans for reports with regressions where
> kernel.compat_sched_yield is set to 1?

Ingo,

I was looking through nptl/sysdeps/unix/sysv/linux/pthread_yield.c in the glibc
sources and it looks like pthread_yield() calls sched_yield(). Applications
using compat_sched_yield and pthread_yield() are also going to be impacted.

I searched for pthread_yield and sched_yield using google codesearch to look at
the applications that use these routines, the search list is too big for me to
mark applications as candidates for potential improvement.

-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: Make yield_task_fair more efficient

2008-02-21 Thread Peter Zijlstra

On Thu, 2008-02-21 at 15:37 +0530, Balbir Singh wrote:

> You use the empty pointer (missing right child), so why do we need a list.
> Maybe I am missing something.

A fully threaded tree also has back-pointer to traverse backwards
through the ordered elements.

That said, overloading the right child pointer might not be the best
thing for the linux kernel, as it will impact all the rb-tree lookups
which are open-coded and often performance critical (this is the reason
the colour isn't bit encoded in either of the child pointers either).

But if you only want a uni directional thread, I guess we can stick it
in the unsigned long we use for the node colour.

Still, perhaps it's worth it to grow rb_node to 4 words and do the fully
threaded thing as there are also a lot of rb_prev() users in the kernel.
Who knows..

Anyway, I agree that improving rb_next() is worth looking into for the
scheduler.




Re: Make yield_task_fair more efficient

2008-02-21 Thread Balbir Singh
Peter Zijlstra wrote:
> On Thu, 2008-02-21 at 15:37 +0530, Balbir Singh wrote:
>
>> You use the empty pointer (missing right child), so why do we need a list.
>> Maybe I am missing something.
>
> A fully threaded tree also has back-pointer to traverse backwards
> through the ordered elements.
>

The back pointer is implemented using missing left children. I am referring to
Don Knuth's Volume 1: The Art of Computer Programming, Section 2.3.1, page 322.
Please see the threaded representation of the tree; it is compared with the
unthreaded representation.

> That said, overloading the right child pointer might not be the best
> thing for the linux kernel, as it will impact all the rb-tree lookups
> which are open-coded and often performance critical (this is the reason
> the colour isn't bit encoded in either of the child pointers either).
>

We always look at the child, irrespective of whether it is to the right or left,
to terminate our walk. Overloading them and setting a bit stating it is threaded
should not be that bad.

> But if you only want a uni directional thread, I guess we can stick it
> in the unsigned long we use for the node colour.
>
> Still, perhaps it's worth it to grow rb_node to 4 words and do the fully
> threaded thing as there are also a lot of rb_prev() users in the kernel.
> Who knows..
>

Why does the rb_node need to grow? We can encode the bits in the children.

> Anyway, I agree that improving rb_next() is worth looking into for the
> scheduler.

Sure, will experiment and see if we can bump up performance.

-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: Make yield_task_fair more efficient

2008-02-21 Thread Mike Galbraith
Hi,

On Thu, 2008-02-21 at 15:01 +0530, Balbir Singh wrote:
> Ingo Molnar wrote:
>> * Balbir Singh [EMAIL PROTECTED] wrote:
>>
>>> If you insist that sched_yield() is bad, I might agree, but how does
>>> my patch make things worse. [...]
>>
>> it puts new instructions into the hotpath.
>>
>>> [...] In my benchmarks, it has helped the sched_yield case, why is
>>> that bad? [...]
>>
>> I had the same cache for the rightmost task in earlier CFS (it's a
>> really obvious thing) but removed it. It wasnt a bad idea, but it hurt
>> the fastpath hence i removed it. Algorithms and implementations are a
>> constant balancing act.
>
> This is more convincing, was the code ever in git? How did you measure the
> overhead?

Counting enqueue/dequeue cycles on my 3GHz P4/HT running a 60-second
netperf test that does ~85k/s context switches shows:

sched_cycles: 7198444348 unpatched
vs
sched_cycles: 8574036268 patched

-Mike

diff --git a/include/linux/sched.h b/include/linux/sched.h
index e217d18..a71edfe 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1262,6 +1262,7 @@ struct task_struct {
int latency_record_count;
struct latency_record latency_record[LT_SAVECOUNT];
 #endif
+   unsigned long long sched_cycles;
 };
 
 /*
diff --git a/kernel/exit.c b/kernel/exit.c
index 506a957..c5c3d5e 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -174,6 +174,8 @@ repeat:
 	}
 
 	write_unlock_irq(&tasklist_lock);
+	if (!strcmp("netperf", p->comm) || !strcmp("netserver", p->comm))
+		printk(KERN_WARNING "%s:%d sched_cycles: %Ld\n",
+		       p->comm, p->pid, p->sched_cycles);
 	release_thread(p);
 	call_rcu(&p->rcu, delayed_put_task_struct);
 
diff --git a/kernel/sched.c b/kernel/sched.c
index f28f19e..d7c1b0b 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1301,15 +1301,25 @@ static void set_load_weight(struct task_struct *p)
 
 static void enqueue_task(struct rq *rq, struct task_struct *p, int wakeup)
 {
+	unsigned long long then, now;
+
+	rdtscll(then);
 	sched_info_queued(p);
 	p->sched_class->enqueue_task(rq, p, wakeup);
 	p->se.on_rq = 1;
+	rdtscll(now);
+	p->sched_cycles += now - then;
 }
 
 static void dequeue_task(struct rq *rq, struct task_struct *p, int sleep)
 {
+	unsigned long long then, now;
+
+	rdtscll(then);
 	p->sched_class->dequeue_task(rq, p, sleep);
 	p->se.on_rq = 0;
+	rdtscll(now);
+	p->sched_cycles += now - then;
 }
 
 /*
@@ -2009,6 +2019,7 @@ void wake_up_new_task(struct task_struct *p, unsigned long clone_flags)
 	update_rq_clock(rq);
 
 	p->prio = effective_prio(p);
+	p->sched_cycles = 0;
 
 	if (!p->sched_class->task_new || !current->se.on_rq) {
 		activate_task(rq, p, 0);




Re: Make yield_task_fair more efficient

2008-02-21 Thread Mike Galbraith

On Thu, 2008-02-21 at 13:02 +0100, Mike Galbraith wrote:
> sched_cycles: 7198444348 unpatched
> vs
> sched_cycles: 8574036268 patched

(in case it's not clear, "patched" means your patch, not my quick/dirty
count-em patch :)



Re: Make yield_task_fair more efficient

2008-02-21 Thread Balbir Singh
Mike Galbraith wrote:
> Hi,
>
> On Thu, 2008-02-21 at 15:01 +0530, Balbir Singh wrote:
>> Ingo Molnar wrote:
>>> * Balbir Singh [EMAIL PROTECTED] wrote:
>>>
>>>> If you insist that sched_yield() is bad, I might agree, but how does
>>>> my patch make things worse. [...]
>>> it puts new instructions into the hotpath.
>>>
>>>> [...] In my benchmarks, it has helped the sched_yield case, why is
>>>> that bad? [...]
>>> I had the same cache for the rightmost task in earlier CFS (it's a
>>> really obvious thing) but removed it. It wasnt a bad idea, but it hurt
>>> the fastpath hence i removed it. Algorithms and implementations are a
>>> constant balancing act.
>> This is more convincing, was the code ever in git? How did you measure the
>> overhead?
>
> Counting enqueue/dequeue cycles on my 3GHz P4/HT running a 60-second
> netperf test that does ~85k/s context switches shows:
>
> sched_cycles: 7198444348 unpatched
> vs
> sched_cycles: 8574036268 patched


Thanks for the numbers! I am very convinced that the patch should stay out until
we can find a way to reduce the overhead. I'll try your patch and see what the
numbers look like as well.

-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: Make yield_task_fair more efficient

2008-02-21 Thread Jörn Engel
On Thu, 21 February 2008 12:21:33 +0530, Balbir Singh wrote:
 
> For a large number of tasks - say 10,000, we need to walk 14 levels before we

16.7, actually.  rbtrees are balanced trees, but they are not balanced
binary trees.  The average fan-out is sqrt(3) instead of 2.

Jörn

-- 
The cost of changing business rules is much more expensive for software
than for a secretaty.
-- unknown


Re: Make yield_task_fair more efficient

2008-02-21 Thread Jens Axboe
On Thu, Feb 21 2008, Peter Zijlstra wrote:
 
> On Thu, 2008-02-21 at 15:37 +0530, Balbir Singh wrote:
>
>> You use the empty pointer (missing right child), so why do we need a list.
>> Maybe I am missing something.
>
> A fully threaded tree also has back-pointer to traverse backwards
> through the ordered elements.
>
> That said, overloading the right child pointer might not be the best
> thing for the linux kernel, as it will impact all the rb-tree lookups
> which are open-coded and often performance critical (this is the reason
> the colour isn't bit encoded in either of the child pointers either).
>
> But if you only want a uni directional thread, I guess we can stick it
> in the unsigned long we use for the node colour.
>
> Still, perhaps it's worth it to grow rb_node to 4 words and do the fully
> threaded thing as there are also a lot of rb_prev() users in the kernel.
> Who knows..
>
> Anyway, I agree that improving rb_next() is worth looking into for the
> scheduler.

For the IO scheduler as well, it's used quite extensively! So speeding
up rb_next() would definitely help, as it's typically invoked for every
bio queued (attempting to back merge with the next request). CFQ and AS
additionally does an rb_next() and rb_prev() when trying to decide which
request to do next.

-- 
Jens Axboe



Re: Make yield_task_fair more efficient

2008-02-21 Thread Jens Axboe
On Thu, Feb 21 2008, Jens Axboe wrote:
> On Thu, Feb 21 2008, Peter Zijlstra wrote:
>> On Thu, 2008-02-21 at 15:37 +0530, Balbir Singh wrote:
>>
>>> You use the empty pointer (missing right child), so why do we need a
>>> list. Maybe I am missing something.
>>
>> A fully threaded tree also has back-pointer to traverse backwards
>> through the ordered elements.
>>
>> That said, overloading the right child pointer might not be the best
>> thing for the linux kernel, as it will impact all the rb-tree lookups
>> which are open-coded and often performance critical (this is the reason
>> the colour isn't bit encoded in either of the child pointers either).
>>
>> But if you only want a uni directional thread, I guess we can stick it
>> in the unsigned long we use for the node colour.
>>
>> Still, perhaps it's worth it to grow rb_node to 4 words and do the fully
>> threaded thing as there are also a lot of rb_prev() users in the kernel.
>> Who knows..
>>
>> Anyway, I agree that improving rb_next() is worth looking into for the
>> scheduler.
>
> For the IO scheduler as well, it's used quite extensively! So speeding
> up rb_next() would definitely help, as it's typically invoked for every
> bio queued (attempting to back merge with the next request). CFQ and AS
> additionally does an rb_next() and rb_prev() when trying to decide which
> request to do next.

One possible course of action to implement this without eating extra
space in the rb_node would be:

- Add rb_right() and rb_set_right() (plus ditto _left variants) to
  rbtree.h
- Convert all in-kernel users to use these. Quite extensive, as the
  rbtree code search/insert functions are coded in situ and not in
  rbtree.[ch]
- Now we can overload bit 0 of ->rb_right and ->rb_left to indicate
  whether this is a node or thread pointer and modify rbtree.c to tag
  and add the thread links when appropriate.

So we can definitely do this in a compatible fashion. Given that I have
a flight coming up in a few days time, I may give it a go if no one
beats me to it :-)

-- 
Jens Axboe



Re: Make yield_task_fair more efficient

2008-02-21 Thread Balbir Singh
Jens Axboe wrote:
> On Thu, Feb 21 2008, Jens Axboe wrote:
>> On Thu, Feb 21 2008, Peter Zijlstra wrote:
>>> On Thu, 2008-02-21 at 15:37 +0530, Balbir Singh wrote:
>>>
>>>> You use the empty pointer (missing right child), so why do we need a
>>>> list. Maybe I am missing something.
>>> A fully threaded tree also has back-pointer to traverse backwards
>>> through the ordered elements.
>>>
>>> That said, overloading the right child pointer might not be the best
>>> thing for the linux kernel, as it will impact all the rb-tree lookups
>>> which are open-coded and often performance critical (this is the reason
>>> the colour isn't bit encoded in either of the child pointers either).
>>>
>>> But if you only want a uni directional thread, I guess we can stick it
>>> in the unsigned long we use for the node colour.
>>>
>>> Still, perhaps it's worth it to grow rb_node to 4 words and do the fully
>>> threaded thing as there are also a lot of rb_prev() users in the kernel.
>>> Who knows..
>>>
>>> Anyway, I agree that improving rb_next() is worth looking into for the
>>> scheduler.
>> For the IO scheduler as well, it's used quite extensively! So speeding
>> up rb_next() would definitely help, as it's typically invoked for every
>> bio queued (attempting to back merge with the next request). CFQ and AS
>> additionally does an rb_next() and rb_prev() when trying to decide which
>> request to do next.
>
> One possible course of action to implement this without eating extra
> space in the rb_node would be:
>
> - Add rb_right() and rb_set_right() (plus ditto _left variants) to
>   rbtree.h
> - Convert all in-kernel users to use these. Quite extensive, as the
>   rbtree code search/insert functions are coded in situ and not in
>   rbtree.[ch]
> - Now we can overload bit 0 of ->rb_right and ->rb_left to indicate
>   whether this is a node or thread pointer and modify rbtree.c to tag
>   and add the thread links when appropriate.
>

Exactly along the lines I was thinking of and discussing with David.

> So we can definitely do this in a compatible fashion. Given that I have
> a flight coming up in a few days time, I may give it a go if no one
> beats me to it :-)
 

Feel free to do so, please do keep me on the cc. I am very interested in getting
rb threaded trees done, but my bandwidth is a little limited this month.


-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: Make yield_task_fair more efficient

2008-02-20 Thread Balbir Singh
Ingo Molnar wrote:
> * Balbir Singh <[EMAIL PROTECTED]> wrote:
> 
>> I disagree. The cost is only adding a field to cfs_rq [...]
> 
> wrong. The cost is "only" of adding a field to cfs_rq and _updating it_, 
> in the hottest paths of the scheduler:
> 
> @@ -256,6 +257,7 @@ static void __enqueue_entity(struct cfs_
>  		 */
>  		if (key < entity_key(cfs_rq, entry)) {
>  			link = &parent->rb_left;
> +			rightmost = 0;

That's an update when we move leftwards.

>  		} else {
>  			link = &parent->rb_right;
>  			leftmost = 0;
> @@ -268,6 +270,8 @@ static void __enqueue_entity(struct cfs_
>  	 */
>  	if (leftmost)
>  		cfs_rq->rb_leftmost = &se->run_node;
> +	if (rightmost)
> +		cfs_rq->rb_rightmost = &se->run_node;
>

&se->run_node is already in the cache; we are assigning cfs_rq->rb_rightmost to
it.

>> [...] For a large number of tasks - say 10,000, we need to walk 14
>> levels before we reach the node (each time). [...]
> 
> 10,000 yield-ing tasks is not a common workload we care about. It's not 
> even a rare workload we care about. _Especially_ we dont care about it 
> if it slows down every other workload (a tiny bit).
> 

sched_yield() is a supported API; also look at
http://lkml.org/lkml/2007/9/19/351. I am trying to make sched_yield() efficient
when compat_sched_yield is turned on (which is most likely), since people will
want that behaviour (Hint, please read the man-page for sched_yield). There are
already several applications using sched_yield(), so they all suffer.

>> [...] Doesn't matter if the data is cached, we are still spending CPU 
>> time looking through pointers and walking to the right node. [...]
> 
> have you actually measured how much it takes to walk the tree that deep
> on recent hardware? I have.

I have measured how much time can be saved by not doing that and it's quite a 
lot.

-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: Make yield_task_fair more efficient

2008-02-20 Thread Ingo Molnar

* Balbir Singh <[EMAIL PROTECTED]> wrote:

> I disagree. The cost is only adding a field to cfs_rq [...]

wrong. The cost is "only" of adding a field to cfs_rq and _updating it_, 
in the hottest paths of the scheduler:

@@ -256,6 +257,7 @@ static void __enqueue_entity(struct cfs_
 		 */
 		if (key < entity_key(cfs_rq, entry)) {
 			link = &parent->rb_left;
+			rightmost = 0;
 		} else {
 			link = &parent->rb_right;
 			leftmost = 0;
@@ -268,6 +270,8 @@ static void __enqueue_entity(struct cfs_
 	 */
 	if (leftmost)
 		cfs_rq->rb_leftmost = &se->run_node;
+	if (rightmost)
+		cfs_rq->rb_rightmost = &se->run_node;

> [...] For a large number of tasks - say 10,000, we need to walk 14
> levels before we reach the node (each time). [...]

10,000 yield-ing tasks is not a common workload we care about. It's not 
even a rare workload we care about. _Especially_ we dont care about it 
if it slows down every other workload (a tiny bit).

> [...] Doesn't matter if the data is cached, we are still spending CPU 
> time looking through pointers and walking to the right node. [...]

have you actually measured how much it takes to walk the tree that deep
on recent hardware? I have.

Ingo


Re: Make yield_task_fair more efficient

2008-02-20 Thread Balbir Singh
Ingo Molnar wrote:
> * Balbir Singh <[EMAIL PROTECTED]> wrote:
> 
>> __pick_last_entity() walks the entire tree in O(lg n) time to find the 
>> rightmost entry. This patch makes the routine more efficient by 
>> reducing the cost of the lookup.
> 
> hm, i'm not sure we want to do this: we'd be slowing down the fastpath 
> of all the other common scheduler functions, for the sake of a rarely 
> used (and broken ...) API: yield. And note that an rbtree walk is not 
> slow at all - if you are yielding frequently it's probably all cached.
> 
>   Ingo

Ingo,

I disagree. The cost is only adding a field to cfs_rq and we already have the
logic to track the leftmost node, we just update the rightmost node as well.
For a large number of tasks - say 10,000, we need to walk 14 levels before we
reach the node (each time). Doesn't matter if the data is cached, we are still
spending CPU time looking through pointers and walking to the right node. If all
the threads get a chance to run, you can imagine the effect it has on efficiency
and the data cache.

I see a boost in sched_yield performance and no big hit on regular performance.
Could you please test it to see if your apprehensions are correct?

PS: You missed adding me on the cc/to list.


-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: Make yield_task_fair more efficient

2008-02-20 Thread Ingo Molnar

* Balbir Singh <[EMAIL PROTECTED]> wrote:

> __pick_last_entity() walks the entire tree in O(lg n) time to find the 
> rightmost entry. This patch makes the routine more efficient by 
> reducing the cost of the lookup.

hm, i'm not sure we want to do this: we'd be slowing down the fastpath 
of all the other common scheduler functions, for the sake of a rarely 
used (and broken ...) API: yield. And note that an rbtree walk is not 
slow at all - if you are yielding frequently it's probably all cached.

Ingo


Re: Make yield_task_fair more efficient

2008-02-20 Thread Balbir Singh
Ingo Molnar wrote:
> * Balbir Singh <[EMAIL PROTECTED]> wrote:
> 
>> I disagree. The cost is only adding a field to cfs_rq [...]
> 
> wrong. The cost is "only" of adding a field to cfs_rq and _updating it_, 
> in the hottest paths of the scheduler:
> 
> @@ -256,6 +257,7 @@ static void __enqueue_entity(struct cfs_
>  */
> if (key < entity_key(cfs_rq, entry)) {
> link = &parent->rb_left;
> +   rightmost = 0;
That's an update when we move leftwards.

> } else {
> link = &parent->rb_right;
> leftmost = 0;
> @@ -268,6 +270,8 @@ static void __enqueue_entity(struct cfs_
>  */
> if (leftmost)
> cfs_rq->rb_leftmost = &se->run_node;
> +   if (rightmost)
> +   cfs_rq->rb_rightmost = &se->run_node;
> 

se->run_node is already in the cache; we are assigning cfs_rq->rb_rightmost
to it.

>> [...] For a large number of tasks - say 10,000, we need to walk 14 
>> levels before we reach the node (each time). [...]
> 
> 10,000 yield-ing tasks is not a common workload we care about. It's not 
> even a rare workload we care about. _Especially_ we don't care about it 
> if it slows down every other workload (a tiny bit).
> 

sched_yield() is a supported API; also look at
http://lkml.org/lkml/2007/9/19/351. I am trying to make sched_yield() efficient
when compat_sched_yield is turned on (which is most likely), since people will
want that behaviour (hint: please read the man page for sched_yield). There are
already several applications using sched_yield(), so they all suffer.

>> [...] Doesn't matter if the data is cached, we are still spending CPU 
>> time looking through pointers and walking to the right node. [...]
> 
> have you actually measured how much it takes to walk the tree that deep
> on recent hardware? I have.

I have measured how much time can be saved by not doing that and it's quite a 
lot.

-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL