Re: [PATCH] FUTEX : new PRIVATE futexes

2007-04-06 Thread Jan Engelhardt

On Apr 6 2007 07:05, Hugh Dickins wrote:
>On Fri, 6 Apr 2007, Nick Piggin wrote:
>> 
>> I like |= for adding flags, it seems less ambiguous. But I guess that's
>> a matter of opinion. Hugh seems to like +=,
>
>Do I?  You probably have a shaming example in mind (PAGE_MAPPING_ANON?
>that's a hybrid case where using + and - helped minimize the casting);
>but in general I'd agree with you that it's |= for setting flag bits.
>
>Hmm, Eric's FUT_OFF_INODE is hybrid too, that might justify the +=

If a bit is already set, you can't set it again using +=.


Jan
-- 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] FUTEX : new PRIVATE futexes

2007-04-06 Thread Peter Zijlstra
On Thu, 2007-04-05 at 19:49 +0200, Eric Dumazet wrote:
> Hi all
> 
> I'm pleased to present this patch which improves linux futexes performance 
> and 
> scalability, merely avoiding taking mmap_sem rwlock.
> 
> Ulrich agreed with the API and said glibc work could start as soon
> as he gets a Fedora kernel with it :)
> 
> Andrew, could we get this in mm as well ? This version is against 
> 2.6.21-rc5-mm4
> (so supports 64bit futexes)
> 
> In this third version I dropped the NUMA optims and process private hash 
> table,
> to let new API come in and be tested.

Good work, Thanks!

Acked-by: Peter Zijlstra <[EMAIL PROTECTED]>

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] FUTEX : new PRIVATE futexes

2007-04-06 Thread Nick Piggin

Eric Dumazet wrote:

Nick Piggin a écrit :



Did the whole thing just turn out neater when you passed the rwsem?
We always know to use current->mm->mmap_sem, so it doesn't seem like
a boolean flag would hurt?



That's a good question

current->mm->mmap_sem being calculated once is a win in itself, because 
current access is not cheap.
It also does the memory access to go through part of the chain in 
advance, before its use. It does a prefetch() equivalent for free : If 
current->mm is not in CPU cache, CPU wont stall because next 
instructions dont depend on it.


Fair enough. Current access I think should be cheap though (it is
effectively a constant), but I guess it is still improvement.


Shouldn't that be sizeof(long) to handle 64 bit futexes? Or strictly, it
should depend on the size of the operation. Maybe the access_ok check
should go outside get_futex_key?



If you check again, you'll see that address points to the start of the 
PAGE, not the real u32/u64 futex address. This checks the PAGE. We can 
use char, short, int, long, or char[PAGE_SIZE] as long as we know a 
futex cannot span two pages.


Ah, that works.


  */
 key->shared.inode = vma->vm_file->f_path.dentry->d_inode;
-key->both.offset++; /* Bit 0 of offset indicates inode-based 
key. */

+key->both.offset += FUT_OFF_INODE; /* inode-based key. */
 if (likely(!(vma->vm_flags & VM_NONLINEAR))) {
 key->shared.pgoff = (((address - vma->vm_start) >> PAGE_SHIFT)
  + vma->vm_pgoff);



I like |= for adding flags, it seems less ambiguous. But I guess that's
a matter of opinion. Hugh seems to like +=, and I can't argue with him
about style issues ;)




Previous code was doing offset++ wich means offset += 1;


But it doesn't mean you have to ;)


@@ -1598,6 +1656,8 @@ static int futex_wait(unsigned long __us
 restart->arg1 = val;
 restart->arg2 = (unsigned long)abs_time;
 restart->arg3 = (unsigned long)futex64;
+if (shared)
+restart->arg3 |= 2;



Could you make this into a proper flags argument and use #define 
CONSTANTs for it?



Yes, but I'm not sure it will improve readability.


Well that bit of code alone is obviously unreadable.

restart->arg3 = 0;
if (futex64)
restart->arg3 |= FUTEX_64;
if (shared)
restart->arg3 |= FUTEX_SHARED;

Maybe a matter of taste.






@@ -2377,23 +2455,24 @@ sys_futex64(u64 __user *uaddr, int op, u
 struct timespec ts;
 ktime_t t, *tp = NULL;
 u64 val2 = 0;
+int opm = op & FUTEX_CMD_MASK;



What's opm stand for?



I guess 'm' stands for 'mask' or 'masked' ?


Why not call it cmd? (ie. what it is, rather than what you have done
to derive it).

--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] FUTEX : new PRIVATE futexes

2007-04-06 Thread Hugh Dickins
On Fri, 6 Apr 2007, Nick Piggin wrote:
> 
> I like |= for adding flags, it seems less ambiguous. But I guess that's
> a matter of opinion. Hugh seems to like +=,

Do I?  You probably have a shaming example in mind (PAGE_MAPPING_ANON?
that's a hybrid case where using + and - helped minimize the casting);
but in general I'd agree with you that it's |= for setting flag bits.

Hmm, Eric's FUT_OFF_INODE is hybrid too, that might justify the +=

> and I can't argue with him about style issues ;)

I feel a warm glow.  Nobody ever called me a style guru before.

Hugh

p.s.  Please don't interpret this as any useful contribution to
reviewing Eric's futex work: seems sensible, but I've hardly looked.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] FUTEX : new PRIVATE futexes

2007-04-06 Thread Hugh Dickins
On Fri, 6 Apr 2007, Nick Piggin wrote:
 
 I like |= for adding flags, it seems less ambiguous. But I guess that's
 a matter of opinion. Hugh seems to like +=,

Do I?  You probably have a shaming example in mind (PAGE_MAPPING_ANON?
that's a hybrid case where using + and - helped minimize the casting);
but in general I'd agree with you that it's |= for setting flag bits.

Hmm, Eric's FUT_OFF_INODE is hybrid too, that might justify the +=

 and I can't argue with him about style issues ;)

I feel a warm glow.  Nobody ever called me a style guru before.

Hugh

p.s.  Please don't interpret this as any useful contribution to
reviewing Eric's futex work: seems sensible, but I've hardly looked.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] FUTEX : new PRIVATE futexes

2007-04-06 Thread Nick Piggin

Eric Dumazet wrote:

Nick Piggin a écrit :



Did the whole thing just turn out neater when you passed the rwsem?
We always know to use current-mm-mmap_sem, so it doesn't seem like
a boolean flag would hurt?



That's a good question

current-mm-mmap_sem being calculated once is a win in itself, because 
current access is not cheap.
It also does the memory access to go through part of the chain in 
advance, before its use. It does a prefetch() equivalent for free : If 
current-mm is not in CPU cache, CPU wont stall because next 
instructions dont depend on it.


Fair enough. Current access I think should be cheap though (it is
effectively a constant), but I guess it is still improvement.


Shouldn't that be sizeof(long) to handle 64 bit futexes? Or strictly, it
should depend on the size of the operation. Maybe the access_ok check
should go outside get_futex_key?



If you check again, you'll see that address points to the start of the 
PAGE, not the real u32/u64 futex address. This checks the PAGE. We can 
use char, short, int, long, or char[PAGE_SIZE] as long as we know a 
futex cannot span two pages.


Ah, that works.


  */
 key-shared.inode = vma-vm_file-f_path.dentry-d_inode;
-key-both.offset++; /* Bit 0 of offset indicates inode-based 
key. */

+key-both.offset += FUT_OFF_INODE; /* inode-based key. */
 if (likely(!(vma-vm_flags  VM_NONLINEAR))) {
 key-shared.pgoff = (((address - vma-vm_start)  PAGE_SHIFT)
  + vma-vm_pgoff);



I like |= for adding flags, it seems less ambiguous. But I guess that's
a matter of opinion. Hugh seems to like +=, and I can't argue with him
about style issues ;)




Previous code was doing offset++ wich means offset += 1;


But it doesn't mean you have to ;)


@@ -1598,6 +1656,8 @@ static int futex_wait(unsigned long __us
 restart-arg1 = val;
 restart-arg2 = (unsigned long)abs_time;
 restart-arg3 = (unsigned long)futex64;
+if (shared)
+restart-arg3 |= 2;



Could you make this into a proper flags argument and use #define 
CONSTANTs for it?



Yes, but I'm not sure it will improve readability.


Well that bit of code alone is obviously unreadable.

restart-arg3 = 0;
if (futex64)
restart-arg3 |= FUTEX_64;
if (shared)
restart-arg3 |= FUTEX_SHARED;

Maybe a matter of taste.






@@ -2377,23 +2455,24 @@ sys_futex64(u64 __user *uaddr, int op, u
 struct timespec ts;
 ktime_t t, *tp = NULL;
 u64 val2 = 0;
+int opm = op  FUTEX_CMD_MASK;



What's opm stand for?



I guess 'm' stands for 'mask' or 'masked' ?


Why not call it cmd? (ie. what it is, rather than what you have done
to derive it).

--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] FUTEX : new PRIVATE futexes

2007-04-06 Thread Peter Zijlstra
On Thu, 2007-04-05 at 19:49 +0200, Eric Dumazet wrote:
 Hi all
 
 I'm pleased to present this patch which improves linux futexes performance 
 and 
 scalability, merely avoiding taking mmap_sem rwlock.
 
 Ulrich agreed with the API and said glibc work could start as soon
 as he gets a Fedora kernel with it :)
 
 Andrew, could we get this in mm as well ? This version is against 
 2.6.21-rc5-mm4
 (so supports 64bit futexes)
 
 In this third version I dropped the NUMA optims and process private hash 
 table,
 to let new API come in and be tested.

Good work, Thanks!

Acked-by: Peter Zijlstra [EMAIL PROTECTED]

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] FUTEX : new PRIVATE futexes

2007-04-06 Thread Jan Engelhardt

On Apr 6 2007 07:05, Hugh Dickins wrote:
On Fri, 6 Apr 2007, Nick Piggin wrote:
 
 I like |= for adding flags, it seems less ambiguous. But I guess that's
 a matter of opinion. Hugh seems to like +=,

Do I?  You probably have a shaming example in mind (PAGE_MAPPING_ANON?
that's a hybrid case where using + and - helped minimize the casting);
but in general I'd agree with you that it's |= for setting flag bits.

Hmm, Eric's FUT_OFF_INODE is hybrid too, that might justify the +=

If a bit is already set, you can't set it again using +=.


Jan
-- 
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] FUTEX : new PRIVATE futexes

2007-04-05 Thread Eric Dumazet

Nick Piggin a écrit :

Hi Eric,

Thanks for doing this... It's looking good, I just have some minor
comments:


Hi Nick, thanks for reviewing.



Eric Dumazet wrote:

  */
-int get_futex_key(void __user *uaddr, union futex_key *key)
+int get_futex_key(void __user *uaddr, union futex_key *key,
+struct rw_semaphore *shared)


Can we pass in something other than the rw_semaphore here? Seeing as
it only actually gets used as a flag, it might be nicer just to pass
a 0 or 1? And all through the call stack...

Did the whole thing just turn out neater when you passed the rwsem?
We always know to use current->mm->mmap_sem, so it doesn't seem like
a boolean flag would hurt?


That's a good question

current->mm->mmap_sem being calculated once is a win in itself, because 
current access is not cheap.
It also does the memory access to go through part of the chain in advance, 
before its use. It does a prefetch() equivalent for free : If current->mm is 
not in CPU cache, CPU wont stall because next instructions dont depend on it.


This means less CPU stall in case current->mm is not in CPU cache. Thats 
difficult to benchmark it, but you can trust me.


A flag means :

if (flag)
up_read(>mm->mmap_sem)

This generates quite a bad code.

if (ptr)
   up_read(ptr)

generates *much* better code.

So this is a cleanup and a runtime optimization.

I dit a similar optimization on commit 163da958ba5282cbf85e8b3dc08e4f51f8b01c5e

I invite you to check it :

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=163da958ba5282cbf85e8b3dc08e4f51f8b01c5e






 {
 unsigned long address = (unsigned long)uaddr;
 struct mm_struct *mm = current->mm;
@@ -218,6 +224,22 @@ int get_futex_key(void __user *uaddr, un
 address -= key->both.offset;
 
 /*

+ * PROCESS_PRIVATE futexes are fast.
+ * As the mm cannot disappear under us and the 'key' only needs
+ * virtual address, we dont even have to find the underlying vma.
+ * Note : We do have to check 'address' is a valid user address,
+ *but access_ok() should be faster than find_vma()
+ * Note : At this point, address points to the start of page,
+ *not the real futex address, this is ok.
+ */
+if (!shared) {
+if (!access_ok(VERIFY_WRITE, address, sizeof(int)))
+return -EFAULT;


Shouldn't that be sizeof(long) to handle 64 bit futexes? Or strictly, it
should depend on the size of the operation. Maybe the access_ok check
should go outside get_futex_key?


If you check again, you'll see that address points to the start of the PAGE, 
not the real u32/u64 futex address. This checks the PAGE. We can use char, 
short, int, long, or char[PAGE_SIZE] as long as we know a futex cannot span 
two pages.




  */
 key->shared.inode = vma->vm_file->f_path.dentry->d_inode;
-key->both.offset++; /* Bit 0 of offset indicates inode-based key. */
+key->both.offset += FUT_OFF_INODE; /* inode-based key. */
 if (likely(!(vma->vm_flags & VM_NONLINEAR))) {
 key->shared.pgoff = (((address - vma->vm_start) >> PAGE_SHIFT)
  + vma->vm_pgoff);


I like |= for adding flags, it seems less ambiguous. But I guess that's
a matter of opinion. Hugh seems to like +=, and I can't argue with him
about style issues ;)



Previous code was doing offset++ wich means offset += 1;
I didnt want to hurt Hugh :)


 EXPORT_SYMBOL_GPL(drop_futex_key_refs);


I wonder if it would be worthwhile inlining and likley()ing the
private fastpath? Might make it pretty compact... I guess that's
something to worry about after glibc gets support.


Yes, in a future patch, in about one year :)


+
+if (!(vma = find_vma(mm, address)) ||
+vma->vm_start > address || !(vma->vm_flags & VM_WRITE))
+ret = -EFAULT;
+
+else
+switch (handle_mm_fault(mm, vma, address, 1)) {
+case VM_FAULT_MINOR:
+current->min_flt++;
+break;
+case VM_FAULT_MAJOR:
+current->maj_flt++;
+break;
+default:
+ret = -EFAULT;
+}
+if (!shared)
+up_read(>mmap_sem);
+return ret;
 }
 
 /*


You've got an extra space after the if (maybe for clarity?). In this
situation I prefer putting braces around both the if and the else, and
if you get rid of that blank line, it doesn't cost you anything more ;)


Oh well...




@@ -1598,6 +1656,8 @@ static int futex_wait(unsigned long __us
 restart->arg1 = val;
 restart->arg2 = (unsigned long)abs_time;
 restart->arg3 = (unsigned long)futex64;
+if (shared)
+restart->arg3 |= 2;


Could you make this into a proper flags argument and use #define 
CONSTANTs for it?


Yes, but I'm not sure it will improve readability.




@@ -2377,23 +2455,24 @@ sys_futex64(u64 __user *uaddr, int op, u
 struct timespec ts;
 ktime_t t, *tp = NULL;
 u64 val2 = 0;
+int opm = op & FUTEX_CMD_MASK;


What's opm stand for?



Re: [PATCH] FUTEX : new PRIVATE futexes

2007-04-05 Thread Nick Piggin

Hi Eric,

Thanks for doing this... It's looking good, I just have some minor
comments:

Eric Dumazet wrote:


Signed-off-by: Eric Dumazet <[EMAIL PROTECTED]>



--- linux-2.6.21-rc5-mm4/kernel/futex.c
+++ linux-2.6.21-rc5-mm4-ed/kernel/futex.c
@@ -16,6 +16,9 @@
  *  Copyright (C) 2006 Red Hat, Inc., Ingo Molnar <[EMAIL PROTECTED]>
  *  Copyright (C) 2006 Timesys Corp., Thomas Gleixner <[EMAIL PROTECTED]>
  *
+ *  PRIVATE futexes by Eric Dumazet
+ *  Copyright (C) 2007 Eric Dumazet <[EMAIL PROTECTED]>
+ *
  *  Thanks to Ben LaHaise for yelling "hashed waitqueues" loudly
  *  enough at me, Linus for the original (flawed) idea, Matthew
  *  Kirkwood for proof-of-concept implementation.
@@ -199,9 +202,12 @@ static inline int match_futex(union fute
  * Returns: 0, or negative error code.
  * The key words are stored in *key on success.
  *
- * Should be called with >mm->mmap_sem but NOT any spinlocks.
+ * shared is NULL for PROCESS_PRIVATE futexes
+ * For other futexes, it points to >mm->mmap_sem and
+ * caller must have taken the reader lock. but NOT any spinlocks.
  */
-int get_futex_key(void __user *uaddr, union futex_key *key)
+int get_futex_key(void __user *uaddr, union futex_key *key,
+   struct rw_semaphore *shared)


Can we pass in something other than the rw_semaphore here? Seeing as
it only actually gets used as a flag, it might be nicer just to pass
a 0 or 1? And all through the call stack...

Did the whole thing just turn out neater when you passed the rwsem?
We always know to use current->mm->mmap_sem, so it doesn't seem like
a boolean flag would hurt?


 {
unsigned long address = (unsigned long)uaddr;
struct mm_struct *mm = current->mm;
@@ -218,6 +224,22 @@ int get_futex_key(void __user *uaddr, un
address -= key->both.offset;
 
 	/*

+* PROCESS_PRIVATE futexes are fast.
+* As the mm cannot disappear under us and the 'key' only needs
+* virtual address, we dont even have to find the underlying vma.
+* Note : We do have to check 'address' is a valid user address,
+*but access_ok() should be faster than find_vma()
+* Note : At this point, address points to the start of page,
+*not the real futex address, this is ok.
+*/
+   if (!shared) {
+   if (!access_ok(VERIFY_WRITE, address, sizeof(int)))
+   return -EFAULT;


Shouldn't that be sizeof(long) to handle 64 bit futexes? Or strictly, it
should depend on the size of the operation. Maybe the access_ok check
should go outside get_futex_key?



+   key->private.mm = mm;
+   key->private.address = address;
+   return 0;
+   }
+   /*
 * The futex is hashed differently depending on whether
 * it's in a shared or private mapping.  So check vma first.
 */
@@ -244,6 +266,7 @@ int get_futex_key(void __user *uaddr, un
 * mappings of _writable_ handles.
 */
if (likely(!(vma->vm_flags & VM_MAYSHARE))) {
+   key->both.offset += FUT_OFF_MMSHARED; /* reference taken on mm 
*/
key->private.mm = mm;
key->private.address = address;
return 0;
@@ -253,7 +276,7 @@ int get_futex_key(void __user *uaddr, un
 * Linear file mappings are also simple.
 */
key->shared.inode = vma->vm_file->f_path.dentry->d_inode;
-   key->both.offset++; /* Bit 0 of offset indicates inode-based key. */
+   key->both.offset += FUT_OFF_INODE; /* inode-based key. */
if (likely(!(vma->vm_flags & VM_NONLINEAR))) {
key->shared.pgoff = (((address - vma->vm_start) >> PAGE_SHIFT)
 + vma->vm_pgoff);


I like |= for adding flags, it seems less ambiguous. But I guess that's
a matter of opinion. Hugh seems to like +=, and I can't argue with him
about style issues ;)


@@ -281,17 +304,19 @@ EXPORT_SYMBOL_GPL(get_futex_key);
  * Take a reference to the resource addressed by a key.
  * Can be called while holding spinlocks.
  *
- * NOTE: mmap_sem MUST be held between get_futex_key() and calling this
- * function, if it is called at all.  mmap_sem keeps key->shared.inode valid.
  */
 inline void get_futex_key_refs(union futex_key *key)
 {
-   if (key->both.ptr != 0) {
-   if (key->both.offset & 1)
+   if (key->both.ptr == 0)
+   return;
+   switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) {
+   case FUT_OFF_INODE:
atomic_inc(>shared.inode->i_count);
-   else
+   break;
+   case FUT_OFF_MMSHARED:
atomic_inc(>private.mm->mm_count);
-   }
+   break;
+   }
 }
 EXPORT_SYMBOL_GPL(get_futex_key_refs);
 
@@ -301,11 +326,15 @@ EXPORT_SYMBOL_GPL(get_futex_key_refs);

  */
 void drop_futex_key_refs(union futex_key *key)
 {
-   if (key->both.ptr != 0) {
-  

Re: [PATCH] FUTEX : new PRIVATE futexes

2007-04-05 Thread Ulrich Drepper

On 4/5/07, Eric Dumazet <[EMAIL PROTECTED]> wrote:

Signed-off-by: Eric Dumazet <[EMAIL PROTECTED]>

---
 include/linux/futex.h |   45 +-
 kernel/futex.c|  294 +---
 2 files changed, 230 insertions(+), 109 deletions(-)


I cannot vouch for the whole code but the concept is very sound and
definitely along the way the specification allows it:

Acked-by: Ulrich Drepper <[EMAIL PROTECTED]>
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] FUTEX : new PRIVATE futexes

2007-04-05 Thread Ulrich Drepper

On 4/5/07, Eric Dumazet [EMAIL PROTECTED] wrote:

Signed-off-by: Eric Dumazet [EMAIL PROTECTED]

---
 include/linux/futex.h |   45 +-
 kernel/futex.c|  294 +---
 2 files changed, 230 insertions(+), 109 deletions(-)


I cannot vouch for the whole code but the concept is very sound and
definitely along the way the specification allows it:

Acked-by: Ulrich Drepper [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] FUTEX : new PRIVATE futexes

2007-04-05 Thread Nick Piggin

Hi Eric,

Thanks for doing this... It's looking good, I just have some minor
comments:

Eric Dumazet wrote:


Signed-off-by: Eric Dumazet [EMAIL PROTECTED]



--- linux-2.6.21-rc5-mm4/kernel/futex.c
+++ linux-2.6.21-rc5-mm4-ed/kernel/futex.c
@@ -16,6 +16,9 @@
  *  Copyright (C) 2006 Red Hat, Inc., Ingo Molnar [EMAIL PROTECTED]
  *  Copyright (C) 2006 Timesys Corp., Thomas Gleixner [EMAIL PROTECTED]
  *
+ *  PRIVATE futexes by Eric Dumazet
+ *  Copyright (C) 2007 Eric Dumazet [EMAIL PROTECTED]
+ *
  *  Thanks to Ben LaHaise for yelling hashed waitqueues loudly
  *  enough at me, Linus for the original (flawed) idea, Matthew
  *  Kirkwood for proof-of-concept implementation.
@@ -199,9 +202,12 @@ static inline int match_futex(union fute
  * Returns: 0, or negative error code.
  * The key words are stored in *key on success.
  *
- * Should be called with current-mm-mmap_sem but NOT any spinlocks.
+ * shared is NULL for PROCESS_PRIVATE futexes
+ * For other futexes, it points to current-mm-mmap_sem and
+ * caller must have taken the reader lock. but NOT any spinlocks.
  */
-int get_futex_key(void __user *uaddr, union futex_key *key)
+int get_futex_key(void __user *uaddr, union futex_key *key,
+   struct rw_semaphore *shared)


Can we pass in something other than the rw_semaphore here? Seeing as
it only actually gets used as a flag, it might be nicer just to pass
a 0 or 1? And all through the call stack...

Did the whole thing just turn out neater when you passed the rwsem?
We always know to use current-mm-mmap_sem, so it doesn't seem like
a boolean flag would hurt?


 {
unsigned long address = (unsigned long)uaddr;
struct mm_struct *mm = current-mm;
@@ -218,6 +224,22 @@ int get_futex_key(void __user *uaddr, un
address -= key-both.offset;
 
 	/*

+* PROCESS_PRIVATE futexes are fast.
+* As the mm cannot disappear under us and the 'key' only needs
+* virtual address, we dont even have to find the underlying vma.
+* Note : We do have to check 'address' is a valid user address,
+*but access_ok() should be faster than find_vma()
+* Note : At this point, address points to the start of page,
+*not the real futex address, this is ok.
+*/
+   if (!shared) {
+   if (!access_ok(VERIFY_WRITE, address, sizeof(int)))
+   return -EFAULT;


Shouldn't that be sizeof(long) to handle 64 bit futexes? Or strictly, it
should depend on the size of the operation. Maybe the access_ok check
should go outside get_futex_key?



+   key-private.mm = mm;
+   key-private.address = address;
+   return 0;
+   }
+   /*
 * The futex is hashed differently depending on whether
 * it's in a shared or private mapping.  So check vma first.
 */
@@ -244,6 +266,7 @@ int get_futex_key(void __user *uaddr, un
 * mappings of _writable_ handles.
 */
if (likely(!(vma-vm_flags  VM_MAYSHARE))) {
+   key-both.offset += FUT_OFF_MMSHARED; /* reference taken on mm 
*/
key-private.mm = mm;
key-private.address = address;
return 0;
@@ -253,7 +276,7 @@ int get_futex_key(void __user *uaddr, un
 * Linear file mappings are also simple.
 */
key-shared.inode = vma-vm_file-f_path.dentry-d_inode;
-   key-both.offset++; /* Bit 0 of offset indicates inode-based key. */
+   key-both.offset += FUT_OFF_INODE; /* inode-based key. */
if (likely(!(vma-vm_flags  VM_NONLINEAR))) {
key-shared.pgoff = (((address - vma-vm_start)  PAGE_SHIFT)
 + vma-vm_pgoff);


I like |= for adding flags, it seems less ambiguous. But I guess that's
a matter of opinion. Hugh seems to like +=, and I can't argue with him
about style issues ;)


@@ -281,17 +304,19 @@ EXPORT_SYMBOL_GPL(get_futex_key);
  * Take a reference to the resource addressed by a key.
  * Can be called while holding spinlocks.
  *
- * NOTE: mmap_sem MUST be held between get_futex_key() and calling this
- * function, if it is called at all.  mmap_sem keeps key-shared.inode valid.
  */
 inline void get_futex_key_refs(union futex_key *key)
 {
-   if (key-both.ptr != 0) {
-   if (key-both.offset  1)
+   if (key-both.ptr == 0)
+   return;
+   switch (key-both.offset  (FUT_OFF_INODE|FUT_OFF_MMSHARED)) {
+   case FUT_OFF_INODE:
atomic_inc(key-shared.inode-i_count);
-   else
+   break;
+   case FUT_OFF_MMSHARED:
atomic_inc(key-private.mm-mm_count);
-   }
+   break;
+   }
 }
 EXPORT_SYMBOL_GPL(get_futex_key_refs);
 
@@ -301,11 +326,15 @@ EXPORT_SYMBOL_GPL(get_futex_key_refs);

  */
 void drop_futex_key_refs(union futex_key *key)
 {
-   if (key-both.ptr != 0) {
-   if (key-both.offset  

Re: [PATCH] FUTEX : new PRIVATE futexes

2007-04-05 Thread Eric Dumazet

Nick Piggin a écrit :

Hi Eric,

Thanks for doing this... It's looking good, I just have some minor
comments:


Hi Nick, thanks for reviewing.



Eric Dumazet wrote:

  */
-int get_futex_key(void __user *uaddr, union futex_key *key)
+int get_futex_key(void __user *uaddr, union futex_key *key,
+struct rw_semaphore *shared)


Can we pass in something other than the rw_semaphore here? Seeing as
it only actually gets used as a flag, it might be nicer just to pass
a 0 or 1? And all through the call stack...

Did the whole thing just turn out neater when you passed the rwsem?
We always know to use current-mm-mmap_sem, so it doesn't seem like
a boolean flag would hurt?


That's a good question

current-mm-mmap_sem being calculated once is a win in itself, because 
current access is not cheap.
It also does the memory access to go through part of the chain in advance, 
before its use. It does a prefetch() equivalent for free : If current-mm is 
not in CPU cache, CPU wont stall because next instructions dont depend on it.


This means less CPU stall in case current-mm is not in CPU cache. Thats 
difficult to benchmark it, but you can trust me.


A flag means :

if (flag)
up_read(current-mm-mmap_sem)

This generates quite a bad code.

if (ptr)
   up_read(ptr)

generates *much* better code.

So this is a cleanup and a runtime optimization.

I dit a similar optimization on commit 163da958ba5282cbf85e8b3dc08e4f51f8b01c5e

I invite you to check it :

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=163da958ba5282cbf85e8b3dc08e4f51f8b01c5e






 {
 unsigned long address = (unsigned long)uaddr;
 struct mm_struct *mm = current-mm;
@@ -218,6 +224,22 @@ int get_futex_key(void __user *uaddr, un
 address -= key-both.offset;
 
 /*

+ * PROCESS_PRIVATE futexes are fast.
+ * As the mm cannot disappear under us and the 'key' only needs
+ * virtual address, we dont even have to find the underlying vma.
+ * Note : We do have to check 'address' is a valid user address,
+ *but access_ok() should be faster than find_vma()
+ * Note : At this point, address points to the start of page,
+ *not the real futex address, this is ok.
+ */
+if (!shared) {
+if (!access_ok(VERIFY_WRITE, address, sizeof(int)))
+return -EFAULT;


Shouldn't that be sizeof(long) to handle 64 bit futexes? Or strictly, it
should depend on the size of the operation. Maybe the access_ok check
should go outside get_futex_key?


If you check again, you'll see that address points to the start of the PAGE, 
not the real u32/u64 futex address. This checks the PAGE. We can use char, 
short, int, long, or char[PAGE_SIZE] as long as we know a futex cannot span 
two pages.




  */
 key-shared.inode = vma-vm_file-f_path.dentry-d_inode;
-key-both.offset++; /* Bit 0 of offset indicates inode-based key. */
+key-both.offset += FUT_OFF_INODE; /* inode-based key. */
 if (likely(!(vma-vm_flags  VM_NONLINEAR))) {
 key-shared.pgoff = (((address - vma-vm_start)  PAGE_SHIFT)
  + vma-vm_pgoff);


I like |= for adding flags, it seems less ambiguous. But I guess that's
a matter of opinion. Hugh seems to like +=, and I can't argue with him
about style issues ;)



Previous code was doing offset++ wich means offset += 1;
I didnt want to hurt Hugh :)


 EXPORT_SYMBOL_GPL(drop_futex_key_refs);


I wonder if it would be worthwhile inlining and likley()ing the
private fastpath? Might make it pretty compact... I guess that's
something to worry about after glibc gets support.


Yes, in a future patch, in about one year :)


+
+if (!(vma = find_vma(mm, address)) ||
+vma-vm_start  address || !(vma-vm_flags  VM_WRITE))
+ret = -EFAULT;
+
+else
+switch (handle_mm_fault(mm, vma, address, 1)) {
+case VM_FAULT_MINOR:
+current-min_flt++;
+break;
+case VM_FAULT_MAJOR:
+current-maj_flt++;
+break;
+default:
+ret = -EFAULT;
+}
+if (!shared)
+up_read(mm-mmap_sem);
+return ret;
 }
 
 /*


You've got an extra space after the if (maybe for clarity?). In this
situation I prefer putting braces around both the if and the else, and
if you get rid of that blank line, it doesn't cost you anything more ;)


Oh well...




@@ -1598,6 +1656,8 @@ static int futex_wait(unsigned long __us
 restart-arg1 = val;
 restart-arg2 = (unsigned long)abs_time;
 restart-arg3 = (unsigned long)futex64;
+if (shared)
+restart-arg3 |= 2;


Could you make this into a proper flags argument and use #define 
CONSTANTs for it?


Yes, but I'm not sure it will improve readability.




@@ -2377,23 +2455,24 @@ sys_futex64(u64 __user *uaddr, int op, u
 struct timespec ts;
 ktime_t t, *tp = NULL;
 u64 val2 = 0;
+int opm = op  FUTEX_CMD_MASK;


What's opm stand for?


I guess 'm' stands for