Re: [patch] generic rwsems

2007-04-13 Thread David Howells
Eric Dumazet <[EMAIL PROTECTED]> wrote:

> If space considerations are that important, we could then reserve one bit
> for the 'wait_lock spinlock'

That makes life quite a bit more tricky, though it does have the advantage
that it closes the reader-jumping-writer window I mentioned.

> Another possibility to save space would be to move wait_lock/wait_list
> outside of rw_semaphore, in a hashed global array.

I suspect moving wait_list out would be a bad idea.  The ordering of things in
the list is very important.  You need to perform several operations on the
list, all of which would be potentially slower (see the sketch after this list):

 (1) glance at the first element of the list to see what sort of wake up to do

 (2) iteration of the list when waking up multiple readers

 (3) seeing if the list is empty (so you know that there's no more contention)
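
All three of those operations are cheap while the queue head is embedded in the
semaphore itself.  A rough sketch, assuming the rwsem_waiter layout that
lib/rwsem.c uses today; the function name is illustrative and it ignores the
task-clearing subtleties of the real wakeup path:

	#include <linux/list.h>
	#include <linux/sched.h>
	#include <linux/rwsem.h>

	struct rwsem_waiter {
		struct list_head	list;
		struct task_struct	*task;
		unsigned int		flags;
	#define RWSEM_WAITING_FOR_READ	0x00000001
	#define RWSEM_WAITING_FOR_WRITE	0x00000002
	};

	/* Called with sem->wait_lock held. */
	static int rwsem_wake_sketch(struct rw_semaphore *sem)
	{
		struct rwsem_waiter *waiter;
		int woken = 0;

		/* (3) is the queue empty?  Then contention is over. */
		if (list_empty(&sem->wait_list))
			return 0;

		/* (1) glance at the first element to pick the wakeup type */
		waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list);
		if (waiter->flags & RWSEM_WAITING_FOR_WRITE) {
			list_del(&waiter->list);
			wake_up_process(waiter->task);
			return 1;
		}

		/* (2) iterate, waking readers until the next queued writer */
		while (!list_empty(&sem->wait_list)) {
			waiter = list_entry(sem->wait_list.next,
					    struct rwsem_waiter, list);
			if (waiter->flags & RWSEM_WAITING_FOR_WRITE)
				break;
			list_del(&waiter->list);
			wake_up_process(waiter->task);
			woken++;
		}
		return woken;
	}

With a hashed global array, each of those steps would first have to find the
right bucket and then contend with unrelated semaphores hashed into it.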

Moving the spinlock out, on the other hand, might be worth it to cut down on
cacheline bouncing some more...

David


Re: [patch] generic rwsems

2007-04-13 Thread Eric Dumazet
On Fri, 13 Apr 2007 14:31:52 +0100
David Howells <[EMAIL PROTECTED]> wrote:


> Break the counter down like this:
> 
>   0x00000000  - not locked; queue empty
>   0x40000000  - locked by writer; queue empty
>   0xc0000000  - locked by writer; queue occupied
>   0x0nnnnnnn  - n readers; queue empty
>   0x8nnnnnnn  - n readers; queue occupied

If space considerations are that important, we could then reserve one bit for 
the 'wait_lock spinlock'

0x20000000 : one cpu gained control of 'wait_list'

This would save 4 bytes on 32 bit platforms.

64 bit platforms could have a limit of 2^60 threads, instead of the way too 
small 2^28 one ;)

(we lose the debug version of spinlock of course)
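
A rough sketch of what that embedded bit-lock could look like, assuming an
atomic_long_t counter (as in Nick's patch, with atomic_long_cmpxchg available)
and the 0x20000000 bit above; the helper names are made up for illustration:

	#include <asm/atomic.h>
	#include <asm/processor.h>	/* cpu_relax() */

	#define RWSEM_WAIT_LOCK		0x20000000L	/* bit 29: wait_list owner */

	/* spin until we own the wait_list bit embedded in the counter */
	static inline void rwsem_wait_lock(atomic_long_t *count)
	{
		long old;

		for (;;) {
			old = atomic_long_read(count);
			if (!(old & RWSEM_WAIT_LOCK) &&
			    atomic_long_cmpxchg(count, old,
						old | RWSEM_WAIT_LOCK) == old)
				return;
			cpu_relax();
		}
	}

	static inline void rwsem_wait_unlock(atomic_long_t *count)
	{
		/* we own the bit, so an atomic subtract clears it safely */
		atomic_long_sub(RWSEM_WAIT_LOCK, count);
	}

The price, as noted, is losing lockdep/debug coverage for that lock, and every
fastpath cmpxchg has to treat the bit as part of the 'contended' state.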

Another possibility to save space would be to move wait_lock/wait_list outside 
of rw_semaphore, in a hashed global array.
This would save 12/16 bytes per rw_semaphore (inode structs are probably the 
most demanding)


Re: [patch] generic rwsems

2007-04-13 Thread Nick Piggin
On Fri, Apr 13, 2007 at 02:31:52PM +0100, David Howells wrote:
> Nick Piggin <[EMAIL PROTECTED]> wrote:
> 
> > The other way happens to be better for everyone else, which is why I
> > think your suggestion to instead move everyone to the spinlock version
> > was weird.
> 
> No, you misunderstand me.  My preferred solution is to leave it up to the arch
> and not to make it generic, though I'm not averse to providing some 
> prepackaged
> solutions for an arch to pick from if it wants to.

Just doesn't seem to be much payoff. I know you didn't think there is
anything wrong with 2 different implementations and hundreds of lines
of arch specific assembly, but there is really little gain. Sure you
might be able to optimise a few cycles off the i386 asm, but damn I
just blitzed that sort of improvement on x86-64... and I still doubt
it will be very noticeable because rwsems don't get called like
spinlocks.


> > Finally: as I said, even for those 2 architectures, this may not be so
> > bad because it is using a different spinlock for the fastpath as it is
> > for the slowpath. So your uncontested, cache cold case will get a little
> > slower, but the contested case could improve a lot (eg. I saw an order of
> > magnitude improvement).
> 
> Agreed.  I can see why the spinlock implementation is bad on SMP.  By all
> means change those cases, and reduce the spinlock implementation to an
> interrupt-disablement-only version that may be used on UP only.

So you missed my point about this above. If your UP atomic ops are
slower than interrupt disabling, then implement the damn things using
interrupt disabling instead of whatever it is you are using that is
slower! ;)

> > 32-bit machines might indeed overflow, but if it hasn't been a problem
> > for i386 or (even 64-bit) powerpc yet, is it a real worry? 
> 
> It has happened, I believe.  People have tried having >32766 threads on a
> 32-bit box.  Mad it may be, but...

Anyway, I doubt all the 32-bit archs using atomics would convert to the
slower spinlocks, so maybe this just has to be a known issue.


> > This is what spinlocks and mutexes do, and they're much more common than
> > rwsems. I'm just trying to make it consistent, and you can muck around
> > with it all you want after that. It is actually very easy to inline
> > things now, unlike before my patch.
> 
> My original stuff was very easy to inline until Ingo got hold of it.

I agree that all the locking has turned pretty messy, but that isn't
my fault.


> > > Think about it.  This algorithm is only optimal where XADD is available.  
> > > If
> > > you don't have XADD, but you do have LL/SC or CMPXCHG, you can do better.
> > 
> > You keep saying this too, and I have thought about it but I couldn't think
> > of a much better way. I'm not saying you're wrong, but why don't you just
> > tell me what that better way is?
> 
> Look at how the counter works in the spinlock case.  With the addition of an
> extra flag in the counter to indicate there's stuff waiting on the queue, you
> can manipulate the counter if it appears safe to do so, otherwise you have to
> fall back to the slow path and take a spin lock.

[snip]

Ah, thanks. Yeah actually I remember you describing this at LCA, so I
apologise for saying you didn't ;)

Really, that isn't going to do much for performance (nothing as dramatic
as the x86-64 spinlock->atomic conversion). However it will reduce the
lock size by 8 bytes on 64-bit and fix the overflow on 32-bit...

So why don't we implement this as the generic version? UP archs won't
care because atomic_cmpxchg is generally just an interrupt disable,
similarly for sparc and parisc. Most others except x86 do atomic_add_return
with llsc or cas anyway, and even if the cmpxchg is a tiny bit slower for
x86, at least it should be much faster than the spinlock version for
x86-64, and will solve the overflow for i386. 
 
What do you say?

> > > similarly if you are using a UP-compiled kernel.
> > 
> > Then your UP-compiled kernel's atomic ops are suboptimal, not the rwsem
> > implementation.
> 
> That's usually a function of the CPU designer.

I mean your atomic_xxx functions are suboptimal for that design of CPU.


Re: [patch] generic rwsems

2007-04-13 Thread David Howells
Nick Piggin <[EMAIL PROTECTED]> wrote:

> The other way happens to be better for everyone else, which is why I
> think your suggestion to instead move everyone to the spinlock version
> was weird.

No, you misunderstand me.  My preferred solution is to leave it up to the arch
and not to make it generic, though I'm not averse to providing some prepackaged
solutions for an arch to pick from if it wants to.

> Finally: as I said, even for those 2 architectures, this may not be so
> bad because it is using a different spinlock for the fastpath as it is
> for the slowpath. So your uncontested, cache cold case will get a little
> slower, but the contested case could improve a lot (eg. I saw an order of
> magnitude improvement).

Agreed.  I can see why the spinlock implementation is bad on SMP.  By all means
change those cases, and reduce the spinlock implementation to an
interrupt-disablement-only version that may be used on UP only.

> 32-bit machines might indeed overflow, but if it hasn't been a problem
> for i386 or (even 64-bit) powerpc yet, is it a real worry? 

It has happened, I believe.  People have tried having >32766 threads on a
32-bit box.  Mad it may be, but...

> This is what spinlocks and mutexes do, and they're much more common than
> rwsems. I'm just trying to make it consistent, and you can muck around
> with it all you want after that. It is actually very easy to inline
> things now, unlike before my patch.

My original stuff was very easy to inline until Ingo got hold of it.

> > Think about it.  This algorithm is only optimal where XADD is available.  If
> > you don't have XADD, but you do have LL/SC or CMPXCHG, you can do better.
> 
> You keep saying this too, and I have thought about it but I couldn't think
> of a much better way. I'm not saying you're wrong, but why don't you just
> tell me what that better way is?

Look at how the counter works in the spinlock case.  With the addition of an
extra flag in the counter to indicate there's stuff waiting on the queue, you
can manipulate the counter if it appears safe to do so, otherwise you have to
fall back to the slow path and take a spin lock.

Break the counter down like this:

        0x00000000  - not locked; queue empty
        0x40000000  - locked by writer; queue empty
        0xc0000000  - locked by writer; queue occupied
        0x0nnnnnnn  - n readers; queue empty
        0x8nnnnnnn  - n readers; queue occupied

Now here's a rough guide as to how the main operations would work:

 (*) down_read of unlocked

        cmpxchg(0 -> 1) -> 0 [okay - you've got the lock]

 (*) down_read of readlocked.

        cmpxchg(0 -> 1) -> n [failed to get the lock]
        do
                n = cmpxchg(old_n -> old_n+1)
        until n == old_n

 (*) down_read of writelocked or contended readlocked.

        cmpxchg(0 -> 1) -> 0x80000000|n  [lock contended]
        goto slowpath
                spinlock
                try to get lock again
                if still contended
                        mark counter contended
                        add to queue
                        spinunlock
                        sleep
                spinunlock

 (*) down_write of unlocked

        cmpxchg(0 -> 0x40000000) -> 0 [okay - you've got the lock]

 (*) down_write of locked, contended or otherwise

        cmpxchg(0 -> 0x40000000) -> nz [failed]
        goto slowpath
                spinlock
                try to get lock again
                if still unavailable
                        mark counter contended
                        add to queue
                        spinunlock
                        sleep
                else
                        spinunlock

 (*) up_read

        x = cmpxchg(1 -> 0)
        if x == 0
                done
        else
                do
                        x = cmpxchg(old_x -> old_x-1)
                until x == old_x
                if old_x == 0x80000000
                        wake_up_writer

 (*) up_write

        x = cmpxchg(0x80000000 -> 0)
        if x == 0
                done
        else
                wake_up_readers

You can actually do better with LL/SC here because for the initial attempt with
CMPXCHG in each case you have to guess what the numbers will be.  Furthermore,
you might actually be able to do an "atomic increment or set contention flag"
operation.

Note that the contention flag may only be set or cleared in the slow path
whilst you are holding the spinlock.

Doing down_read and up_read with CMPXCHG is a pain.  XADD or LL/SC would be
better, and LOCK INC/ADD/DEC/SUB won't work.  You can't use XADD in down_*() as
you may not change the bottom part of the counter if you're going to end up
queuing.
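
For concreteness, here is a minimal C sketch of the down_read fastpath under
the scheme above, assuming the atomic_long_t count field from Nick's patch;
the flag values follow the table and the slowpath function is hypothetical:

	#define RWSEM_QUEUED		0x80000000	/* queue occupied */
	#define RWSEM_WRITE_LOCKED	0x40000000	/* held by a writer */

	/* hypothetical slowpath: take wait_lock, retry, set the contention
	 * flag, queue the waiter and sleep */
	extern void rwsem_down_read_slow(struct rw_semaphore *sem);

	/* assumes struct rw_semaphore { atomic_long_t count; ... } */
	static inline void down_read_sketch(struct rw_semaphore *sem)
	{
		long old, seen;

		old = atomic_long_cmpxchg(&sem->count, 0, 1);
		if (old == 0)
			return;			/* was unlocked - got it */

		while (!(old & (RWSEM_QUEUED | RWSEM_WRITE_LOCKED))) {
			/* only readers hold it and nothing is queued: try +1 */
			seen = atomic_long_cmpxchg(&sem->count, old, old + 1);
			if (seen == old)
				return;
			old = seen;
		}

		/* a writer holds it, or the queue is occupied: go the slow way */
		rwsem_down_read_slow(sem);
	}

up_read would mirror it: a cmpxchg(1 -> 0) fast case, then a decrement loop
that drops into the slowpath once the queue-occupied flag is seen, much as in
the rough guide above.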

Actually, looking at it, it might be better to have the counter start off at
0x80000000 for "unlocked, no contention" and clear the no-contention flag when
you queue something.  That way you can check for the counter becoming 0 in the
up_*() functions as a trigger to go and invoke the slowpath.  Then you could
use LOCK DEC/SUB on i386 rather than XADD as you only need to check the Z flag.

Note there is a slight window whereby a reader can jump a writer that's
transiting between the fastpath part of down_write() and ...

Re: [patch] generic rwsems

2007-04-13 Thread William Lee Irwin III
On Fri, Apr 13, 2007 at 02:43:03PM +0200, Nick Piggin wrote:
> Yes, this is the case on our 2 premiere SMP powerhouse architectures,
> sparc32 and parisc.

sparc32 is ultra-legacy and I have a tremendous amount of work to do on
SMP there. I don't feel that efficiency of locking primitives is a
crucial issue for sparc32.


-- wli


Re: [patch] generic rwsems

2007-04-13 Thread Nick Piggin
David, you keep saying the same things and don't listen to me.

On Fri, Apr 13, 2007 at 01:09:42PM +0100, David Howells wrote:
> Nick Piggin <[EMAIL PROTECTED]> wrote:
> 
> > This patch converts all architectures to a generic rwsem implementation,
> > which will compile down to the same code for i386, or powerpc, for
> > example,
> 
> > and will allow some (eg. x86-64) to move away from spinlock based rwsems.
> 
> Which are better on UP kernels because spinlocks degrade to nothing, and then
> you're left with a single disable/enable interrupt pair per operation, and no
> requirement for atomic ops at all.

On UP, if an IRQ disable/enable pair operation is _faster_ than the atomic
op, then the architecture can and should implement atomic ops on UP by
doing exactly that.
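
In other words, on such a UP machine atomic_add_return can legitimately be
nothing more than that IRQ pair; a minimal sketch (a generic illustration, not
any particular arch's asm/atomic.h):

	#include <linux/irqflags.h>

	typedef struct { volatile int counter; } up_atomic_t;

	static inline int up_atomic_add_return(int i, up_atomic_t *v)
	{
		unsigned long flags;
		int ret;

		local_irq_save(flags);	/* the only concurrency on UP is an IRQ */
		ret = v->counter + i;
		v->counter = ret;
		local_irq_restore(flags);
		return ret;
	}

so the atomic-based rwsem fastpath then costs the same single disable/enable
pair as the spinlock-based one.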


> What you propose may wind up with several per op because if the CPU does not
> support atomic ops directly and cannot emulate them through other atomic ops,
> then these have to be emulated by:
> 
>   atomic_op() {
>   spin_lock_irqsave
>   do op
>   spin_unlock_irqrestore
>   }

Yes, this is the case on our 2 premiere SMP powerhouse architectures,
sparc32 and parisc.

The other way happens to be better for everyone else, which is why I
think your suggestion to instead move everyone to the spinlock version
was weird.

Finally: as I said, even for those 2 architectures, this may not be so
bad because it is using a different spinlock for the fastpath as it is
for the slowpath. So your uncontested, cache cold case will get a little
slower, but the contested case could improve a lot (eg. I saw an order of
magnitude improvement).


> > Move to an architecture independent rwsem implementation, using the
> > better of the two rwsem implementations
> 
> That's not necessarily the case, as I said above.
> 
> Furthermore, the spinlock implementation struct is smaller on 64-bit machines,
> and is less prone to counter overrun on 32-bit machines.

I think 64-bit machines will be happy to take the extra word if they get
double the single threaded performance, quadruple the parallel read
performance, and 10 times the contested read performance.

32-bit machines might indeed overflow, but if it hasn't been a problem
for i386 or (even 64-bit) powerpc yet, is it a real worry? 


> > Out-of-line the fastpaths, to bring rw-semaphores into line with
> > mutexes and spinlocks WRT our icache vs function call policy.
> 
> That should depend on whether you optimise for space or for speed.  Function
> calls are relatively heavyweight.

This is what spinlocks and mutexes do, and they're much more common than
rwsems. I'm just trying to make it consistent, and you can muck around
with it all you want after that. It is actually very easy to inline
things now, unlike before my patch.


> Please re-inline and fix Ingo's mess if you must clean up.  Take the i386
> version, for instance, I'd made it so that the compiler didn't know it was
> taking a function call when it went down the slow path, thus meaning the
> compiler didn't have to deal with that.  Furthermore, it only interpolated two
> or three instructions into the calling code in the fastpath.  It's a real 
> shame
> that gcc inline asm doesn't allow you to use status flags as boolean returns,
> otherwise I could reduce that even further.

I cleaned it.


> > Spinlock based rwsems are inferior to atomic based ones on most
> > architectures that can do atomic ops without spinlocks:
> 
> Note the "most" in your statement...
> 
> Think about it.  This algorithm is only optimal where XADD is available.  If
> you don't have XADD, but you do have LL/SC or CMPXCHG, you can do better.

You keep saying this too, and I have thought about it but I couldn't think
of a much better way. I'm not saying you're wrong, but why don't you just
tell me what that better way is?


> If the only atomic op you have is XCHG, then this is a really poor choice;

What is better? Spinlocks? Considering that only 2 dead archs really care
at this stage, and that I have good reason to believe the contested case
will be improved, why don't you come up with some numbers to prove me
wrong?


> similarly if you are using a UP-compiled kernel.

Then your UP-compiled kernel's atomic ops are suboptimal, not the rwsem
implementation.

Anyway, thanks for taking the time again. If you would please address each
of my points, then we might finally be able to stop having this discussion
every 6 months ;)

Nick



Re: [patch] generic rwsems

2007-04-13 Thread Christoph Hellwig
On Fri, Apr 13, 2007 at 12:44:50PM +0100, David Howells wrote:
> Nick Piggin <[EMAIL PROTECTED]> wrote:
> 
> > I think I should put wait_lock after wait_list, so as to get a better
> > packing on most 64-bit architectures.
> 
> It makes no difference.  struct lockdep_map contains at least one pointer and
> so is going to be 8-byte aligned (assuming it's there at all).  struct
> rw_semaphore contains at least one pointer/long, so it will be padded out to
> 8-byte size.

I hope people are not going to enable lockdep on their production systems :)



Re: [patch] generic rwsems

2007-04-13 Thread David Howells
Nick Piggin <[EMAIL PROTECTED]> wrote:

> This patch converts all architectures to a generic rwsem implementation,
> which will compile down to the same code for i386, or powerpc, for
> example,

> and will allow some (eg. x86-64) to move away from spinlock based rwsems.

Which are better on UP kernels because spinlocks degrade to nothing, and then
you're left with a single disable/enable interrupt pair per operation, and no
requirement for atomic ops at all.

What you propose may wind up with several per op because if the CPU does not
support atomic ops directly and cannot emulate them through other atomic ops,
then these have to be emulated by:

	atomic_op() {
		spin_lock_irqsave
		do op
		spin_unlock_irqrestore
	}
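
That emulation is roughly what sparc32 does with a small hashed array of
spinlocks guarding atomic_t operations; a simplified sketch of the idea (not
the literal arch code):

	#include <linux/spinlock.h>
	#include <asm/atomic.h>

	#define ATOMIC_HASH_SIZE	4
	#define ATOMIC_HASH(v) \
		(&atomic_locks[((unsigned long)(v) >> 5) & (ATOMIC_HASH_SIZE - 1)])

	static spinlock_t atomic_locks[ATOMIC_HASH_SIZE] = {
		[0 ... ATOMIC_HASH_SIZE - 1] = SPIN_LOCK_UNLOCKED
	};

	static int emulated_atomic_add_return(int i, atomic_t *v)
	{
		unsigned long flags;
		int ret;

		spin_lock_irqsave(ATOMIC_HASH(v), flags);
		ret = (v->counter += i);
		spin_unlock_irqrestore(ATOMIC_HASH(v), flags);
		return ret;
	}

Every emulated atomic op is a full spinlock/IRQ round trip, which is why an
atomic-based rwsem costs several such pairs per operation on these machines.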

> Move to an architecture independent rwsem implementation, using the
> better of the two rwsem implementations

That's not necessarily the case, as I said above.

Furthermore, the spinlock implementation struct is smaller on 64-bit machines,
and is less prone to counter overrun on 32-bit machines.

> Out-of-line the fastpaths, to bring rw-semaphores into line with
> mutexes and spinlocks WRT our icache vs function call policy.

That should depend on whether you optimise for space or for speed.  Function
calls are relatively heavyweight.

Please re-inline and fix Ingo's mess if you must clean up.  Take the i386
version, for instance, I'd made it so that the compiler didn't know it was
taking a function call when it went down the slow path, thus meaning the
compiler didn't have to deal with that.  Furthermore, it only interpolated two
or three instructions into the calling code in the fastpath.  It's a real shame
that gcc inline asm doesn't allow you to use status flags as boolean returns,
otherwise I could reduce that even further.

> Spinlock based rwsems are inferior to atomic based ones on most
> architectures that can do atomic ops without spinlocks:

Note the "most" in your statement...

Think about it.  This algorithm is only optimal where XADD is available.  If
you don't have XADD, but you do have LL/SC or CMPXCHG, you can do better.

If the only atomic op you have is XCHG, then this is a really poor choice;
similarly if you are using a UP-compiled kernel.

David


Re: [patch] generic rwsems

2007-04-13 Thread David Howells
Nick Piggin <[EMAIL PROTECTED]> wrote:

> I think I should put wait_lock after wait_list, so as to get a better
> packing on most 64-bit architectures.

It makes no difference.  struct lockdep_map contains at least one pointer and
so is going to be 8-byte aligned (assuming it's there at all).  struct
rw_semaphore contains at least one pointer/long, so it will be padded out to
8-byte size.
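
The padding argument is easy to see with concrete offsets; a rough
illustration for an LP64 target, assuming a 4-byte spinlock_t and lockdep
disabled:

	#include <linux/spinlock.h>
	#include <linux/list.h>

	/* count, lock, list: the 4-byte spinlock leaves a hole before wait_list */
	struct rwsem_layout_a {
		long			count;		/* offset  0 */
		spinlock_t		wait_lock;	/* offset  8, then 4 bytes pad */
		struct list_head	wait_list;	/* offset 16, two pointers */
	};						/* sizeof == 32 */

	/* count, list, lock: the hole just moves to the tail padding */
	struct rwsem_layout_b {
		long			count;		/* offset  0 */
		struct list_head	wait_list;	/* offset  8 */
		spinlock_t		wait_lock;	/* offset 24, then 4 bytes pad */
	};						/* sizeof == 32 */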

If you want to make a difference, you'd need to add __attribute__((packed))
but you would need to be very careful with that.

David


Re: [patch] generic rwsems

2007-04-13 Thread Nick Piggin
On Fri, Apr 13, 2007 at 12:53:49PM +0200, Andi Kleen wrote:
> On Friday 13 April 2007 12:04:16 Nick Piggin wrote:
> > OK, this patch is against 2.6.21-rc6 + Mathieu's atomic_long patches.
> > 
> > Last time this came up I was asked to get some numbers, so here are
> > some in the changelog, captured with a simple kernel module tester.
> > I got motivated again because of the MySQL/glibc/mmap_sem issue.
> > 
> > This patch converts all architectures to a generic rwsem implementation,
> > which will compile down to the same code for i386, or powerpc, for
> > example, and will allow some (eg. x86-64) to move away from spinlock
> > based rwsems.
> > 
> > Comments?
> 
> Fine for me from the x86-64 side. Some more validation with a test suite
> would be good though.

David had a test suite somewhere, so I'll give that a run.



Re: fastcalls, was Re: [patch] generic rwsems

2007-04-13 Thread Nick Piggin
On Fri, Apr 13, 2007 at 11:19:30AM +0100, Christoph Hellwig wrote:
> On Fri, Apr 13, 2007 at 12:04:16PM +0200, Nick Piggin wrote:
> > Remove one level of indirection (kernel/rwsem.c -> lib/rwsem.c), and
> > give a bit of a cleanup (eg remove the fastcall junk) to make the
> > code a bit easier to read.
> 
> Apropos fastcalls, now that -mregparm=3 is the default on i386 and
> FASTCALL/fastcall is a noop everywhere else can we please get rid
> of this attribute entirely?  If any other architecture provides
> a more optimal non-standard calling convention we can just use
> it by default in kernelspace as a few architectures already do.

I don't see why not. David objected to me removing them, I think, because
the rwsem code called some from asm (but of course that's removed too).
However AFAIK, this situation should be using the asmlinkage attribute
anyway.

fastcall in kernel is annoying... it seems to imply that we normally make
rather slow calls ;)
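
For reference, the i386 definitions being discussed look roughly like this,
with the generic fallback that other architectures get shown below them:

	/* include/asm-i386/linkage.h (approximately) */
	#define asmlinkage	CPP_ASMLINKAGE __attribute__((regparm(0)))
	#define FASTCALL(x)	x __attribute__((regparm(3)))
	#define fastcall	__attribute__((regparm(3)))

	/* include/linux/linkage.h fallback when the arch defines neither,
	 * i.e. FASTCALL/fastcall are already no-ops everywhere else */
	#ifndef FASTCALL
	#define FASTCALL(x)	x
	#define fastcall
	#endif

With -mregparm=3 the default for i386 builds, regparm(3) is just the normal
calling convention, so the annotation buys nothing; asmlinkage (regparm(0))
remains the attribute that matters for functions called from assembly.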


Re: [patch] generic rwsems

2007-04-13 Thread Andi Kleen
On Friday 13 April 2007 12:04:16 Nick Piggin wrote:
> OK, this patch is against 2.6.21-rc6 + Mathieu's atomic_long patches.
> 
> Last time this came up I was asked to get some numbers, so here are
> some in the changelog, captured with a simple kernel module tester.
> I got motivated again because of the MySQL/glibc/mmap_sem issue.
> 
> This patch converts all architectures to a generic rwsem implementation,
> which will compile down to the same code for i386, or powerpc, for
> example, and will allow some (eg. x86-64) to move away from spinlock
> based rwsems.
> 
> Comments?

Fine for me from the x86-64 side. Some more validation with a test suite
would be good though.

-Andi


Re: [patch] generic rwsems

2007-04-13 Thread Nick Piggin
On Fri, Apr 13, 2007 at 12:04:16PM +0200, Nick Piggin wrote:
> OK, this patch is against 2.6.21-rc6 + Mathieu's atomic_long patches.
> 
> Last time this came up I was asked to get some numbers, so here are
> some in the changelog, captured with a simple kernel module tester.
> I got motivated again because of the MySQL/glibc/mmap_sem issue.
> 
> This patch converts all architectures to a generic rwsem implementation,
> which will compile down to the same code for i386, or powerpc, for
> example, and will allow some (eg. x86-64) to move away from spinlock
> based rwsems.

Oh, and it also converts powerpc and sparc64 to 64-bit counters, so
they can handle more than 32K tasks waiting (which was apparently a
real problem for SGI, and is probably a good thing).

But that reminds me:
> +/*
> + * the semaphore definition
> + */
> +struct rw_semaphore {
> +        atomic_long_t           count;
> +        spinlock_t              wait_lock;
> +        struct list_head        wait_list;
> +#ifdef CONFIG_DEBUG_LOCK_ALLOC
> +        struct lockdep_map      dep_map;
> +#endif
> +};

I think I should put wait_lock after wait_list, so as to get a better
packing on most 64-bit architectures.


fastcalls, was Re: [patch] generic rwsems

2007-04-13 Thread Christoph Hellwig
On Fri, Apr 13, 2007 at 12:04:16PM +0200, Nick Piggin wrote:
> Remove one level of indirection (kernel/rwsem.c -> lib/rwsem.c), and
> give a bit of a cleanup (eg remove the fastcall junk) to make the
> code a bit easier to read.

Apropos fastcalls, now that -mregparm=3 is the default on i386 and
FASTCALL/fastcall is a noop everywhere else can we please get rid
of this attribute entirely?  If any other architecture provides
a more optimal non-standard calling convention we can just use
it by default in kernelspace as a few architectures already do.