Re: [patch] generic rwsems
Eric Dumazet <[EMAIL PROTECTED]> wrote:

> If space considerations are that important, we could then reserve one bit
> for the 'wait_lock spinlock'

That makes life quite a bit more tricky, though it does have the advantage
that it closes the reader-jumping-writer window I mentioned.

> Another possibility to save space would be to move wait_lock/wait_list
> outside of rw_semaphore, in a hashed global array.

I suspect moving wait_list out would be a bad idea. The ordering of things in
the list is very important. You need to perform several operations on the
list, all of which would be potentially slower:

 (1) glance at the first element of the list to see what sort of wake up to do

 (2) iteration of the list when waking up multiple readers

 (3) seeing if the list is empty (so you know that there's no more contention)

Moving the spinlock out, on the other hand, might be worth it to cut down on
cacheline bouncing some more...

David
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
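As an illustrative user-space sketch only of the hashed global array being discussed (with just the spinlock moved out, per David's last point): pthread mutexes stand in for spinlocks, and the table size, hash, and all names here are my own choices, not anything from the thread.

```c
/* Sketch of an external hashed lock table for semaphores.  pthread
 * mutexes stand in for spinlocks; sizes and hash are illustrative. */
#include <assert.h>
#include <stdint.h>
#include <pthread.h>

#define LOCK_HASH_BITS 8
#define LOCK_HASH_SIZE (1u << LOCK_HASH_BITS)

/* one shared pool of locks for every semaphore in the system;
 * the range designator is a GCC extension */
static pthread_mutex_t lock_hash[LOCK_HASH_SIZE] = {
    [0 ... LOCK_HASH_SIZE - 1] = PTHREAD_MUTEX_INITIALIZER,
};

/* map a semaphore's address to its external wait_lock */
static pthread_mutex_t *sem_wait_lock(const void *sem)
{
    uintptr_t p = (uintptr_t)sem;

    /* drop the low alignment bits, fold the rest into the table */
    return &lock_hash[(p >> 4) % LOCK_HASH_SIZE];
}
```

The trade-off raised above applies equally here: the mapping is stable for a given semaphore, but unrelated semaphores that hash together now contend on the same lock and cacheline.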
Re: [patch] generic rwsems
On Fri, 13 Apr 2007 14:31:52 +0100
David Howells <[EMAIL PROTECTED]> wrote:

> Break the counter down like this:
>
> 0x - not locked; queue empty
> 0x4000 - locked by writer; queue empty
> 0xc000 - locked by writer; queue occupied
> 0x0nnn - n readers; queue empty
> 0x8nnn - n readers; queue occupied

If space considerations are that important, we could then reserve one bit
for the 'wait_lock spinlock':

0x2000 : one cpu gained control of 'wait_list'

This would save 4 bytes on 32-bit platforms. 64-bit platforms could have a
limit of 2^60 threads, instead of the way too small 2^28 one ;) (we lose the
debug version of spinlock, of course)

Another possibility to save space would be to move wait_lock/wait_list
outside of rw_semaphore, in a hashed global array. This would save 12/16
bytes per rw_semaphore (inode structs are probably the most demanding)
Re: [patch] generic rwsems
On Fri, Apr 13, 2007 at 02:31:52PM +0100, David Howells wrote:
> Nick Piggin <[EMAIL PROTECTED]> wrote:
>
> > The other way happens to be better for everyone else, which is why I
> > think your suggestion to instead move everyone to the spinlock version
> > was weird.
>
> No, you misunderstand me. My preferred solution is to leave it up to the arch
> and not to make it generic, though I'm not averse to providing some
> prepackaged solutions for an arch to pick from if it wants to.

Just doesn't seem to be much payoff. I know you didn't think there is
anything wrong with 2 different implementations and hundreds of lines of
arch-specific assembly, but there is really little gain. Sure you might be
able to optimise a few cycles off the i386 asm, but damn, I just blitzed that
sort of improvement on x86-64... and I still doubt it will be very noticeable
because rwsems don't get called like spinlocks.

> > Finally: as I said, even for those 2 architectures, this may not be so
> > bad because it is using a different spinlock for the fastpath as it is
> > for the slowpath. So your uncontested, cache cold case will get a little
> > slower, but the contested case could improve a lot (eg. I saw an order of
> > magnitude improvement).
>
> Agreed. I can see why the spinlock implementation is bad on SMP. By all means
> change those cases, and reduce the spinlock implementation to an interrupt
> disablement only version that may only be used on UP only.

So you missed my point about this above. If your UP atomic ops are slower
than interrupt disabling, then implement the damn things using interrupt
disabling instead of whatever it is you are using that is slower! ;)

> > 32-bit machines might indeed overflow, but if it hasn't been a problem
> > for i386 or (even 64-bit) powerpc yet, is it a real worry?
>
> It has happened, I believe. People have tried having >32766 threads on a
> 32-bit box. Mad it may be, but...

Anyway, I doubt all the 32-bit archs using atomics would convert to the
slower spinlocks, so maybe this just has to be a known issue.

> > This is what spinlocks and mutexes do, and they're much more common than
> > rwsems. I'm just trying to make it consistent, and you can muck around
> > with it all you want after that. It is actually very easy to inline
> > things now, unlike before my patch.
>
> My original stuff was very easy to inline until Ingo got hold of it.

I agree that all the locking has turned pretty messy, but that isn't my
fault.

> > > Think about it. This algorithm is only optimal where XADD is available. If
> > > you don't have XADD, but you do have LL/SC or CMPXCHG, you can do better.
> >
> > You keep saying this too, and I have thought about it but I couldn't think
> > of a much better way. I'm not saying you're wrong, but why don't you just
> > tell me what that better way is?
>
> Look at how the counter works in the spinlock case. With the addition of an
> extra flag in the counter to indicate there's stuff waiting on the queue, you
> can manipulate the counter if it appears safe to do so, otherwise you have to
> fall back to the slow path and take a spin lock.

[snip]

Ah, thanks. Yeah, actually I remember you describing this at LCA, so I
apologise for saying you didn't ;)

Really, that isn't going to do much for performance (nothing as dramatic as
the x86-64 spinlock->atomic conversion). However it will reduce the lock size
by 8 bytes on 64-bit and fix the overflow on 32-bit...

So why don't we implement this as the generic version? UP archs won't care
because atomic_cmpxchg is generally just an interrupt disable, similarly for
sparc and parisc. Most others except x86 do atomic_add_return with llsc or
cas anyway, and even if the cmpxchg is a tiny bit slower for x86, at least it
should be much faster than the spinlock version for x86-64, and will solve
the overflow for i386.

What do you say?

> > > similarly if you are using a UP-compiled kernel.
> >
> > Then your UP-compiled kernel's atomic ops are suboptimal, not the rwsem
> > implementation.
>
> That's usually a function of the CPU designer.

I mean your atomic_xxx functions are suboptimal for that design of CPU.
Re: [patch] generic rwsems
Nick Piggin <[EMAIL PROTECTED]> wrote:

> The other way happens to be better for everyone else, which is why I
> think your suggestion to instead move everyone to the spinlock version
> was weird.

No, you misunderstand me. My preferred solution is to leave it up to the arch
and not to make it generic, though I'm not averse to providing some
prepackaged solutions for an arch to pick from if it wants to.

> Finally: as I said, even for those 2 architectures, this may not be so
> bad because it is using a different spinlock for the fastpath as it is
> for the slowpath. So your uncontested, cache cold case will get a little
> slower, but the contested case could improve a lot (eg. I saw an order of
> magnitude improvement).

Agreed. I can see why the spinlock implementation is bad on SMP. By all means
change those cases, and reduce the spinlock implementation to an interrupt
disablement version that may only be used on UP.

> 32-bit machines might indeed overflow, but if it hasn't been a problem
> for i386 or (even 64-bit) powerpc yet, is it a real worry?

It has happened, I believe. People have tried having >32766 threads on a
32-bit box. Mad it may be, but...

> This is what spinlocks and mutexes do, and they're much more common than
> rwsems. I'm just trying to make it consistent, and you can muck around
> with it all you want after that. It is actually very easy to inline
> things now, unlike before my patch.

My original stuff was very easy to inline until Ingo got hold of it.

> > Think about it. This algorithm is only optimal where XADD is available. If
> > you don't have XADD, but you do have LL/SC or CMPXCHG, you can do better.
>
> You keep saying this too, and I have thought about it but I couldn't think
> of a much better way. I'm not saying you're wrong, but why don't you just
> tell me what that better way is?

Look at how the counter works in the spinlock case.
With the addition of an extra flag in the counter to indicate there's stuff
waiting on the queue, you can manipulate the counter if it appears safe to do
so, otherwise you have to fall back to the slow path and take a spin lock.

Break the counter down like this:

	0x - not locked; queue empty
	0x4000 - locked by writer; queue empty
	0xc000 - locked by writer; queue occupied
	0x0nnn - n readers; queue empty
	0x8nnn - n readers; queue occupied

Now here's a rough guide as to how the main operations would work:

 (*) down_read of unlocked

	cmpxchg(0 -> 1) -> 0		[okay - you've got the lock]

 (*) down_read of readlocked

	cmpxchg(0 -> 1) -> n		[failed to get the lock]
	do
		n = cmpxchg(old_n -> old_n+1)
	until n == old_n

 (*) down_read of writelocked or contended readlocked

	cmpxchg(0 -> 1) -> 0x8000|n	[lock contended]
	goto slowpath:
		spinlock
		try to get lock again
		if still contended
			mark counter contended
			add to queue
			spinunlock
			sleep
		else
			spinunlock

 (*) down_write of unlocked

	cmpxchg(0 -> 0x4000) -> 0	[okay - you've got the lock]

 (*) down_write of locked, contended or otherwise

	cmpxchg(0 -> 0x4000) -> nz	[failed]
	goto slowpath:
		spinlock
		try to get lock again
		if still unavailable
			mark counter contended
			add to queue
			spinunlock
			sleep
		else
			spinunlock

 (*) up_read

	x = cmpxchg(1 -> 0)
	if x == 0
		done
	else
		do
			x = cmpxchg(old_x -> old_x-1)
		until x == old_x
		if old_x == 0x8000
			wake_up_writer

 (*) up_write

	x = cmpxchg(0x8000 -> 0)
	if x == 0
		done
	else
		wake_up_readers

You can actually do better with LL/SC here because for the initial attempt
with CMPXCHG in each case you have to guess what the numbers will be.
Furthermore, you might actually be able to do an "atomic increment or set
contention flag" operation.

Note that the contention flag may only be set or cleared in the slow path
whilst you are holding the spinlock.

Doing down_read and up_read with CMPXCHG is a pain. XADD or LL/SC would be
better, and LOCK INC/ADD/DEC/SUB won't work. You can't use XADD in down_*()
as you may not change the bottom part of the counter if you're going to end
up queuing.

Actually, looking at it, it might be better to have the counter start off at
0x8000 for "unlocked, no contention" and clear the no-contention flag when
you queue something. That way you can check for the counter becoming 0 in the
up_*() functions as a trigger to go and invoke the slowpath. Then you could
use LOCK DEC/SUB on i386 rather than XADD as you only need to check the Z
flag.

Note there is a slight window whereby a reader can jump a writer that's
transiting
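As a rough user-space illustration only of the fastpaths sketched in this message: the bit assignments here (top bit = queue occupied, next bit = write-locked) and the GCC __sync builtin standing in for real CMPXCHG are my assumptions, not anything from the mail itself.

```c
/* Mockup of the cmpxchg-based rwsem fastpaths.  Failure means the
 * caller would take the slowpath: spinlock, recheck, queue, sleep. */
#include <assert.h>
#include <stdint.h>

#define RWSEM_WAITERS 0x80000000u   /* something is on the queue */
#define RWSEM_WRITER  0x40000000u   /* held by a writer */

/* down_read fastpath: succeed only while no writer holds the lock
 * and nothing is queued */
static int try_down_read(uint32_t *count)
{
    uint32_t old = 0, seen;

    for (;;) {
        /* guess old -> old+1; cmpxchg returns what was really there */
        seen = __sync_val_compare_and_swap(count, old, old + 1);
        if (seen == old)
            return 1;                        /* got the read lock */
        if (seen & (RWSEM_WRITER | RWSEM_WAITERS))
            return 0;                        /* contended: slowpath */
        old = seen;                          /* other readers: retry */
    }
}

/* down_write fastpath: only the fully unlocked state may be claimed */
static int try_down_write(uint32_t *count)
{
    return __sync_val_compare_and_swap(count, 0, RWSEM_WRITER) == 0;
}
```

This also makes the guessing problem visible: each CMPXCHG has to predict the current counter value and loop on a mismatch, which is exactly what LL/SC or XADD would avoid.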
Re: [patch] generic rwsems
On Fri, Apr 13, 2007 at 02:43:03PM +0200, Nick Piggin wrote:
> Yes, this is the case on our 2 premier SMP powerhouse architectures,
> sparc32 and parisc.

sparc32 is ultra-legacy and I have a tremendous amount of work to do on SMP
there. I don't feel that efficiency of locking primitives is a crucial issue
for sparc32.

-- wli
Re: [patch] generic rwsems
David, you keep saying the same things and don't listen to me.

On Fri, Apr 13, 2007 at 01:09:42PM +0100, David Howells wrote:
> Nick Piggin <[EMAIL PROTECTED]> wrote:
>
> > This patch converts all architectures to a generic rwsem implementation,
> > which will compile down to the same code for i386, or powerpc, for example,
> > and will allow some (eg. x86-64) to move away from spinlock based rwsems.
>
> Which are better on UP kernels because spinlocks degrade to nothing, and then
> you're left with a single disable/enable interrupt pair per operation, and no
> requirement for atomic ops at all.

On UP, if an IRQ disable/enable pair operation is _faster_ than the atomic
op, then the architecture can and should implement atomic ops on UP by doing
exactly that.

> What you propose may wind up with several per op because if the CPU does not
> support atomic ops directly and cannot emulate them through other atomic ops,
> then these have to be emulated by:
>
>	atomic_op() {
>		spin_lock_irqsave
>		do op
>		spin_unlock_irqrestore
>	}

Yes, this is the case on our 2 premier SMP powerhouse architectures, sparc32
and parisc.

The other way happens to be better for everyone else, which is why I think
your suggestion to instead move everyone to the spinlock version was weird.

Finally: as I said, even for those 2 architectures, this may not be so bad
because it is using a different spinlock for the fastpath as it is for the
slowpath. So your uncontested, cache cold case will get a little slower, but
the contested case could improve a lot (eg. I saw an order of magnitude
improvement).

> > Move to an architecture independent rwsem implementation, using the
> > better of the two rwsem implementations
>
> That's not necessarily the case, as I said above.
>
> Furthermore, the spinlock implementation struct is smaller on 64-bit machines,
> and is less prone to counter overrun on 32-bit machines.

I think 64-bit machines will be happy to take the extra word if they have
double the single-threaded performance, quadruple the parallel read
performance, and 10 times the contested read performance.

32-bit machines might indeed overflow, but if it hasn't been a problem for
i386 or (even 64-bit) powerpc yet, is it a real worry?

> > Out-of-line the fastpaths, to bring rw-semaphores into line with
> > mutexes and spinlocks WRT our icache vs function call policy.
>
> That should depend on whether you optimise for space or for speed. Function
> calls are relatively heavyweight.

This is what spinlocks and mutexes do, and they're much more common than
rwsems. I'm just trying to make it consistent, and you can muck around with
it all you want after that. It is actually very easy to inline things now,
unlike before my patch.

> Please re-inline and fix Ingo's mess if you must clean up. Take the i386
> version, for instance: I'd made it so that the compiler didn't know it was
> taking a function call when it went down the slow path, thus meaning the
> compiler didn't have to deal with that. Furthermore, it only interpolated two
> or three instructions into the calling code in the fastpath. It's a real shame
> that gcc inline asm doesn't allow you to use status flags as boolean returns,
> otherwise I could reduce that even further.

I cleaned it.

> > Spinlock based rwsems are inferior to atomic based ones on most
> > architectures that can do atomic ops without spinlocks:
>
> Note the "most" in your statement...
>
> Think about it. This algorithm is only optimal where XADD is available. If
> you don't have XADD, but you do have LL/SC or CMPXCHG, you can do better.

You keep saying this too, and I have thought about it but I couldn't think of
a much better way. I'm not saying you're wrong, but why don't you just tell
me what that better way is?

> If the only atomic op you have is XCHG, then this is a really poor choice;

What is better? Spinlocks? I think that, considering only 2 dead archs really
care at this stage, and I have good reason to believe that the contested case
will be improved, why don't you come up with some numbers to prove me wrong?

> similarly if you are using a UP-compiled kernel.

Then your UP-compiled kernel's atomic ops are suboptimal, not the rwsem
implementation.

Anyway, thanks for taking the time again. If you would please address each of
my points, then we might finally be able to stop having this discussion every
6 months ;)

Nick
Re: [patch] generic rwsems
On Fri, Apr 13, 2007 at 12:44:50PM +0100, David Howells wrote:
> Nick Piggin <[EMAIL PROTECTED]> wrote:
>
> > I think I should put wait_lock after wait_list, so as to get a better
> > packing on most 64-bit architectures.
>
> It makes no difference. struct lockdep_map contains at least one pointer and
> so is going to be 8-byte aligned (assuming it's there at all). struct
> rw_semaphore contains at least one pointer/long, so it will be padded out to
> 8-byte size.

I hope people are not going to enable lockdep on their production systems :)
Re: [patch] generic rwsems
Nick Piggin <[EMAIL PROTECTED]> wrote:

> This patch converts all architectures to a generic rwsem implementation,
> which will compile down to the same code for i386, or powerpc, for example,
> and will allow some (eg. x86-64) to move away from spinlock based rwsems.

Which are better on UP kernels because spinlocks degrade to nothing, and then
you're left with a single disable/enable interrupt pair per operation, and no
requirement for atomic ops at all.

What you propose may wind up with several per op because if the CPU does not
support atomic ops directly and cannot emulate them through other atomic ops,
then these have to be emulated by:

	atomic_op() {
		spin_lock_irqsave
		do op
		spin_unlock_irqrestore
	}

> Move to an architecture independent rwsem implementation, using the
> better of the two rwsem implementations

That's not necessarily the case, as I said above.

Furthermore, the spinlock implementation struct is smaller on 64-bit
machines, and is less prone to counter overrun on 32-bit machines.

> Out-of-line the fastpaths, to bring rw-semaphores into line with
> mutexes and spinlocks WRT our icache vs function call policy.

That should depend on whether you optimise for space or for speed. Function
calls are relatively heavyweight.

Please re-inline and fix Ingo's mess if you must clean up. Take the i386
version, for instance: I'd made it so that the compiler didn't know it was
taking a function call when it went down the slow path, thus meaning the
compiler didn't have to deal with that. Furthermore, it only interpolated two
or three instructions into the calling code in the fastpath. It's a real
shame that gcc inline asm doesn't allow you to use status flags as boolean
returns, otherwise I could reduce that even further.

> Spinlock based rwsems are inferior to atomic based ones on most
> architectures that can do atomic ops without spinlocks:

Note the "most" in your statement...

Think about it. This algorithm is only optimal where XADD is available. If
you don't have XADD, but you do have LL/SC or CMPXCHG, you can do better. If
the only atomic op you have is XCHG, then this is a really poor choice;
similarly if you are using a UP-compiled kernel.

David
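The atomic_op() emulation described in this message can be mocked up in user space. This is only an illustrative sketch: a pthread mutex stands in for spin_lock_irqsave, and the function name is mine, not the kernel's.

```c
/* atomic_add_return done the way an arch without real atomic ops must:
 * take a lock, do the op, drop the lock. */
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t atomic_lock = PTHREAD_MUTEX_INITIALIZER;

static long emulated_atomic_add_return(long delta, long *v)
{
    long ret;

    pthread_mutex_lock(&atomic_lock);    /* kernel: spin_lock_irqsave */
    *v += delta;
    ret = *v;
    pthread_mutex_unlock(&atomic_lock);  /* kernel: spin_unlock_irqrestore */
    return ret;
}
```

Every "atomic" op pays a full lock/unlock pair here, which is why a rwsem built from several such ops can lose to a single spinlocked critical section on these architectures.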
Re: [patch] generic rwsems
Nick Piggin <[EMAIL PROTECTED]> wrote:

> I think I should put wait_lock after wait_list, so as to get a better
> packing on most 64-bit architectures.

It makes no difference. struct lockdep_map contains at least one pointer and
so is going to be 8-byte aligned (assuming it's there at all). struct
rw_semaphore contains at least one pointer/long, so it will be padded out to
8-byte size.

If you want to make a difference, you'd need to add __attribute__((packed)),
but you would need to be very careful with that.

David
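The padding argument above is easy to check with stand-in types; the member types in this sketch are illustrative stubs of my own, not the kernel's real ones.

```c
/* Two field orderings of a rwsem-shaped struct: the 4-byte lock either
 * sits in an internal padding hole or in tail padding, and the compiler
 * rounds both structs up to the alignment of the largest member. */
#include <assert.h>
#include <stddef.h>

typedef int spinlock_stub_t;                  /* 4 bytes, like a bare spinlock */
struct list_head_stub { void *next, *prev; }; /* two pointers */

struct rwsem_lock_first {                     /* count, lock, list */
    long count;
    spinlock_stub_t wait_lock;                /* 4 bytes + 4 of padding on LP64 */
    struct list_head_stub wait_list;
};

struct rwsem_lock_last {                      /* count, list, lock */
    long count;
    struct list_head_stub wait_list;
    spinlock_stub_t wait_lock;                /* 4 bytes + 4 of tail padding */
};
```

Reordering only moves the padding around without shrinking anything, which is exactly the "makes no difference" point.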
Re: [patch] generic rwsems
On Fri, Apr 13, 2007 at 12:53:49PM +0200, Andi Kleen wrote:
> On Friday 13 April 2007 12:04:16 Nick Piggin wrote:
> > OK, this patch is against 2.6.21-rc6 + Mathieu's atomic_long patches.
> >
> > Last time this came up I was asked to get some numbers, so here are
> > some in the changelog, captured with a simple kernel module tester.
> > I got motivated again because of the MySQL/glibc/mmap_sem issue.
> >
> > This patch converts all architectures to a generic rwsem implementation,
> > which will compile down to the same code for i386, or powerpc, for
> > example, and will allow some (eg. x86-64) to move away from spinlock
> > based rwsems.
> >
> > Comments?
>
> Fine for me from the x86-64 side. Some more validation with a test suite
> would be good though.

David had a test suite somewhere, so I'll give that a run.
Re: fastcalls, was Re: [patch] generic rwsems
On Fri, Apr 13, 2007 at 11:19:30AM +0100, Christoph Hellwig wrote:
> On Fri, Apr 13, 2007 at 12:04:16PM +0200, Nick Piggin wrote:
> > Remove one level of indirection (kernel/rwsem.c -> lib/rwsem.c), and
> > give a bit of a cleanup (eg remove the fastcall junk) to make the
> > code a bit easier to read.
>
> Apropos fastcalls, now that -mregparm=3 is the default on i386 and
> FASTCALL/fastcall is a no-op everywhere else, can we please get rid
> of this attribute entirely? If any other architecture provides
> a more optimal non-standard calling convention we can just use
> it by default in kernelspace as a few architectures already do.

I don't see why not. David objected to my removing them, I think because the
rwsem code called some from asm (but of course that's removed too). However
AFAIK, this situation should be using the asmlinkage attribute anyway.

fastcall in kernel is annoying... it seems to imply that we normally make
rather slow calls ;)
Re: [patch] generic rwsems
On Friday 13 April 2007 12:04:16 Nick Piggin wrote:
> OK, this patch is against 2.6.21-rc6 + Mathieu's atomic_long patches.
>
> Last time this came up I was asked to get some numbers, so here are
> some in the changelog, captured with a simple kernel module tester.
> I got motivated again because of the MySQL/glibc/mmap_sem issue.
>
> This patch converts all architectures to a generic rwsem implementation,
> which will compile down to the same code for i386, or powerpc, for
> example, and will allow some (eg. x86-64) to move away from spinlock
> based rwsems.
>
> Comments?

Fine for me from the x86-64 side. Some more validation with a test suite
would be good though.

-Andi
Re: [patch] generic rwsems
On Fri, Apr 13, 2007 at 12:04:16PM +0200, Nick Piggin wrote:
> OK, this patch is against 2.6.21-rc6 + Mathieu's atomic_long patches.
>
> Last time this came up I was asked to get some numbers, so here are
> some in the changelog, captured with a simple kernel module tester.
> I got motivated again because of the MySQL/glibc/mmap_sem issue.
>
> This patch converts all architectures to a generic rwsem implementation,
> which will compile down to the same code for i386, or powerpc, for
> example, and will allow some (eg. x86-64) to move away from spinlock
> based rwsems.

Oh, and it also converts powerpc and sparc64 to 64-bit counters, so they can
handle more than 32K tasks waiting (which was apparently a real problem for
SGI, and is probably a good thing). But that reminds me:

> +/*
> + * the semaphore definition
> + */
> +struct rw_semaphore {
> +	atomic_long_t		count;
> +	spinlock_t		wait_lock;
> +	struct list_head	wait_list;
> +#ifdef CONFIG_DEBUG_LOCK_ALLOC
> +	struct lockdep_map	dep_map;
> +#endif
> +};

I think I should put wait_lock after wait_list, so as to get a better packing
on most 64-bit architectures.
fastcalls, was Re: [patch] generic rwsems
On Fri, Apr 13, 2007 at 12:04:16PM +0200, Nick Piggin wrote:
> Remove one level of indirection (kernel/rwsem.c -> lib/rwsem.c), and
> give a bit of a cleanup (eg remove the fastcall junk) to make the
> code a bit easier to read.

Apropos fastcalls, now that -mregparm=3 is the default on i386 and
FASTCALL/fastcall is a no-op everywhere else, can we please get rid of this
attribute entirely? If any other architecture provides a more optimal
non-standard calling convention we can just use it by default in kernelspace,
as a few architectures already do.
fastcalls, was Re: [patch] generic rwsems
On Fri, Apr 13, 2007 at 12:04:16PM +0200, Nick Piggin wrote: Remove one level of indirection (kernel/rwsem.c - lib/rwsem.c), and give a bit of a cleanup (eg remove the fastcall junk) to make the code a bit easier to read. Arpopos fastcalls, now that -mregparam=3 is the defaul on i386 and FASTCALL/fastcall is a noop everywhere else can we please get rid of this attribute entirely? If any other architecture provides a more optimal non-standard calling convention we can just use it by default in kernelspace as a few architectures already do. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] generic rwsems
On Fri, Apr 13, 2007 at 12:04:16PM +0200, Nick Piggin wrote: OK, this patch is against 2.6.21-rc6 + Mathieu's atomic_long patches. Last time this came up I was asked to get some numbers, so here are some in the changelog, captured with a simple kernel module tester. I got motivated again because of the MySQL/glibc/mmap_sem issue. This patch converts all architectures to a generic rwsem implementation, which will compile down to the same code for i386, or powerpc, for example, and will allow some (eg. x86-64) to move away from spinlock based rwsems. Oh, and it also converts powerpc and sparc64 to 64-bit counters, so they can handle more than 32K tasks waiting (which was apparently a real problem for SGI, and is probably a good thing). But that reminds me: +/* + * the semaphore definition + */ +struct rw_semaphore { + atomic_long_t count; + spinlock_t wait_lock; + struct list_headwait_list; +#ifdef CONFIG_DEBUG_LOCK_ALLOC + struct lockdep_map dep_map; +#endif +}; I think I should put wait_lock after wait_list, so as to get a better packing on most 64-bit architectures. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] generic rwsems
On Friday 13 April 2007 12:04:16 Nick Piggin wrote: OK, this patch is against 2.6.21-rc6 + Mathieu's atomic_long patches. Last time this came up I was asked to get some numbers, so here are some in the changelog, captured with a simple kernel module tester. I got motivated again because of the MySQL/glibc/mmap_sem issue. This patch converts all architectures to a generic rwsem implementation, which will compile down to the same code for i386, or powerpc, for example, and will allow some (eg. x86-64) to move away from spinlock based rwsems. Comments? Fine for me from the x86-64 side. Some more validation with a test suite would be good though. -Andi - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: fastcalls, was Re: [patch] generic rwsems
On Fri, Apr 13, 2007 at 11:19:30AM +0100, Christoph Hellwig wrote: On Fri, Apr 13, 2007 at 12:04:16PM +0200, Nick Piggin wrote: Remove one level of indirection (kernel/rwsem.c - lib/rwsem.c), and give a bit of a cleanup (eg remove the fastcall junk) to make the code a bit easier to read. Arpopos fastcalls, now that -mregparam=3 is the defaul on i386 and FASTCALL/fastcall is a noop everywhere else can we please get rid of this attribute entirely? If any other architecture provides a more optimal non-standard calling convention we can just use it by default in kernelspace as a few architectures already do. I don't see why not. David objected me removing them I think because the rwsem code called some from asm (but of course that's removed too). However AFAIK, this situation should be using the asmlinkage attribute anyway. fastcall in kernel is annoying... it seems to imply that we normally rather slow calls ;) - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] generic rwsems
On Fri, Apr 13, 2007 at 12:53:49PM +0200, Andi Kleen wrote: On Friday 13 April 2007 12:04:16 Nick Piggin wrote: OK, this patch is against 2.6.21-rc6 + Mathieu's atomic_long patches. Last time this came up I was asked to get some numbers, so here are some in the changelog, captured with a simple kernel module tester. I got motivated again because of the MySQL/glibc/mmap_sem issue. This patch converts all architectures to a generic rwsem implementation, which will compile down to the same code for i386, or powerpc, for example, and will allow some (eg. x86-64) to move away from spinlock based rwsems. Comments? Fine for me from the x86-64 side. Some more validation with a test suite would be good though. David had a test suite somewhere, so I'll give that a run. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] generic rwsems
Nick Piggin [EMAIL PROTECTED] wrote: I think I should put wait_lock after wait_list, so as to get a better packing on most 64-bit architectures. It makes no difference. struct lockdep_map contains at least one pointer and so is going to be 8-byte aligned (assuming it's there at all). struct rw_semaphore contains at least one pointer/long, so it will be padded out to 8-byte size. If you want to make a difference, you'd need to add __attribute__((packed)) but you would need to be very careful with that. David - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] generic rwsems
Nick Piggin <[EMAIL PROTECTED]> wrote:

> This patch converts all architectures to a generic rwsem implementation,
> which will compile down to the same code for i386, or powerpc, for
> example, and will allow some (eg. x86-64) to move away from spinlock
> based rwsems.

Which are better on UP kernels, because spinlocks degrade to nothing
there, and then you're left with a single disable/enable interrupt pair
per operation, and no requirement for atomic ops at all.

What you propose may wind up with several per op, because if the CPU does
not support atomic ops directly and cannot emulate them through other
atomic ops, then these have to be emulated by:

	atomic_op()
	{
		spin_lock_irqsave
		do op
		spin_unlock_irqrestore
	}

> Move to an architecture independent rwsem implementation, using the
> better of the two rwsem implementations

That's not necessarily the case, as I said above.  Furthermore, the
spinlock implementation struct is smaller on 64-bit machines, and is less
prone to counter overrun on 32-bit machines.

> Out-of-line the fastpaths, to bring rw-semaphores into line with mutexes
> and spinlocks WRT our icache vs function call policy.

That should depend on whether you optimise for space or for speed.
Function calls are relatively heavyweight.  Please re-inline and fix
Ingo's mess if you must clean up.

Take the i386 version, for instance: I'd made it so that the compiler
didn't know it was taking a function call when it went down the slow
path, thus meaning the compiler didn't have to deal with that.
Furthermore, it only interpolated two or three instructions into the
calling code in the fastpath.  It's a real shame that gcc inline asm
doesn't allow you to use status flags as boolean returns, otherwise I
could reduce that even further.

> Spinlock based rwsems are inferior to atomic based ones on most
> architectures that can do atomic ops without spinlocks:

Note the "most" in your statement...  Think about it.  This algorithm is
only optimal where XADD is available.  If you don't have XADD, but you do
have LL/SC or CMPXCHG, you can do better.  If the only atomic op you have
is XCHG, then this is a really poor choice; similarly if you are using a
UP-compiled kernel.

David
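[For readers of the archive: the emulation David describes can be
sketched in user space, with a pthread mutex standing in for
spin_lock_irqsave()/spin_unlock_irqrestore().  The names are
illustrative, not the kernel's:]

```c
/* Sketch of an atomic op emulated with a lock, as described above: on an
 * arch with no native atomic ops, every "atomic" operation costs a full
 * lock/unlock pair around the plain memory operation. */
#include <pthread.h>

typedef struct {
    pthread_mutex_t lock;   /* stands in for the irq-disabling spinlock */
    long counter;
} emulated_atomic_t;

static long emulated_atomic_add_return(emulated_atomic_t *v, long n)
{
    long ret;
    pthread_mutex_lock(&v->lock);    /* spin_lock_irqsave */
    v->counter += n;                 /* the "op" itself */
    ret = v->counter;
    pthread_mutex_unlock(&v->lock);  /* spin_unlock_irqrestore */
    return ret;
}
```

This is why a lock-based rwsem fastpath can actually be cheaper on such
an arch: it pays for one lock/unlock pair, not one per emulated atomic.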
Re: [patch] generic rwsems
On Fri, Apr 13, 2007 at 12:44:50PM +0100, David Howells wrote:
> Nick Piggin <[EMAIL PROTECTED]> wrote:
> > I think I should put wait_lock after wait_list, so as to get a better
> > packing on most 64-bit architectures.
>
> It makes no difference.  struct lockdep_map contains at least one
> pointer and so is going to be 8-byte aligned (assuming it's there at
> all).  struct rw_semaphore contains at least one pointer/long, so it
> will be padded out to 8-byte size.

I hope people are not going to enable lockdep on their production
systems :)
Re: [patch] generic rwsems
David, you keep saying the same things and don't listen to me.

On Fri, Apr 13, 2007 at 01:09:42PM +0100, David Howells wrote:
> Nick Piggin <[EMAIL PROTECTED]> wrote:
> > This patch converts all architectures to a generic rwsem
> > implementation, which will compile down to the same code for i386, or
> > powerpc, for example, and will allow some (eg. x86-64) to move away
> > from spinlock based rwsems.
>
> Which are better on UP kernels because spinlocks degrade to nothing, and
> then you're left with a single disable/enable interrupt pair per
> operation, and no requirement for atomic ops at all.

On UP, if an IRQ disable/enable pair is _faster_ than the atomic op, then
the architecture can and should implement atomic ops on UP by doing
exactly that.

> What you propose may wind up with several per op, because if the CPU
> does not support atomic ops directly and cannot emulate them through
> other atomic ops, then these have to be emulated by:
>
> 	atomic_op()
> 	{
> 		spin_lock_irqsave
> 		do op
> 		spin_unlock_irqrestore
> 	}

Yes, this is the case on our 2 premiere SMP powerhouse architectures,
sparc32 and parisc.  The other way happens to be better for everyone
else, which is why I think your suggestion to instead move everyone to
the spinlock version was weird.

Finally: as I said, even for those 2 architectures, this may not be so
bad, because it uses a different spinlock for the fastpath than it does
for the slowpath.  So your uncontested, cache cold case will get a little
slower, but the contested case could improve a lot (eg. I saw an order of
magnitude improvement).

> > Move to an architecture independent rwsem implementation, using the
> > better of the two rwsem implementations
>
> That's not necessarily the case, as I said above.  Furthermore, the
> spinlock implementation struct is smaller on 64-bit machines, and is
> less prone to counter overrun on 32-bit machines.

I think 64-bit machines will be happy to take the extra word if they get
double the single threaded performance, quadruple the parallel read
performance, and 10 times the contested read performance.

32-bit machines might indeed overflow, but if it hasn't been a problem
for i386 or (even 64-bit) powerpc yet, is it a real worry?

> > Out-of-line the fastpaths, to bring rw-semaphores into line with
> > mutexes and spinlocks WRT our icache vs function call policy.
>
> That should depend on whether you optimise for space or for speed.
> Function calls are relatively heavyweight.

This is what spinlocks and mutexes do, and they're much more common than
rwsems.  I'm just trying to make it consistent, and you can muck around
with it all you want after that.  It is actually very easy to inline
things now, unlike before my patch.

> Please re-inline and fix Ingo's mess if you must clean up.  Take the
> i386 version, for instance: I'd made it so that the compiler didn't
> know it was taking a function call when it went down the slow path,
> thus meaning the compiler didn't have to deal with that.  Furthermore,
> it only interpolated two or three instructions into the calling code in
> the fastpath.  It's a real shame that gcc inline asm doesn't allow you
> to use status flags as boolean returns, otherwise I could reduce that
> even further.

I cleaned it.

> > Spinlock based rwsems are inferior to atomic based ones on most
> > architectures that can do atomic ops without spinlocks:
>
> Note the "most" in your statement...  Think about it.  This algorithm
> is only optimal where XADD is available.  If you don't have XADD, but
> you do have LL/SC or CMPXCHG, you can do better.

You keep saying this too, and I have thought about it, but I couldn't
think of a much better way.  I'm not saying you're wrong, but why don't
you just tell me what that better way is?

> If the only atomic op you have is XCHG, then this is a really poor
> choice;

What is better?  Spinlocks?  I think that considering only 2 dead archs
really care at this stage, and that I have good reason to believe the
contested case will be improved, why don't you come up with some numbers
to prove me wrong?

> similarly if you are using a UP-compiled kernel.

Then your UP-compiled kernel's atomic ops are suboptimal, not the rwsem
implementation.

Anyway, thanks for taking the time again.  If you would please address
each of my points, then we might finally be able to stop having this
discussion every 6 months ;)

Nick
Re: [patch] generic rwsems
On Fri, Apr 13, 2007 at 02:43:03PM +0200, Nick Piggin wrote:
> Yes, this is the case on our 2 premiere SMP powerhouse architectures,
> sparc32 and parisc.

sparc32 is ultra-legacy and I have a tremendous amount of work to do on
SMP there.  I don't feel that efficiency of locking primitives is a
crucial issue for sparc32.

-- wli
Re: [patch] generic rwsems
Nick Piggin <[EMAIL PROTECTED]> wrote:

> The other way happens to be better for everyone else, which is why I
> think your suggestion to instead move everyone to the spinlock version
> was weird.

No, you misunderstand me.  My preferred solution is to leave it up to the
arch and not to make it generic, though I'm not averse to providing some
prepackaged solutions for an arch to pick from if it wants to.

> Finally: as I said, even for those 2 architectures, this may not be so
> bad, because it uses a different spinlock for the fastpath than it does
> for the slowpath.  So your uncontested, cache cold case will get a
> little slower, but the contested case could improve a lot (eg. I saw an
> order of magnitude improvement).

Agreed.  I can see why the spinlock implementation is bad on SMP.  By all
means change those cases, and reduce the spinlock implementation to an
interrupt disablement only version that may be used on UP only.

> 32-bit machines might indeed overflow, but if it hasn't been a problem
> for i386 or (even 64-bit) powerpc yet, is it a real worry?

It has happened, I believe.  People have tried having 32766 threads on a
32-bit box.  Mad it may be, but...

> This is what spinlocks and mutexes do, and they're much more common
> than rwsems.  I'm just trying to make it consistent, and you can muck
> around with it all you want after that.  It is actually very easy to
> inline things now, unlike before my patch.

My original stuff was very easy to inline until Ingo got hold of it.

> > Think about it.  This algorithm is only optimal where XADD is
> > available.  If you don't have XADD, but you do have LL/SC or CMPXCHG,
> > you can do better.
>
> You keep saying this too, and I have thought about it, but I couldn't
> think of a much better way.  I'm not saying you're wrong, but why don't
> you just tell me what that better way is?

Look at how the counter works in the spinlock case.  With the addition of
an extra flag in the counter to indicate there's stuff waiting on the
queue, you can manipulate the counter if it appears safe to do so,
otherwise you have to fall back to the slow path and take a spin lock.

Break the counter down like this:

	0x00000000	- not locked; queue empty
	0x40000000	- locked by writer; queue empty
	0xc0000000	- locked by writer; queue occupied
	0x0nnnnnnn	- n readers; queue empty
	0x8nnnnnnn	- n readers; queue occupied

Now here's a rough guide as to how the main operations would work:

 (*) down_read of unlocked

	cmpxchg(0 -> 1) == 0  [okay - you've got the lock]

 (*) down_read of readlocked

	cmpxchg(0 -> 1) == n  [failed to get the lock]
	do { n = cmpxchg(old_n -> old_n+1) } until n == old_n

 (*) down_read of writelocked or contended readlocked

	cmpxchg(0 -> 1) == 0x80000000|n  [lock contended]
	goto slowpath:
		spinlock
		try to get lock again
		if still contended:
			mark counter contended
			add to queue
			spinunlock
			sleep
		else:
			spinunlock

 (*) down_write of unlocked

	cmpxchg(0 -> 0x40000000) == 0  [okay - you've got the lock]

 (*) down_write of locked, contended or otherwise

	cmpxchg(0 -> 0x40000000) != 0  [failed]
	goto slowpath:
		spinlock
		try to get lock again
		if still unavailable:
			mark counter contended
			add to queue
			spinunlock
			sleep
		else:
			spinunlock

 (*) up_read

	x = cmpxchg(1 -> 0)
	if x == 1, done
	else:
		do { x = cmpxchg(old_x -> old_x-1) } until x == old_x
		if old_x == 0x80000001, wake_up_writer

 (*) up_write

	x = cmpxchg(0x40000000 -> 0)
	if x == 0x40000000, done
	else wake_up_readers

You can actually do better with LL/SC here, because for the initial
attempt with CMPXCHG you have to guess in each case what the numbers will
be.  Furthermore, with LL/SC you might actually be able to do an atomic
increment-or-set-contention-flag operation.

Note that the contention flag may only be set or cleared in the slow path
whilst you are holding the spinlock.

Doing down_read and up_read with CMPXCHG is a pain.  XADD or LL/SC would
be better, and LOCK INC/ADD/DEC/SUB won't work.  You can't use XADD in
down_*() as you may not change the bottom part of the counter if you're
going to end up queuing.

Actually, looking at it, it might be better to have the counter start off
at 0x80000000 for unlocked/no-contention and clear the no-contention flag
when you queue something.  That way you can check for the counter
becoming 0 in the up_*() functions as a trigger to go and invoke the
slowpath.  Then you could use LOCK DEC/SUB on i386 rather than XADD, as
you only need to check the Z flag.

Note there is a slight window whereby a reader can jump a writer that's
transiting between the fastpath part of down_write() and the slowpath.

David
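[For readers of the archive: the fast paths of this counter scheme can be
modelled in user space with C11 atomics standing in for the kernel's
cmpxchg().  The slow path (spinlock + wait queue) is omitted, and all
names here are invented for illustration:]

```c
/* Toy model of the CMPXCHG-based rwsem fastpaths sketched above.
 * Returning false means "take the slow path", which is not implemented
 * in this sketch. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RWSEM_WRITE_LOCKED  0x40000000u  /* locked by writer */
#define RWSEM_QUEUED        0x80000000u  /* queue occupied   */
#define RWSEM_READER_MASK   0x0fffffffu  /* n readers        */

struct toy_rwsem { _Atomic uint32_t count; };

/* down_read fast path: bump the reader count unless a writer holds the
 * lock or the queue is occupied. */
static bool toy_down_read_fast(struct toy_rwsem *sem)
{
    uint32_t old = atomic_load(&sem->count);
    for (;;) {
        if (old & (RWSEM_WRITE_LOCKED | RWSEM_QUEUED))
            return false;               /* contended: slow path */
        /* on failure, 'old' is reloaded with the current value */
        if (atomic_compare_exchange_weak(&sem->count, &old, old + 1))
            return true;                /* one more reader */
    }
}

/* up_read fast path: drop our reader count; if we were the last reader
 * and the queue flag is set, a waiter needs waking (slow path). */
static bool toy_up_read_fast(struct toy_rwsem *sem)
{
    uint32_t old = atomic_fetch_sub(&sem->count, 1);
    return !((old & RWSEM_QUEUED) && (old & RWSEM_READER_MASK) == 1);
}

/* down_write fast path: only succeeds on a completely idle semaphore. */
static bool toy_down_write_fast(struct toy_rwsem *sem)
{
    uint32_t expected = 0;
    return atomic_compare_exchange_strong(&sem->count, &expected,
                                          RWSEM_WRITE_LOCKED);
}
```

The compare-exchange loop in down_read is exactly the "guess, then retry
with the observed value" pattern David notes is cheaper with LL/SC.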
Re: [patch] generic rwsems
On Fri, Apr 13, 2007 at 02:31:52PM +0100, David Howells wrote:
> Nick Piggin <[EMAIL PROTECTED]> wrote:
> > The other way happens to be better for everyone else, which is why I
> > think your suggestion to instead move everyone to the spinlock
> > version was weird.
>
> No, you misunderstand me.  My preferred solution is to leave it up to
> the arch and not to make it generic, though I'm not averse to providing
> some prepackaged solutions for an arch to pick from if it wants to.

There just doesn't seem to be much payoff.  I know you didn't think there
is anything wrong with 2 different implementations and hundreds of lines
of arch specific assembly, but there is really little gain.  Sure, you
might be able to optimise a few cycles off the i386 asm, but damn, I just
blitzed that sort of improvement on x86-64... and I still doubt it will
be very noticeable, because rwsems don't get called like spinlocks.

> > Finally: as I said, even for those 2 architectures, this may not be
> > so bad, because it uses a different spinlock for the fastpath than it
> > does for the slowpath.  So your uncontested, cache cold case will get
> > a little slower, but the contested case could improve a lot (eg. I
> > saw an order of magnitude improvement).
>
> Agreed.  I can see why the spinlock implementation is bad on SMP.  By
> all means change those cases, and reduce the spinlock implementation to
> an interrupt disablement only version that may be used on UP only.

So you missed my point about this above.  If your UP atomic ops are
slower than interrupt disabling, then implement the damn things using
interrupt disabling instead of whatever it is you are using that is
slower! ;)

> > 32-bit machines might indeed overflow, but if it hasn't been a
> > problem for i386 or (even 64-bit) powerpc yet, is it a real worry?
>
> It has happened, I believe.  People have tried having 32766 threads on
> a 32-bit box.  Mad it may be, but...

Anyway, I doubt all the 32-bit archs using atomics would convert to the
slower spinlocks, so maybe this just has to be a known issue.

> > This is what spinlocks and mutexes do, and they're much more common
> > than rwsems.  I'm just trying to make it consistent, and you can muck
> > around with it all you want after that.  It is actually very easy to
> > inline things now, unlike before my patch.
>
> My original stuff was very easy to inline until Ingo got hold of it.

I agree that all the locking has turned pretty messy, but that isn't my
fault.

> > > Think about it.  This algorithm is only optimal where XADD is
> > > available.  If you don't have XADD, but you do have LL/SC or
> > > CMPXCHG, you can do better.
> >
> > You keep saying this too, and I have thought about it, but I couldn't
> > think of a much better way.  I'm not saying you're wrong, but why
> > don't you just tell me what that better way is?
>
> Look at how the counter works in the spinlock case.  With the addition
> of an extra flag in the counter to indicate there's stuff waiting on
> the queue, you can manipulate the counter if it appears safe to do so,
> otherwise you have to fall back to the slow path and take a spin lock.

[snip]

Ah, thanks.  Yeah, actually I remember you describing this at LCA, so I
apologise for saying you didn't ;)

Really, that isn't going to do much for performance (nothing as dramatic
as the x86-64 spinlock->atomic conversion).  However, it will reduce the
lock size by 8 bytes on 64-bit and fix the overflow on 32-bit...

So why don't we implement this as the generic version?  UP archs won't
care because atomic_cmpxchg is generally just an interrupt disable, and
similarly for sparc and parisc.  Most others except x86 do
atomic_add_return with ll/sc or cas anyway, and even if the cmpxchg is a
tiny bit slower for x86, at least it should be much faster than the
spinlock version for x86-64, and will solve the overflow for i386.

What do you say?

> > > similarly if you are using a UP-compiled kernel.
> >
> > Then your UP-compiled kernel's atomic ops are suboptimal, not the
> > rwsem implementation.
>
> That's usually a function of the CPU designer.

I mean your atomic_xxx functions are suboptimal for that design of CPU.
Re: [patch] generic rwsems
On Fri, 13 Apr 2007 14:31:52 +0100 David Howells <[EMAIL PROTECTED]> wrote:

> Break the counter down like this:
>
> 	0x00000000	- not locked; queue empty
> 	0x40000000	- locked by writer; queue empty
> 	0xc0000000	- locked by writer; queue occupied
> 	0x0nnnnnnn	- n readers; queue empty
> 	0x8nnnnnnn	- n readers; queue occupied

If space considerations are that important, we could then reserve one bit
for the 'wait_lock spinlock':

	0x20000000 : one cpu gained control of 'wait_list'

This would save 4 bytes on 32-bit platforms.  64-bit platforms could have
a limit of 2^60 threads, instead of the way too small 2^28 one ;)

(we lose the debug version of spinlock, of course)

Another possibility to save space would be to move wait_lock/wait_list
outside of rw_semaphore, into a hashed global array.  This would save
12/16 bytes per rw_semaphore (inode structs are probably the most
demanding)
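[For readers of the archive: Eric's idea amounts to a bit-spinlock folded
into the counter.  A user-space sketch with GCC atomic builtins, using
one high counter bit as the lock bit; names and the exact bit choice are
illustrative, and the kernel version would spin with cpu_relax():]

```c
/* Sketch of a bit-spinlock carved out of the rwsem counter: one counter
 * bit serializes access to wait_list, replacing a separate spinlock
 * word.  Single bit set/cleared with atomic or/and. */
#include <stdint.h>

#define WAIT_LIST_LOCK 0x20000000u   /* 'one cpu gained control of wait_list' */

static void wait_list_lock(volatile uint32_t *count)
{
    /* spin until we observe the bit clear while setting it ourselves */
    while (__sync_fetch_and_or(count, WAIT_LIST_LOCK) & WAIT_LIST_LOCK)
        ;                            /* cpu_relax() in the kernel */
}

static void wait_list_unlock(volatile uint32_t *count)
{
    __sync_fetch_and_and(count, ~WAIT_LIST_LOCK);
}
```

Note the trade-off Eric mentions: the saved word comes at the cost of
losing the debug spinlock machinery (and lockdep coverage) for this lock.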
Re: [patch] generic rwsems
Eric Dumazet <[EMAIL PROTECTED]> wrote:

> If space considerations are that important, we could then reserve one
> bit for the 'wait_lock spinlock'

That makes life quite a bit more tricky, though it does have the
advantage that it closes the reader-jumping-writer window I mentioned.

> Another possibility to save space would be to move wait_lock/wait_list
> outside of rw_semaphore, in a hashed global array.

I suspect moving wait_list out would be a bad idea.  The ordering of
things in the list is very important.  You need to perform several
operations on the list, all of which would be potentially slower:

 (1) glancing at the first element of the list to see what sort of wake
     up to do

 (2) iterating over the list when waking up multiple readers

 (3) seeing if the list is empty (so you know that there's no more
     contention)

Moving the spinlock out, on the other hand, might be worth it to cut down
on cacheline bouncing some more...

David