Re: [PATCH] 4th try: i386 rw_semaphores fix
Andrew Morton wrote:

> It still doesn't compile with gcc-2.91.66 because of the "#define
> rwsemdebug(FMT, ...)" thing. What can we do about this?

Hmmm... It probably needs to be made conditional on the version of the compiler by means of the macros that hold the version numbers.

> I cooked up a few more tests, generally beat the thing up. This code
> works. Good stuff. I'll do some more testing later this week - put some
> delays inside the semaphore code itself to stretch the timing windows,
> but I don't see how it can break it.

Excellent. I tried to keep it as simple as possible to make it easier to test and follow.

> Manfred says the rep;nop should come *before* the memory operation, but
> I don't know if he's been heard...

Probably... It's copied directly out of the spinlock stuff. I noticed it at the time, but it didn't stick in my memory.

> The spin_lock_irqsave() in up_read/up_write shouldn't be necessary,
> plus it puts the readers onto the waitqueue with
> add_wait_queue_exclusive, so an up_write will only release one reader.

Not so... have you looked at the new __wake_up_ctx_common() function? It actually ignores the exclusive flag, since it uses other criteria as to whom to wake. All the add_wait_queue_exclusive() function does is put the process on the _back_ of the queue as far as rwsems are concerned.

> Other architectures need work. If they can live with a spinlock in the
> fastpath, then the code at
> http://www.uow.edu.au/~andrewm/linux/rw_semaphore.tar.gz
> is bog-simple and works.

Sorry to pre-empt you, but have you seen my "advanced" patch? I sent it to the list in an email with the subject:

  [PATCH] i386 rw_semaphores, general abstraction patch

I added a general fallback implementation with the inline implementation functions written wholly in C and using spinlocks.

> Now I think we should specify the API a bit:

Agreed... I'll think more on it on Tuesday, though... after Easter. Your points, however, look fairly reasonable.
> Perhaps the writer-wakes-multiple readers stuff isn't happening

It does. Using my test program & module, try:

  driver -5 & sleep 1; driver 100 &

Keep doing "ps", and you'll see all the reader processes jump into the S state at the same time, once all the writers have finished. You may need to increase the delay between the ups and downs in the module to see it clearly.

David Howells
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] 4th try: i386 rw_semaphores fix
David Howells wrote:
>
> Here's the RW semaphore patch attempt #4. This fixes the bugs that
> Andrew Morton's test cases showed up.

It still doesn't compile with gcc-2.91.66 because of the "#define rwsemdebug(FMT, ...)" thing. What can we do about this?

I cooked up a few more tests, generally beat the thing up. This code works. Good stuff. I'll do some more testing later this week - put some delays inside the semaphore code itself to stretch the timing windows - but I don't see how it can break it.

Manfred says the rep;nop should come *before* the memory operation, but I don't know if he's been heard...

parisc looks OK. ppc tried hard, but didn't get it right. The spin_lock_irqsave() in up_read/up_write shouldn't be necessary, plus it puts the readers onto the waitqueue with add_wait_queue_exclusive, so an up_write will only release one reader. Looks like they'll end up with stuck readers.

Other architectures need work. If they can live with a spinlock in the fastpath, then the code at http://www.uow.edu.au/~andrewm/linux/rw_semaphore.tar.gz is bog-simple and works.

Now I think we should specify the API a bit:

* A task is not allowed to down_read() the same rwsem more than once. Otherwise another task can do a down_write() in the middle and cause deadlock. (I don't think this has been specified for read_lock() and write_lock(). There's no such deadlock on the x86 implementation of these. Other architectures? Heaven knows. They're probably OK.)

* up_read() and up_write() may *not* be called from interrupt context.

The latter is just a suggestion. It looks like your code is safe for interrupt-context upping - but it isn't tested for this. The guys who are going to support rwsems for architectures which have less rich atomic ops are going to *need* to know the rules here. Let's tell 'em.

If we get agreement on these points, then please drop a comment in there and I believe we're done.
In another email you said:

> schedule() disabled:
> reads taken: 585629
> writes taken: 292997

That's interesting. With two writer threads and four reader threads hammering the rwsem, the write/read grant ratio is 0.5003. Neat, but I'd have expected readers to do better. Perhaps the writer-wakes-multiple-readers stuff isn't happening. I'll test for this.

There was some discussion regarding contention rates a few days back. I did some instrumentation on the normal semaphores. The highest contention rate I could observe was during a `make -j3 bzImage' on dual CPU. With this workload, down() enters __down() 2% of the time. That's a heck of a lot. It means that the semaphore slow path is by a very large margin the most expensive part.

The most contention was on i_sem in real_lookup(), then pipe_sem and vfork_sem. ext2_lookup is slow; no news there. But tweaking the semaphore slowpath won't help here, because when a process slept in __down(), it was always the only sleeper.

With `dbench 12' the contention rate was much lower (0.01%). But when someone slept, there was almost always another sleeper, so the extra wake_up() at the end of __down() is worth attention.

I don't have any decent multithread workloads which would allow me to form an opinion on whether we need to optimise the slowpath, or to more finely thread the contention areas. Could be with some workloads the contention is significant and threading is painful. I'll have another look at it sometime...
[PATCH] 4th try: i386 rw_semaphores fix
Here's the RW semaphore patch attempt #4. This fixes the bugs that Andrew Morton's test cases showed up. It simplifies the __wake_up_ctx_common() function and adds an iterative clause to the end of rwsem_wake().

David

diff -uNr linux-2.4.3/arch/i386/config.in linux-rwsem/arch/i386/config.in
--- linux-2.4.3/arch/i386/config.in	Thu Apr  5 14:44:04 2001
+++ linux-rwsem/arch/i386/config.in	Wed Apr 11 08:38:04 2001
@@ -47,11 +47,13 @@
 if [ "$CONFIG_M386" = "y" ]; then
 	define_bool CONFIG_X86_CMPXCHG n
+	define_bool CONFIG_X86_XADD n
 	define_int CONFIG_X86_L1_CACHE_SHIFT 4
 else
 	define_bool CONFIG_X86_WP_WORKS_OK y
 	define_bool CONFIG_X86_INVLPG y
 	define_bool CONFIG_X86_CMPXCHG y
+	define_bool CONFIG_X86_XADD y
 	define_bool CONFIG_X86_BSWAP y
 	define_bool CONFIG_X86_POPAD_OK y
 fi
diff -uNr linux-2.4.3/arch/i386/kernel/semaphore.c linux-rwsem/arch/i386/kernel/semaphore.c
--- linux-2.4.3/arch/i386/kernel/semaphore.c	Thu Apr  5 14:38:34 2001
+++ linux-rwsem/arch/i386/kernel/semaphore.c	Wed Apr 11 22:30:59 2001
@@ -14,7 +14,6 @@
  */
 #include <linux/config.h>
 #include <linux/sched.h>
-
 #include <asm/semaphore.h>
 
 /*
@@ -179,6 +178,7 @@
  * value..
  */
 asm(
+".text\n"
 ".align 4\n"
 ".globl __down_failed\n"
 "__down_failed:\n\t"
@@ -193,6 +193,7 @@
 );
 
 asm(
+".text\n"
 ".align 4\n"
 ".globl __down_failed_interruptible\n"
 "__down_failed_interruptible:\n\t"
@@ -205,6 +206,7 @@
 );
 
 asm(
+".text\n"
 ".align 4\n"
 ".globl __down_failed_trylock\n"
 "__down_failed_trylock:\n\t"
@@ -217,6 +219,7 @@
 );
 
 asm(
+".text\n"
 ".align 4\n"
 ".globl __up_wakeup\n"
 "__up_wakeup:\n\t"
@@ -232,197 +235,292 @@
 asm(
 "
+.text
 .align 4
 .globl __down_read_failed
 __down_read_failed:
 	pushl	%edx
 	pushl	%ecx
-	jnc	2f
-
-3:	call	down_read_failed_biased
-
-1:	popl	%ecx
+	call	down_read_failed
+	popl	%ecx
 	popl	%edx
 	ret
-
-2:	call	down_read_failed
-	" LOCK "subl	$1,(%eax)
-	jns	1b
-	jnc	2b
-	jmp	3b
 "
 );
 
 asm(
 "
+.text
 .align 4
 .globl __down_write_failed
 __down_write_failed:
 	pushl	%edx
 	pushl	%ecx
-	jnc	2f
-
-3:	call	down_write_failed_biased
-
-1:	popl	%ecx
+	call	down_write_failed
+	popl	%ecx
 	popl	%edx
 	ret
-
-2:	call	down_write_failed
-	" LOCK "subl	$" RW_LOCK_BIAS_STR ",(%eax)
-	jz	1b
-	jnc	2b
-	jmp	3b
 "
 );
 
-struct rw_semaphore *FASTCALL(rwsem_wake_readers(struct rw_semaphore *sem));
-struct rw_semaphore *FASTCALL(rwsem_wake_writer(struct rw_semaphore *sem));
+asm(
+"
+.text
+.align 4
+.globl __rwsem_wake
+__rwsem_wake:
+	pushl	%edx
+	pushl	%ecx
+	call	rwsem_wake
+	popl	%ecx
+	popl	%edx
+	ret
+"
+);
 
-struct rw_semaphore *FASTCALL(down_read_failed_biased(struct rw_semaphore *sem));
-struct rw_semaphore *FASTCALL(down_write_failed_biased(struct rw_semaphore *sem));
+struct rw_semaphore *FASTCALL(rwsem_wake(struct rw_semaphore *sem));
 struct rw_semaphore *FASTCALL(down_read_failed(struct rw_semaphore *sem));
 struct rw_semaphore *FASTCALL(down_write_failed(struct rw_semaphore *sem));
 
-struct rw_semaphore *down_read_failed_biased(struct rw_semaphore *sem)
+/*
+ * implement exchange and add functionality
+ */
+static inline int rwsem_atomic_update(int delta, struct rw_semaphore *sem)
 {
-	struct task_struct *tsk = current;
-	DECLARE_WAITQUEUE(wait, tsk);
-
-	add_wait_queue(&sem->wait, &wait);	/* put ourselves at the head of the list */
+	int tmp = delta;
 
-	for (;;) {
-		if (sem->read_bias_granted && xchg(&sem->read_bias_granted, 0))
-			break;
-		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
-		if (!sem->read_bias_granted)
-			schedule();
-	}
-
-	remove_wait_queue(&sem->wait, &wait);
-	tsk->state = TASK_RUNNING;
+#ifndef CONFIG_USING_SPINLOCK_BASED_RWSEM
+	__asm__ __volatile__(
+		LOCK_PREFIX "xadd %0,(%1)"
+		: "+r"(tmp)
+		: "r"(sem)
+		: "memory");
+
+#else
+
+	__asm__ __volatile__(
+		"# beginning rwsem_atomic_update\n\t"
+#ifdef CONFIG_SMP
+LOCK_PREFIX	"  decb "RWSEM_SPINLOCK_OFFSET_STR"(%1)\n\t" /* try to grab the spinlock */
+		"  js	3f\n" /* jump if failed */
+		"1:\n\t"
+#endif
+		"  xchgl %0,(%1)\n\t" /* retrieve the old value */
+		"  addl  %0,(%1)\n\t" /* add 0x0001, result in memory */
+#ifdef CONFIG_SMP
+		"  movb  $1,"RWSEM_SPINLOCK_OFFSET_STR"(%1)\n\t" /* release the spinlock */
+#endif
+		".section .text.lock,\"ax\"\n"
+#ifdef CONFIG_SMP
+		"3:\n\t" /* spin on the spinlock till we get it */
+		"  cmpb  $0,"RWSEM_SPINLOCK_OFFSET_STR"(%1)\n\t"
+		"  rep;nop