Re: [patch] powerpc: implement optimised mutex fastpaths
Nick Piggin writes:

> On Sun, Oct 12, 2008 at 07:47:32AM +0200, Nick Piggin wrote:
> > Implement a more optimal mutex fastpath for powerpc, making use of
> > acquire and release barrier semantics. This takes the mutex lock+unlock
> > benchmark from 203 to 173 cycles on a G5.
> >
> > +static inline int
> > +__mutex_fastpath_trylock(atomic_t *count, int (*fail_fn)(atomic_t *))
> > +{
> > +	if (likely(__mutex_cmpxchg_lock(count, 1, 0) == 1))
> > +		return 1;
>
> Oops, I must have sent the wrong version. This needs a return 0 here.

Are you going to send the right version? (or did you send it and I
missed it?)

Also I note that the comment you added just before that routine says:

> + * Change the count from 1 to a value lower than 1, and return 0 (failure)
> + * if it wasn't 1 originally, or return 1 (success) otherwise. This function
> + * MUST leave the value lower than 1 even when the 1 assertion wasn't true.

yet just doing a __mutex_cmpxchg_lock isn't going to do that. So please
fix either the comment or the code (or both :).

Paul.

___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev
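Paul's objection can be illustrated outside the kernel. A bare compare-and-swap from 1 to 0 leaves the count untouched when it fails, so it cannot satisfy the comment's "MUST leave the value lower than 1" contract by itself. A userspace C11 sketch (the function name is illustrative, not the kernel API):

```c
#include <stdatomic.h>

/* Userspace analogue of __mutex_cmpxchg_lock-based trylock: succeeds
 * only if the count was exactly 1; on failure the count is untouched,
 * which is exactly why the quoted kerneldoc contract isn't met. */
static int trylock_via_cmpxchg(atomic_int *count)
{
	int expected = 1;

	return atomic_compare_exchange_strong(count, &expected, 0) ? 1 : 0;
}
```

With a contended count of -1, the failed CAS leaves it at -1 rather than pushing it below 1 from a starting value of 0 or 1, so either the comment or the code has to change, as Paul says.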
Re: [patch] powerpc: implement optimised mutex fastpaths
On Thu, Nov 06, 2008 at 03:09:08PM +1100, Paul Mackerras wrote:
> Nick Piggin writes:
>
> > On Sun, Oct 12, 2008 at 07:47:32AM +0200, Nick Piggin wrote:
> > > Implement a more optimal mutex fastpath for powerpc, making use of
> > > acquire and release barrier semantics. This takes the mutex lock+unlock
> > > benchmark from 203 to 173 cycles on a G5.
> > >
> > > +static inline int
> > > +__mutex_fastpath_trylock(atomic_t *count, int (*fail_fn)(atomic_t *))
> > > +{
> > > +	if (likely(__mutex_cmpxchg_lock(count, 1, 0) == 1))
> > > +		return 1;
> >
> > Oops, I must have sent the wrong version. This needs a return 0 here.
>
> Are you going to send the right version? (or did you send it and I
> missed it?)

Hmm, I thought I did but perhaps not. Will do...

> Also I note that the comment you added just before that routine says:
>
> > + * Change the count from 1 to a value lower than 1, and return 0 (failure)
> > + * if it wasn't 1 originally, or return 1 (success) otherwise. This function
> > + * MUST leave the value lower than 1 even when the 1 assertion wasn't true.
>
> yet just doing a __mutex_cmpxchg_lock isn't going to do that. So please
> fix either the comment or the code (or both :).

OK.
Re: [patch] powerpc: implement optimised mutex fastpaths
On Mon, Oct 13, 2008 at 11:20:20AM -0500, Scott Wood wrote:
> On Mon, Oct 13, 2008 at 11:15:47AM -0500, Scott Wood wrote:
> > On Sun, Oct 12, 2008 at 07:47:32AM +0200, Nick Piggin wrote:
> > > +static inline int __mutex_cmpxchg_lock(atomic_t *v, int old, int new)
> > > +{
> > > +	int t;
> > > +
> > > +	__asm__ __volatile__ (
> > > +"1:	lwarx	%0,0,%1		# mutex trylock\n\
> > > +	cmpw	0,%0,%2\n\
> > > +	bne-	2f\n"
> > > +	PPC405_ERR77(0,%1)
> > > +"	stwcx.	%3,0,%1\n\
> > > +	bne-	1b"
> > > +	ISYNC_ON_SMP
> > > +	"\n\
> > > +2:"
> > > +	: "=&r" (t)
> > > +	: "r" (&v->counter), "r" (old), "r" (new)
> > > +	: "cc", "memory");
> >
> > This will break if the compiler picks r0 for v->counter. Use "b" as the
> > constraint, or better yet do lwarx %0,%y1, with a "Z" (&v->counter)
> > constraint.
>
> Sorry, had a morning brain-fart there -- you're not using the base
> register.

OK no problem. Do you think it looks OK? Thanks for reviewing...
Re: [patch] powerpc: implement optimised mutex fastpaths
On Sun, Oct 12, 2008 at 07:47:32AM +0200, Nick Piggin wrote:
> +static inline int __mutex_cmpxchg_lock(atomic_t *v, int old, int new)
> +{
> +	int t;
> +
> +	__asm__ __volatile__ (
> +"1:	lwarx	%0,0,%1		# mutex trylock\n\
> +	cmpw	0,%0,%2\n\
> +	bne-	2f\n"
> +	PPC405_ERR77(0,%1)
> +"	stwcx.	%3,0,%1\n\
> +	bne-	1b"
> +	ISYNC_ON_SMP
> +	"\n\
> +2:"
> +	: "=&r" (t)
> +	: "r" (&v->counter), "r" (old), "r" (new)
> +	: "cc", "memory");

This will break if the compiler picks r0 for v->counter. Use "b" as the
constraint, or better yet do lwarx %0,%y1, with a "Z" (&v->counter)
constraint.

-Scott
Re: [patch] powerpc: implement optimised mutex fastpaths
On Mon, Oct 13, 2008 at 11:15:47AM -0500, Scott Wood wrote:
> On Sun, Oct 12, 2008 at 07:47:32AM +0200, Nick Piggin wrote:
> > +static inline int __mutex_cmpxchg_lock(atomic_t *v, int old, int new)
> > +{
> > +	int t;
> > +
> > +	__asm__ __volatile__ (
> > +"1:	lwarx	%0,0,%1		# mutex trylock\n\
> > +	cmpw	0,%0,%2\n\
> > +	bne-	2f\n"
> > +	PPC405_ERR77(0,%1)
> > +"	stwcx.	%3,0,%1\n\
> > +	bne-	1b"
> > +	ISYNC_ON_SMP
> > +	"\n\
> > +2:"
> > +	: "=&r" (t)
> > +	: "r" (&v->counter), "r" (old), "r" (new)
> > +	: "cc", "memory");
>
> This will break if the compiler picks r0 for v->counter. Use "b" as the
> constraint, or better yet do lwarx %0,%y1, with a "Z" (&v->counter)
> constraint.

Sorry, had a morning brain-fart there -- you're not using the base
register.

-Scott
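For reference, the "%y"/"Z" form Scott suggests looks roughly like the fragment below. It assumes GCC on powerpc: the "Z" constraint passes a memory operand, and the "%y" output modifier expands it into the base,index register pair that lwarx expects, so the compiler knows r0 cannot be used as the base. This is an untested illustration of the constraint style, not the patch as merged:

```c
/* Hypothetical helper, powerpc-only: load-reserve via "Z"/%y constraints. */
static inline int load_reserve(int *p)
{
	int t;

	__asm__ __volatile__(
		"lwarx	%0,%y1"
		: "=r" (t)
		: "Z" (*p));
	return t;
}
```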
[patch] powerpc: implement optimised mutex fastpaths
Implement a more optimal mutex fastpath for powerpc, making use of acquire
and release barrier semantics. This takes the mutex lock+unlock benchmark
from 203 to 173 cycles on a G5.

Signed-off-by: Nick Piggin [EMAIL PROTECTED]
---
Index: linux-2.6/arch/powerpc/include/asm/mutex.h
===================================================================
--- linux-2.6.orig/arch/powerpc/include/asm/mutex.h
+++ linux-2.6/arch/powerpc/include/asm/mutex.h
@@ -1,9 +1,136 @@
 /*
- * Pull in the generic implementation for the mutex fastpath.
+ * Optimised mutex implementation of include/asm-generic/mutex-dec.h algorithm
+ */
+#ifndef _ASM_POWERPC_MUTEX_H
+#define _ASM_POWERPC_MUTEX_H
+
+static inline int __mutex_cmpxchg_lock(atomic_t *v, int old, int new)
+{
+	int t;
+
+	__asm__ __volatile__ (
+"1:	lwarx	%0,0,%1		# mutex trylock\n\
+	cmpw	0,%0,%2\n\
+	bne-	2f\n"
+	PPC405_ERR77(0,%1)
+"	stwcx.	%3,0,%1\n\
+	bne-	1b"
+	ISYNC_ON_SMP
+	"\n\
+2:"
+	: "=&r" (t)
+	: "r" (&v->counter), "r" (old), "r" (new)
+	: "cc", "memory");
+
+	return t;
+}
+
+static inline int __mutex_dec_return_lock(atomic_t *v)
+{
+	int t;
+
+	__asm__ __volatile__(
+"1:	lwarx	%0,0,%1		# mutex lock\n\
+	addic	%0,%0,-1\n"
+	PPC405_ERR77(0,%1)
+"	stwcx.	%0,0,%1\n\
+	bne-	1b"
+	ISYNC_ON_SMP
+	: "=&r" (t)
+	: "r" (&v->counter)
+	: "cc", "memory");
+
+	return t;
+}
+
+static inline int __mutex_inc_return_unlock(atomic_t *v)
+{
+	int t;
+
+	__asm__ __volatile__(
+	LWSYNC_ON_SMP
+"1:	lwarx	%0,0,%1		# mutex unlock\n\
+	addic	%0,%0,1\n"
+	PPC405_ERR77(0,%1)
+"	stwcx.	%0,0,%1\n\
+	bne-	1b"
+	: "=&r" (t)
+	: "r" (&v->counter)
+	: "cc", "memory");
+
+	return t;
+}
+
+/**
+ * __mutex_fastpath_lock - try to take the lock by moving the count
+ *                         from 1 to a 0 value
+ * @count: pointer of type atomic_t
+ * @fail_fn: function to call if the original value was not 1
+ *
+ * Change the count from 1 to a value lower than 1, and call fail_fn if
+ * it wasn't 1 originally. This function MUST leave the value lower than
+ * 1 even when the 1 assertion wasn't true.
+ */
+static inline void
+__mutex_fastpath_lock(atomic_t *count, void (*fail_fn)(atomic_t *))
+{
+	if (unlikely(__mutex_dec_return_lock(count) < 0))
+		fail_fn(count);
+}
+
+/**
+ * __mutex_fastpath_lock_retval - try to take the lock by moving the count
+ *                                from 1 to a 0 value
+ * @count: pointer of type atomic_t
+ * @fail_fn: function to call if the original value was not 1
+ *
+ * Change the count from 1 to a value lower than 1, and call fail_fn if
+ * it wasn't 1 originally. This function returns 0 if the fastpath succeeds,
+ * or anything the slow path function returns.
+ */
+static inline int
+__mutex_fastpath_lock_retval(atomic_t *count, int (*fail_fn)(atomic_t *))
+{
+	if (unlikely(__mutex_dec_return_lock(count) < 0))
+		return fail_fn(count);
+	return 0;
+}
+
+/**
+ * __mutex_fastpath_unlock - try to promote the count from 0 to 1
+ * @count: pointer of type atomic_t
+ * @fail_fn: function to call if the original value was not 0
+ *
+ * Try to promote the count from 0 to 1. If it wasn't 0, call fail_fn.
+ * In the failure case, this function is allowed to either set the value to
+ * 1, or to set it to a value lower than 1.
+ */
+static inline void
+__mutex_fastpath_unlock(atomic_t *count, void (*fail_fn)(atomic_t *))
+{
+	if (unlikely(__mutex_inc_return_unlock(count) <= 0))
+		fail_fn(count);
+}
+
+#define __mutex_slowpath_needs_to_unlock()	1
+
+/**
+ * __mutex_fastpath_trylock - try to acquire the mutex, without waiting
+ *
+ * @count: pointer of type atomic_t
+ * @fail_fn: fallback function
  *
- * TODO: implement optimized primitives instead, or leave the generic
- * implementation in place, or pick the atomic_xchg() based generic
- * implementation. (see asm-generic/mutex-xchg.h for details)
+ * Change the count from 1 to a value lower than 1, and return 0 (failure)
+ * if it wasn't 1 originally, or return 1 (success) otherwise. This function
+ * MUST leave the value lower than 1 even when the 1 assertion wasn't true.
+ * Additionally, if the value was 0 originally, this function must not leave
+ * it to 0 on failure.
  */
+static inline int
+__mutex_fastpath_trylock(atomic_t *count, int (*fail_fn)(atomic_t *))
+{
+	if (likely(__mutex_cmpxchg_lock(count, 1, 0) == 1))
+		return 1;
+}
-#include <asm-generic/mutex-dec.h>
+#endif
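The mutex-dec algorithm the patch above implements (count: 1 = unlocked, 0 = locked, negative = locked with waiters) can be sketched in portable userspace C11. The names are illustrative, not the kernel API, and the trylock here includes the `return 0` that the posted version is missing:

```c
#include <stdatomic.h>

/* Lock fastpath: decrement; a result below 0 means take the slowpath. */
static int fastpath_lock(atomic_int *count)
{
	return (atomic_fetch_sub(count, 1) - 1) >= 0;	/* 1 = acquired */
}

/* Unlock fastpath: increment; ending at or below 0 means wake waiters. */
static int fastpath_unlock(atomic_int *count)
{
	return (atomic_fetch_add(count, 1) + 1) > 0;	/* 1 = no waiters */
}

/* Trylock: atomically swing the count from 1 (unlocked) to 0 (locked). */
static int fastpath_trylock(atomic_int *count)
{
	int expected = 1;

	if (atomic_compare_exchange_strong(count, &expected, 0))
		return 1;
	return 0;	/* the statement missing from the posted patch */
}
```

This mirrors include/asm-generic/mutex-dec.h semantics; the kernel versions additionally carry the acquire (isync) and release (lwsync) barriers that the asm above provides.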
Re: [patch] powerpc: implement optimised mutex fastpaths
On Sun, Oct 12, 2008 at 07:47:32AM +0200, Nick Piggin wrote:
> Implement a more optimal mutex fastpath for powerpc, making use of
> acquire and release barrier semantics. This takes the mutex lock+unlock
> benchmark from 203 to 173 cycles on a G5.
>
> +static inline int
> +__mutex_fastpath_trylock(atomic_t *count, int (*fail_fn)(atomic_t *))
> +{
> +	if (likely(__mutex_cmpxchg_lock(count, 1, 0) == 1))
> +		return 1;

Oops, I must have sent the wrong version. This needs a return 0 here.

> +}