Re: Use atomics in libgomp mutex
On Mon, Nov 28, 2011 at 05:33:17PM -0800, Richard Henderson wrote:
> On 11/27/2011 02:57 PM, Alan Modra wrote:
> > This is the mutex part.  Depends on
> > http://gcc.gnu.org/ml/gcc-patches/2011-11/msg02454.html for MEMMODEL_*
> > values.
> >
> > 	* config/linux/mutex.h: Use atomic rather than sync builtins.
> > 	* config/linux/mutex.c: Likewise.  Comment.  Use -1 for waiting state.
> > 	* config/linux/omp-lock.h: Comment fix.
> > 	* config/linux/arm/mutex.h: Delete.
> > 	* config/linux/powerpc/mutex.h: Delete.
> > 	* config/linux/ia64/mutex.h: Delete.
> > 	* config/linux/mips/mutex.h: Delete.
>
> Looks good modulo _4/_n and using the success return of
> __atomic_compare_exchange.

Did you see my comment about the OpenMP spec requirement that
critical, ordered and parallel regions have a flush on entry and on
exit?  So should I commit a version with

static inline void
gomp_mutex_lock (gomp_mutex_t *mutex)
{
  int oldval = 0;
  /* FIXME: This should just be MEMMODEL_ACQUIRE, MEMMODEL_RELAXED
     but GOMP_critical_start and other functions rely on the lock
     acquisition doing a flush.  See OpenMP API version 3.1 section
     2.8.6 flush Construct.  */
  if (__builtin_expect (!__atomic_compare_exchange_n (mutex, &oldval, 1,
                                                      false, MEMMODEL_ACQ_REL,
                                                      MEMMODEL_ACQUIRE), 0))
    gomp_mutex_lock_slow (mutex, oldval);
}

or as posted in
http://gcc.gnu.org/ml/gcc-patches/2011-11/msg02455.html?  Perhaps with
the following as well

	* ordered.c (gomp_ordered_sync): Add MEMMODEL_RELEASE fence.
	* critical.c (GOMP_critical_start): Likewise.

I think we're OK on parallel regions by virtue of gomp_barrier_wait,
but I may not know what I'm talking about.  This stuff fries my
brains.

Index: ordered.c
===================================================================
--- ordered.c	(revision 181770)
+++ ordered.c	(working copy)
@@ -199,6 +199,9 @@ gomp_ordered_sync (void)
   if (team == NULL || team->nthreads == 1)
     return;
 
+  /* There is an implicit flush on entry to an ordered region.  */
+  __atomic_thread_fence (MEMMODEL_RELEASE);
+
   /* ??? I believe it to be safe to access this data without taking the
      ws->lock.  The only presumed race condition is with the previous
      thread on the queue incrementing ordered_cur such that it points
Index: critical.c
===================================================================
--- critical.c	(revision 181770)
+++ critical.c	(working copy)
@@ -33,6 +33,8 @@ static gomp_mutex_t default_lock;
 void
 GOMP_critical_start (void)
 {
+  /* There is an implicit flush on entry to a critical region.  */
+  __atomic_thread_fence (MEMMODEL_RELEASE);
   gomp_mutex_lock (&default_lock);
 }

-- 
Alan Modra
Australia Development Lab, IBM
Re: Use atomics in libgomp mutex
On 11/29/2011 04:27 AM, Alan Modra wrote:
> Did you see my comment about the OpenMP spec requirement that
> critical, ordered and parallel regions have a flush on entry and on
> exit?  So should I commit a version with

No, I missed that.  Too many branches to this thread.  ;-)

>   /* FIXME: This should just be MEMMODEL_ACQUIRE, MEMMODEL_RELAXED
>      but GOMP_critical_start and other functions rely on the lock
>      acquisition doing a flush.  See OpenMP API version 3.1 section
>      2.8.6 flush Construct.  */
>   if (__builtin_expect (!__atomic_compare_exchange_n (mutex, &oldval, 1,
>                                                       false, MEMMODEL_ACQ_REL,
>                                                       MEMMODEL_ACQUIRE), 0))

No, I think we should simply put other barriers elsewhere.

> +  /* There is an implicit flush on entry to an ordered region.  */
> +  __atomic_thread_fence (MEMMODEL_RELEASE);
> +
>    /* ??? I believe it to be safe to access this data without taking the
>       ws->lock.  The only presumed race condition is with the previous
>       thread on the queue incrementing ordered_cur such that it points

How about MEMMODEL_ACQ_REL, after the comment about the lock, which
would have done the flush?  Given that we don't have a lock, and no
ACQ barrier, perhaps the full barrier makes more sense?

> +  /* There is an implicit flush on entry to a critical region.  */
> +  __atomic_thread_fence (MEMMODEL_RELEASE);
>    gomp_mutex_lock (&default_lock);

And, yes, this together with the ACQUIRE barrier inside the lock makes
a full barrier.

(Which also reminds me that for gcc 4.8 we should expose these
barriers at both gimple and rtl levels and optimize them.  We'll get
two sequential lwsync-like insns here on any target where the barrier
isn't included in the lock insn directly.)

As for parallels, I agree the barrier implementation should flush all
data.

And I've found the menu for next week:
http://www.masterchef.com.au/fried-brains-with-bacon-crumble-and-apple-vinaigrette.htm

r~
Re: Use atomics in libgomp mutex
On Tue, Nov 29, 2011 at 08:37:07AM -0800, Richard Henderson wrote:
> How about MEMMODEL_ACQ_REL

Right.  Committed with that change, revision 181832.

-- 
Alan Modra
Australia Development Lab, IBM
Re: Use atomics in libgomp mutex
On Mon, Nov 28, 2011 at 10:37:09AM +1030, Alan Modra wrote:
> On Mon, Nov 28, 2011 at 09:27:20AM +1030, Alan Modra wrote:
> > This is the mutex part.  Depends on
> > http://gcc.gnu.org/ml/gcc-patches/2011-11/msg02454.html for MEMMODEL_*
> > values.
>
> Arrgh, I posted the wrong patch.  I know it needs work elsewhere in
> libgomp to comply with the OMP spec, which says of flush:
>
> 14 A flush region without a list is implied at the following locations:
> 15 • During a barrier region.
> 16 • At entry to and exit from parallel, critical, and ordered regions.
> [snip]
>
> Yet GOMP_critical_start simply calls gomp_mutex_lock, relying on the
> lock acquisition to do the flush.  This won't necessarily happen with
> MEMMODEL_ACQUIRE.  This bit in gomp_mutex_lock
>
> +  __atomic_compare_exchange_4 (mutex, &oldval, 1, false,
> +			       MEMMODEL_ACQUIRE, MEMMODEL_RELAXED);
>
> should be
>
>   __atomic_compare_exchange_4 (mutex, &oldval, 1, false,
> 			       MEMMODEL_ACQ_REL, MEMMODEL_ACQ_REL);
>
> and once you do that it's hard to justify the patch for stage3 as
> fixing a bug (unnecessary sync instructions emitted by __sync
> builtins).

If it is just GOMP_critical_start, couldn't it use an extra barrier
before resp. after it calls mutex_lock?

	Jakub
Re: Use atomics in libgomp mutex
On Mon, Nov 28, 2011 at 09:15:02AM +0100, Jakub Jelinek wrote:
> On Mon, Nov 28, 2011 at 10:37:09AM +1030, Alan Modra wrote:
> > 14 A flush region without a list is implied at the following locations:
> > 15 • During a barrier region.
> > 16 • At entry to and exit from parallel, critical, and ordered regions.
> > [snip]
>
> If it is just GOMP_critical_start, couldn't it use an extra barrier
> before resp. after it calls mutex_lock?

I believe gomp_ordered_sync needs a barrier, and I'm unsure about
GOMP_parallel_start but I think one is needed there too.  Yes, putting
the barriers where they are actually needed is the best solution, but
I'm far from an OpenMP expert so am unsure whether these are the only
places requiring a barrier.

-- 
Alan Modra
Australia Development Lab, IBM
Re: Use atomics in libgomp mutex
On 11/27/2011 02:57 PM, Alan Modra wrote:
> This is the mutex part.  Depends on
> http://gcc.gnu.org/ml/gcc-patches/2011-11/msg02454.html for MEMMODEL_*
> values.
>
> 	* config/linux/mutex.h: Use atomic rather than sync builtins.
> 	* config/linux/mutex.c: Likewise.  Comment.  Use -1 for waiting state.
> 	* config/linux/omp-lock.h: Comment fix.
> 	* config/linux/arm/mutex.h: Delete.
> 	* config/linux/powerpc/mutex.h: Delete.
> 	* config/linux/ia64/mutex.h: Delete.
> 	* config/linux/mips/mutex.h: Delete.

Looks good modulo _4/_n and using the success return of
__atomic_compare_exchange.

r~
Use atomics in libgomp mutex
This is the mutex part.  Depends on
http://gcc.gnu.org/ml/gcc-patches/2011-11/msg02454.html for MEMMODEL_*
values.

	* config/linux/mutex.h: Use atomic rather than sync builtins.
	* config/linux/mutex.c: Likewise.  Comment.  Use -1 for waiting state.
	* config/linux/omp-lock.h: Comment fix.
	* config/linux/arm/mutex.h: Delete.
	* config/linux/powerpc/mutex.h: Delete.
	* config/linux/ia64/mutex.h: Delete.
	* config/linux/mips/mutex.h: Delete.

Index: libgomp/config/linux/mutex.h
===================================================================
--- libgomp/config/linux/mutex.h	(revision 181718)
+++ libgomp/config/linux/mutex.h	(working copy)
@@ -33,39 +33,35 @@
 typedef int gomp_mutex_t;
 
 #define GOMP_MUTEX_INIT 0
 
-static inline void gomp_mutex_init (gomp_mutex_t *mutex)
+extern void gomp_mutex_lock_slow (gomp_mutex_t *mutex, int);
+extern void gomp_mutex_unlock_slow (gomp_mutex_t *mutex);
+
+static inline void
+gomp_mutex_init (gomp_mutex_t *mutex)
 {
   *mutex = 0;
 }
 
-extern void gomp_mutex_lock_slow (gomp_mutex_t *mutex, int);
-static inline void gomp_mutex_lock (gomp_mutex_t *mutex)
+static inline void
+gomp_mutex_destroy (gomp_mutex_t *mutex)
 {
-  int oldval = __sync_val_compare_and_swap (mutex, 0, 1);
-  if (__builtin_expect (oldval, 0))
-    gomp_mutex_lock_slow (mutex, oldval);
 }
 
-extern void gomp_mutex_unlock_slow (gomp_mutex_t *mutex);
-static inline void gomp_mutex_unlock (gomp_mutex_t *mutex)
+static inline void
+gomp_mutex_lock (gomp_mutex_t *mutex)
 {
-  /* Warning: By definition __sync_lock_test_and_set() does not have
-     proper memory barrier semantics for a mutex unlock operation.
-     However, this default implementation is written assuming that it
-     does, which is true for some targets.
-
-     Targets that require additional memory barriers before
-     __sync_lock_test_and_set to achieve the release semantics of
-     mutex unlock, are encouraged to include
-     config/linux/ia64/mutex.h in a target specific mutex.h instead
-     of using this file.  */
-  int val = __sync_lock_test_and_set (mutex, 0);
-  if (__builtin_expect (val > 1, 0))
-    gomp_mutex_unlock_slow (mutex);
+  int oldval = 0;
+  __atomic_compare_exchange_4 (mutex, &oldval, 1, false,
+			       MEMMODEL_ACQUIRE, MEMMODEL_RELAXED);
+  if (__builtin_expect (oldval != 0, 0))
+    gomp_mutex_lock_slow (mutex, oldval);
 }
 
-static inline void gomp_mutex_destroy (gomp_mutex_t *mutex)
+static inline void
+gomp_mutex_unlock (gomp_mutex_t *mutex)
 {
+  int wait = __atomic_exchange_4 (mutex, 0, MEMMODEL_RELEASE);
+  if (__builtin_expect (wait < 0, 0))
+    gomp_mutex_unlock_slow (mutex);
 }
-
 #endif /* GOMP_MUTEX_H */
Index: libgomp/config/linux/mutex.c
===================================================================
--- libgomp/config/linux/mutex.c	(revision 181718)
+++ libgomp/config/linux/mutex.c	(working copy)
@@ -34,25 +34,33 @@ long int gomp_futex_wait = FUTEX_WAIT |
 void
 gomp_mutex_lock_slow (gomp_mutex_t *mutex, int oldval)
 {
+  /* First loop spins a while.  */
   while (oldval == 1)
     {
       if (do_spin (mutex, 1))
 	{
-	  oldval = __sync_lock_test_and_set (mutex, 2);
+	  /* Spin timeout, nothing changed.  Set waiting flag.  */
+	  oldval = __atomic_exchange_4 (mutex, -1, MEMMODEL_ACQUIRE);
 	  if (oldval == 0)
 	    return;
-	  futex_wait (mutex, 2);
+	  futex_wait (mutex, -1);
 	  break;
 	}
       else
 	{
-	  oldval = __sync_val_compare_and_swap (mutex, 0, 1);
+	  /* Something changed.  If now unlocked, we're good to go.  */
+	  oldval = 0;
+	  __atomic_compare_exchange_4 (mutex, &oldval, 1, false,
+				       MEMMODEL_ACQUIRE, MEMMODEL_RELAXED);
 	  if (oldval == 0)
 	    return;
 	}
     }
-  while ((oldval = __sync_lock_test_and_set (mutex, 2)))
-    do_wait (mutex, 2);
+
+  /* Second loop waits until mutex is unlocked.  We always exit this
+     loop with wait flag set, so next unlock will awaken a thread.  */
+  while ((oldval = __atomic_exchange_4 (mutex, -1, MEMMODEL_ACQUIRE)))
+    do_wait (mutex, -1);
 }
 
 void
Index: libgomp/config/linux/omp-lock.h
===================================================================
--- libgomp/config/linux/omp-lock.h	(revision 181718)
+++ libgomp/config/linux/omp-lock.h	(working copy)
@@ -3,8 +3,8 @@
    structures without polluting the namespace.
 
    When using the Linux futex primitive, non-recursive locks require
-   only one int.  Recursive locks require we identify the owning task
-   and so require one int and a pointer.  */
+   one int.  Recursive locks require we identify the owning task
+   and so require in addition one int and a pointer.  */
 
 typedef int omp_lock_t;
 typedef struct { int lock, count; void *owner; } omp_nest_lock_t;
 
Index: libgomp/config/linux/arm/mutex.h
Re: Use atomics in libgomp mutex
On Mon, Nov 28, 2011 at 09:27:20AM +1030, Alan Modra wrote:
> This is the mutex part.  Depends on
> http://gcc.gnu.org/ml/gcc-patches/2011-11/msg02454.html for MEMMODEL_*
> values.

Arrgh, I posted the wrong patch.  I know it needs work elsewhere in
libgomp to comply with the OMP spec, which says of flush:

14 A flush region without a list is implied at the following locations:
15 • During a barrier region.
16 • At entry to and exit from parallel, critical, and ordered regions.
[snip]

Yet GOMP_critical_start simply calls gomp_mutex_lock, relying on the
lock acquisition to do the flush.  This won't necessarily happen with
MEMMODEL_ACQUIRE.  This bit in gomp_mutex_lock

+  __atomic_compare_exchange_4 (mutex, &oldval, 1, false,
+			       MEMMODEL_ACQUIRE, MEMMODEL_RELAXED);

should be

  __atomic_compare_exchange_4 (mutex, &oldval, 1, false,
			       MEMMODEL_ACQ_REL, MEMMODEL_ACQ_REL);

and once you do that it's hard to justify the patch for stage3 as
fixing a bug (unnecessary sync instructions emitted by __sync
builtins).

-- 
Alan Modra
Australia Development Lab, IBM