Re: Fix libgomp semaphores

2011-11-29 Thread Alan Modra
On Tue, Nov 29, 2011 at 08:09:31AM -0800, Richard Henderson wrote:
> Tight loops like this should use the weak compare-exchange so that we don't 
> get two nested loops without need.

Done.

> I do think we need more commentary here so that next month I can still follow 
> it.

Added this in gomp_sem_post:

  /* Clear SEM_WAIT here so that if there are no more waiting threads
     we transition back to the uncontended state that does not make
     futex syscalls.  If there are waiting threads then when one is
     awoken it will set SEM_WAIT again, so other waiting threads are
     woken on a future gomp_sem_post.  Furthermore, the awoken thread
     will wake other threads in case gomp_sem_post was called again
     before it had time to set SEM_WAIT.  */

and committed revision 181831 without likely_compare_exchange as
discussed.
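
For reference, here is roughly where that comment lands; this is a
reconstruction from the patch pieces in this thread, not a verbatim
copy of r181831:

static inline void
gomp_sem_post (gomp_sem_t *sem)
{
  int count = *sem;

  /* Clear SEM_WAIT here so that if there are no more waiting threads
     we transition back to the uncontended state that does not make
     futex syscalls.  If there are waiting threads then when one is
     awoken it will set SEM_WAIT again, so other waiting threads are
     woken on a future gomp_sem_post.  Furthermore, the awoken thread
     will wake other threads in case gomp_sem_post was called again
     before it had time to set SEM_WAIT.  */
  while (!__atomic_compare_exchange_n (sem, &count,
                                       (count + SEM_INC) & ~SEM_WAIT, true,
                                       MEMMODEL_RELEASE, MEMMODEL_RELAXED))
    continue;

  if (__builtin_expect (count & SEM_WAIT, 0))
    gomp_sem_post_slow (sem);
}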

-- 
Alan Modra
Australia Development Lab, IBM


Re: Fix libgomp semaphores

2011-11-29 Thread Richard Henderson
On 11/29/2011 08:09 AM, Richard Henderson wrote:
>> +static inline bool
>> +likely_compare_exchange (int *ptr, int *expected, int val, bool weak,
>> + enum memmodel succ, enum memmodel fail)
>>  {
>> +  return __builtin_expect (__atomic_compare_exchange_n (ptr, expected, val,
>> +weak, succ, fail), 1);
>>  }
> 
> Please move this to libgomp.h just below the memmodel definition, as
> a macro so that it's not tied to int.  It seems likely that we'll
> want this elsewhere in libgomp as we convert things.

... although it does occur to me that it would be more useful to adjust 
gcc/predict.c so that we get this by default.  Then we don't have to contort 
loops like

> +  while (1)
> +if (likely_compare_exchange (sem, &count, ((count + SEM_INC) & ~SEM_WAIT),
> +  false, MEMMODEL_RELEASE, MEMMODEL_RELAXED))
> +  break;

this.  This is surely more natural as

  while (!__atomic_compare_exchange_n (sem, &count, ...))
continue;


r~


Re: Fix libgomp semaphores

2011-11-29 Thread Richard Henderson
On 11/29/2011 04:49 AM, Alan Modra wrote:
> Since there were a number of changes requested, and some confusion
> over what was happening in this code, it might be better if I post
> a new patch and we start over.  In this patch I'm still using the MSB
> for the wait flag, and count in the low 31 bits, as that way is
> slightly more efficient on powerpc.  However, if some other target
> generates really bad code for the masking operations, then I'm happy
> enough to change the comment and use
> #define SEM_WAIT 1
> #define SEM_INC 2

Agreed we can wait for more data for this.  I think readability of the file 
with these two names is significantly better already.

> +static inline bool
> +likely_compare_exchange (int *ptr, int *expected, int val, bool weak,
> +  enum memmodel succ, enum memmodel fail)
>  {
> +  return __builtin_expect (__atomic_compare_exchange_n (ptr, expected, val,
> + weak, succ, fail), 1);
>  }

Please move this to libgomp.h just below the memmodel definition, as a macro so 
that it's not tied to int.  It seems likely that we'll want this elsewhere in 
libgomp as we convert things.

> +gomp_sem_wait (gomp_sem_t *sem)
>  {
> +  int count = *sem;
> +
> +  while ((count & ~SEM_WAIT) != 0)
> +if (likely_compare_exchange (sem, &count, count - SEM_INC,
> +  false, MEMMODEL_ACQUIRE, MEMMODEL_RELAXED))
> +  return;

Tight loops like this should use the weak compare-exchange so that we don't get 
two nested loops without need.
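
Concretely, with the weak variant the spurious-failure retry folds into
the existing loop; a sketch of the wait fast path written that way
(illustrative only, not the committed text):

  int count = *sem;

  while ((count & ~SEM_WAIT) != 0)
    /* A weak compare-exchange may fail spuriously; on failure the
       builtin writes the observed value back into count and the
       while condition simply retries.  */
    if (likely_compare_exchange (sem, &count, count - SEM_INC,
                                 true, MEMMODEL_ACQUIRE, MEMMODEL_RELAXED))
      return;
  gomp_sem_wait_slow (sem, count);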

> +gomp_sem_post (gomp_sem_t *sem)
> +{
> +  int count = *sem;
> +
> +  while (1)
> +if (likely_compare_exchange (sem, &count, ((count + SEM_INC) & ~SEM_WAIT),
> +  false, MEMMODEL_RELEASE, MEMMODEL_RELAXED))
> +  break;

Likewise.

And this morning I finally follow the logic.  I agree it's correct.  I do think 
we need more commentary here so that next month I can still follow it.

> +gomp_sem_wait_slow (gomp_sem_t *sem, int count)
>  {
> +  /* First loop spins a while.  */
> +  while (count == 0)
> +if (do_spin (sem, 0)
> + /* Spin timeout, nothing changed.  Set waiting flag.  */
> + && likely_compare_exchange (sem, &count, SEM_WAIT,
> + false, MEMMODEL_ACQUIRE, MEMMODEL_RELAXED))
> +  {

For the record, here I agree we should use the strong CAS, because the CAS is 
not the only thing in the loop.  We don't want a syscall just because the 
store-conditional failed.

Ok with those changes.


r~


Re: Fix libgomp semaphores

2011-11-29 Thread Alan Modra
Since there were a number of changes requested, and some confusion
over what was happening in this code, it might be better if I post
a new patch and we start over.  In this patch I'm still using the MSB
for the wait flag, and count in the low 31 bits, as that way is
slightly more efficient on powerpc.  However, if some other target
generates really bad code for the masking operations, then I'm happy
enough to change the comment and use
#define SEM_WAIT 1
#define SEM_INC 2

Tested both ways on powerpc-linux.

PR libgomp/51249
* config/linux/sem.h: Rewrite.
* config/linux/sem.c: Rewrite.

Index: libgomp/config/linux/sem.h
===
--- libgomp/config/linux/sem.h  (revision 181770)
+++ libgomp/config/linux/sem.h  (working copy)
@@ -1,4 +1,4 @@
-/* Copyright (C) 2005, 2009 Free Software Foundation, Inc.
+/* Copyright (C) 2005, 2009, 2011 Free Software Foundation, Inc.
Contributed by Richard Henderson .
 
This file is part of the GNU OpenMP Library (libgomp).
@@ -24,34 +24,65 @@
 
 /* This is a Linux specific implementation of a semaphore synchronization
mechanism for libgomp.  This type is private to the library.  This 
-   implementation uses atomic instructions and the futex syscall.  */
+   counting semaphore implementation uses atomic instructions and the
+   futex syscall, and a single 32-bit int to store semaphore state.
+   The low 31 bits are the count, the top bit is a flag set when some
+   threads may be waiting.  */
 
 #ifndef GOMP_SEM_H
 #define GOMP_SEM_H 1
 
+#include <limits.h>  /* For INT_MIN */
+
 typedef int gomp_sem_t;
+#define SEM_WAIT INT_MIN
+#define SEM_INC 1
+
+extern void gomp_sem_wait_slow (gomp_sem_t *, int);
+extern void gomp_sem_post_slow (gomp_sem_t *);
 
-static inline void gomp_sem_init (gomp_sem_t *sem, int value)
+static inline void
+gomp_sem_init (gomp_sem_t *sem, int value)
 {
-  *sem = value;
+  *sem = value * SEM_INC;
 }
 
-extern void gomp_sem_wait_slow (gomp_sem_t *);
-static inline void gomp_sem_wait (gomp_sem_t *sem)
+static inline void
+gomp_sem_destroy (gomp_sem_t *sem)
 {
-  if (!__sync_bool_compare_and_swap (sem, 1, 0))
-gomp_sem_wait_slow (sem);
 }
 
-extern void gomp_sem_post_slow (gomp_sem_t *);
-static inline void gomp_sem_post (gomp_sem_t *sem)
+static inline bool
+likely_compare_exchange (int *ptr, int *expected, int val, bool weak,
+enum memmodel succ, enum memmodel fail)
 {
-  if (!__sync_bool_compare_and_swap (sem, 0, 1))
-gomp_sem_post_slow (sem);
+  return __builtin_expect (__atomic_compare_exchange_n (ptr, expected, val,
+   weak, succ, fail), 1);
 }
 
-static inline void gomp_sem_destroy (gomp_sem_t *sem)
+static inline void
+gomp_sem_wait (gomp_sem_t *sem)
 {
+  int count = *sem;
+
+  while ((count & ~SEM_WAIT) != 0)
+if (likely_compare_exchange (sem, &count, count - SEM_INC,
+false, MEMMODEL_ACQUIRE, MEMMODEL_RELAXED))
+  return;
+  gomp_sem_wait_slow (sem, count);
 }
 
+static inline void
+gomp_sem_post (gomp_sem_t *sem)
+{
+  int count = *sem;
+
+  while (1)
+if (likely_compare_exchange (sem, &count, ((count + SEM_INC) & ~SEM_WAIT),
+false, MEMMODEL_RELEASE, MEMMODEL_RELAXED))
+  break;
+
+  if (__builtin_expect (count & SEM_WAIT, 0))
+gomp_sem_post_slow (sem);
+}
 #endif /* GOMP_SEM_H */
Index: libgomp/config/linux/sem.c
===
--- libgomp/config/linux/sem.c  (revision 181770)
+++ libgomp/config/linux/sem.c  (working copy)
@@ -1,4 +1,4 @@
-/* Copyright (C) 2005, 2008, 2009 Free Software Foundation, Inc.
+/* Copyright (C) 2005, 2008, 2009, 2011 Free Software Foundation, Inc.
Contributed by Richard Henderson .
 
This file is part of the GNU OpenMP Library (libgomp).
@@ -28,34 +28,56 @@
 
 #include "wait.h"
 
-
 void
-gomp_sem_wait_slow (gomp_sem_t *sem)
+gomp_sem_wait_slow (gomp_sem_t *sem, int count)
 {
+  /* First loop spins a while.  */
+  while (count == 0)
+if (do_spin (sem, 0)
+   /* Spin timeout, nothing changed.  Set waiting flag.  */
+   && likely_compare_exchange (sem, &count, SEM_WAIT,
+   false, MEMMODEL_ACQUIRE, MEMMODEL_RELAXED))
+  {
+   futex_wait (sem, SEM_WAIT);
+   count = *sem;
+   break;
+  }
+  /* Something changed.  If it wasn't the wait flag, we're good to go.  */
+else if (__builtin_expect (((count = *sem) & SEM_WAIT) == 0 && count != 0,
+  1))
+  {
+   if (likely_compare_exchange (sem, &count, count - SEM_INC,
+false, MEMMODEL_ACQUIRE, MEMMODEL_RELAXED))
+ return;
+  }
+
+  /* Second loop waits until semaphore is posted.  We always exit this
+ loop with wait flag set, so next post will awaken a thread.  */
   while (1)
 {
-  int val = __sync_val_c

Re: Fix libgomp semaphores

2011-11-28 Thread Alan Modra
On Mon, Nov 28, 2011 at 04:00:10PM -0800, Richard Henderson wrote:
> On 11/27/2011 02:53 PM, Alan Modra wrote:
> > +enum memmodel
> > +{
> > +  MEMMODEL_RELAXED = 0,
> > +  MEMMODEL_CONSUME = 1,
> > +  MEMMODEL_ACQUIRE = 2,
> > +  MEMMODEL_RELEASE = 3,
> > +  MEMMODEL_ACQ_REL = 4,
> > +  MEMMODEL_SEQ_CST = 5,
> > +  MEMMODEL_LAST = 6
> > +};
> 
> This should probably go to libgomp.h.

I wondered about that.

> >  /* This is a Linux specific implementation of a semaphore synchronization
> > mechanism for libgomp.  This type is private to the library.  This 
> > +   counting semaphore implementation uses atomic instructions and the
> > +   futex syscall, and a single 32-bit int to store semaphore state.
> > +   The low 31 bits are the count, the top bit is a flag set when some
> > +   threads may be waiting.  */
> 
> I think we'll get better code generation on a number of targets if we make 
> the low bit the waiting bit, and the higher bits the count.  That way we do 
> (count & 1) and (count + 2) instead of larger constants needing to be 
> generated.  Not changed below...

Funny, that's the way I wrote this code at first, then went for the
wait bit as the sign bit because you can test > 0 in one place where
you want "not waiting and count non-zero".

> > +static inline void
> > +gomp_sem_wait (gomp_sem_t *sem)
> >  {
> > +  int count = *sem;
> > +
> > +  while ((count & 0x7fffffff) != 0)
> > +{
> > +  int oldval = count;
> > +  __atomic_compare_exchange_4 (sem, &oldval, count - 1,
> > +  false, MEMMODEL_ACQUIRE, MEMMODEL_RELAXED);
> > +  if (__builtin_expect (oldval == count, 1))
> > +   return;
> > +  count = oldval;
> > +}
> 
> I'd really prefer not to hard-code any sizes at all.  Let's change the 
> explicit _4 to _n everywhere.

OK.

> The loop above doesn't actually make sense to me.  If the compare-and-swap 
> succeeds, then oldval == count - 1.  Which doesn't seem to be what you're 
> testing at all.

Huh?  If it succeeds, oldval == count and we return.

> Perhaps this entire function is better written as
> 
> {
>   int count = *sem;
>   do
> {
>   if (__builtin_expect (count & 0x80000000u, 0))
> {
>   gomp_sem_wait_slow (sem, count);
>   return;
> }
> }
>   while (!__atomic_compare_exchange_n (sem, &count, count - 1, true,
>  MEMMODEL_ACQUIRE, MEMMODEL_RELAXED));
> }

No, we don't need the slow path if we have *sem & 0x7fffffff non-zero.

> 
> > +gomp_sem_post (gomp_sem_t *sem)
> >  {
> > +  int count = *sem;
> > +
> > +  while (1)
> > +{
> > +  int oldval = count;
> > +  __atomic_compare_exchange_4 (sem, &oldval, (count + 1) & 0x7fffffff,
> > +  false, MEMMODEL_RELEASE, MEMMODEL_RELAXED);
> > +  if (__builtin_expect (oldval == count, 1))
> > +   break;
> 
> Again, this equality doesn't make sense.  Further, you aren't testing for the
> high bit set inside the loop, nor are you preserving the high bit.

See above about the equality.  We don't want to preserve the wait bit
here.  Otherwise we'd never go back to the uncontended state, which
answers

> ... All that said, I don't see how we can ever clear the wait bit
> once it's set?

this question.

> Are we better off splitting the 32-bit value into two 16-bit fields for 
> value+waiters?  Or similarly splitting a 64-bit value?  That way we can at 
> least update both fields atomically, and we ought never lose a waiter.

Other splits of the field are certainly possible, but of course
restrict the max number, and you'd need the fancy futex op_wait.
Some targets don't have 64-bit atomics, so we can't really go that
way.  Yes, if I did keep track of number of waiters we'd save one
unnecessary futex_wake call, but I'm quite confident no waiters will
be lost just using a flag.

-- 
Alan Modra
Australia Development Lab, IBM


Re: Fix libgomp semaphores

2011-11-28 Thread Richard Henderson
On 11/28/2011 03:05 PM, Richard Henderson wrote:
> On 11/28/2011 02:16 PM, Alan Modra wrote:
>> Hmm, I suppose you could argue that powerpc and others ought to not
>> generate those three extra instructions when using the return value.
>> I'll see about fixing powerpc.
> 
> However, we can do better by considering the value to be stored in CR0...

Try this and see if it generates the sort of code you want.  Untested.


r~

diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index f01353b..d3b990f 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -17352,11 +17352,11 @@ rs6000_expand_atomic_compare_and_swap (rtx operands[])
   retval = gen_reg_rtx (SImode);
   mode = SImode;
 }
+  else if (reg_overlap_mentioned_p (retval, oldval))
+oldval = copy_to_reg (oldval);
 
   rs6000_pre_atomic_barrier (mod_s);
 
-  emit_move_insn (boolval, const0_rtx);
-
   label1 = NULL_RTX;
   if (!is_weak)
 {
@@ -17374,28 +17374,23 @@ rs6000_expand_atomic_compare_and_swap (rtx operands[])
   NULL_RTX, 1, OPTAB_LIB_WIDEN);
 }
 
-  x = gen_rtx_NE (VOIDmode, x, oldval);
-  x = rs6000_generate_compare (x, mode);
+  cond = gen_reg_rtx (CCmode);
+  x = gen_rtx_COMPARE (CCmode, x, oldval);
+  emit_insn (gen_rtx_SET (VOIDmode, cond, x));
+
+  x = gen_rtx_NE (VOIDmode, cond, const0_rtx);
   emit_unlikely_jump (x, label2);
 
   x = newval;
   if (mask)
 x = rs6000_mask_atomic_subword (retval, newval, mask);
 
-  cond = gen_reg_rtx (CCmode);
   emit_store_conditional (mode, cond, mem, x);
 
-  if (is_weak)
-{
-  /* ??? It's either this or an unlikely jump over (set bool 1).  */
-  x = gen_rtx_EQ (SImode, cond, const0_rtx);
-  emit_insn (gen_rtx_SET (VOIDmode, boolval, x));
-}
-  else
+  if (!is_weak)
 {
   x = gen_rtx_NE (VOIDmode, cond, const0_rtx);
   emit_unlikely_jump (x, label1);
-  emit_move_insn (boolval, const1_rtx);
 }
 
   if (mod_f != MEMMODEL_RELAXED)
@@ -17408,6 +17403,18 @@ rs6000_expand_atomic_compare_and_swap (rtx operands[])
 
   if (shift)
 rs6000_finish_atomic_subword (operands[1], retval, shift);
+
+  if (is_weak)
+{
+  x = gen_rtx_EQ (SImode, cond, const0_rtx);
+  emit_insn (gen_rtx_SET (VOIDmode, boolval, x));
+}
+  else
+{
+  x = emit_store_flag_force (boolval, EQ, retval, oldval, mode, 1, 1);
+  if (x != boolval)
+   emit_move_insn (boolval, x);
+}
 }
 
 /* Expand an atomic exchange operation.  */


Re: Fix libgomp semaphores

2011-11-28 Thread Richard Henderson
On 11/27/2011 02:53 PM, Alan Modra wrote:
> +enum memmodel
> +{
> +  MEMMODEL_RELAXED = 0,
> +  MEMMODEL_CONSUME = 1,
> +  MEMMODEL_ACQUIRE = 2,
> +  MEMMODEL_RELEASE = 3,
> +  MEMMODEL_ACQ_REL = 4,
> +  MEMMODEL_SEQ_CST = 5,
> +  MEMMODEL_LAST = 6
> +};

This should probably go to libgomp.h.

>  /* This is a Linux specific implementation of a semaphore synchronization
> mechanism for libgomp.  This type is private to the library.  This 
> +   counting semaphore implementation uses atomic instructions and the
> +   futex syscall, and a single 32-bit int to store semaphore state.
> +   The low 31 bits are the count, the top bit is a flag set when some
> +   threads may be waiting.  */

I think we'll get better code generation on a number of targets if we make the 
low bit the waiting bit, and the higher bits the count.  That way we do (count & 
1) and (count + 2) instead of larger constants needing to be generated.  Not 
changed below...

> +static inline void
> +gomp_sem_wait (gomp_sem_t *sem)
>  {
> +  int count = *sem;
> +
> +  while ((count & 0x7fffffff) != 0)
> +{
> +  int oldval = count;
> +  __atomic_compare_exchange_4 (sem, &oldval, count - 1,
> +false, MEMMODEL_ACQUIRE, MEMMODEL_RELAXED);
> +  if (__builtin_expect (oldval == count, 1))
> + return;
> +  count = oldval;
> +}

I'd really prefer not to hard-code any sizes at all.  Let's change the explicit 
_4 to _n everywhere.

The loop above doesn't actually make sense to me.  If the compare-and-swap 
succeeds, then oldval == count - 1.  Which doesn't seem to be what you're 
testing at all.

Perhaps this entire function is better written as

{
  int count = *sem;
  do
{
  if (__builtin_expect (count & 0x80000000u, 0))
{
  gomp_sem_wait_slow (sem, count);
  return;
}
}
  while (!__atomic_compare_exchange_n (sem, &count, count - 1, true,
   MEMMODEL_ACQUIRE, MEMMODEL_RELAXED));
}

> +gomp_sem_post (gomp_sem_t *sem)
>  {
> +  int count = *sem;
> +
> +  while (1)
> +{
> +  int oldval = count;
> +  __atomic_compare_exchange_4 (sem, &oldval, (count + 1) & 0x7fffffff,
> +false, MEMMODEL_RELEASE, MEMMODEL_RELAXED);
> +  if (__builtin_expect (oldval == count, 1))
> + break;

Again, this equality doesn't make sense.  Further, you aren't testing for the
high bit set inside the loop, nor are you preserving the high bit.

Perhaps better written as

{
  int count = *sem;
  do
{
  if (__builtin_expect (count & 0x80000000u, 0))
{
  gomp_sem_post_slow (sem, count);
  return;
}
  /* We can't post more than 2**31-1 times.  */
  assert (count < 0x7fffffff);
}
  while (!__atomic_compare_exchange_n (sem, &count, count + 1, true,
   MEMMODEL_RELEASE, MEMMODEL_RELAXED));
}

... All that said, I don't see how we can ever clear the wait bit once it's set?
 Are we better off splitting the 32-bit value into two 16-bit fields for 
value+waiters?  Or similarly splitting a 64-bit value?  That way we can at 
least update both fields atomically, and we ought never lose a waiter.


r~


Re: Fix libgomp semaphores

2011-11-28 Thread Richard Henderson
On 11/28/2011 02:16 PM, Alan Modra wrote:
> Hmm, I suppose you could argue that powerpc and others ought to not
> generate those three extra instructions when using the return value.
> I'll see about fixing powerpc.

True.  For weak, the value *should* always be used (otherwise the user program 
is broken).

However, we can do better by considering the value to be stored in CR0.  We 
perform the comparison in the loop, which sets CR0 == EQ (true) or NE (false); 
CR0 is set again by the STWCX insn in the same fashion.  So the

  /* ??? It's either this or an unlikely jump over (set bool 1).  */
  x = gen_rtx_EQ (SImode, cond, const0_rtx);
  emit_insn (gen_rtx_SET (VOIDmode, boolval, x));

code currently emitted by rs6000_expand_atomic_compare_and_swap is sufficient, 
and merely needs to be adjusted so that it is computed after label2, and other 
settings to boolval removed.

This may be sufficient for better strong compare-and-swap as well.


r~


Re: Fix libgomp semaphores

2011-11-28 Thread Alan Modra
On Mon, Nov 28, 2011 at 05:23:37PM +0100, Jakub Jelinek wrote:
> On Mon, Nov 28, 2011 at 09:23:43AM +1030, Alan Modra wrote:
> > +  int count = *sem;
> > +
> > +  while ((count & 0x7fffffff) != 0)
> > +{
> > +  int oldval = count;
> > +  __atomic_compare_exchange_4 (sem, &oldval, count - 1,
> > +  false, MEMMODEL_ACQUIRE, MEMMODEL_RELAXED);
> > +  if (__builtin_expect (oldval == count, 1))
> > +   return;
> 
> Why aren't you instead testing the return value of
> __atomic_compare_exchange_4 (here and in other cases)?

If you use the return value on powerpc, you find that it requires two
load-immediate instructions (loading zero and one) plus a compare
against zero.  Written as it is, the code is three instructions shorter,
because the oldval == count comparison has already been done inside the
atomic sequence.  I'd expect fewer instructions on most other
architectures too, unless they happen to have a compare-and-exchange
instruction that sets condition bits (i.e. Intel).  Even on Intel the
way I've written the code shouldn't take more instructions with a
properly written cmpxchg rtl description.  Does it?
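
For reference, the two forms being compared (a side-by-side sketch,
not sequential code):

  int oldval = count;

  /* (a) Using the builtin's boolean result: the target has to
     materialize a 0/1 value and then test it again.  */
  if (__atomic_compare_exchange_4 (sem, &oldval, count - 1, false,
                                   MEMMODEL_ACQUIRE, MEMMODEL_RELAXED))
    return;

  /* (b) Reusing the comparison already done inside the ll/sc sequence:
     on failure the builtin writes the observed value back to oldval,
     so oldval == count holds exactly when the exchange succeeded.  */
  oldval = count;
  __atomic_compare_exchange_4 (sem, &oldval, count - 1, false,
                               MEMMODEL_ACQUIRE, MEMMODEL_RELAXED);
  if (__builtin_expect (oldval == count, 1))
    return;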

Hmm, I suppose you could argue that powerpc and others ought to not
generate those three extra instructions when using the return value.
I'll see about fixing powerpc.

-- 
Alan Modra
Australia Development Lab, IBM


Re: Fix libgomp semaphores

2011-11-28 Thread Jakub Jelinek
On Mon, Nov 28, 2011 at 09:23:43AM +1030, Alan Modra wrote:
> +  int count = *sem;
> +
> +  while ((count & 0x7fffffff) != 0)
> +{
> +  int oldval = count;
> +  __atomic_compare_exchange_4 (sem, &oldval, count - 1,
> +false, MEMMODEL_ACQUIRE, MEMMODEL_RELAXED);
> +  if (__builtin_expect (oldval == count, 1))
> + return;

Why aren't you instead testing the return value of
__atomic_compare_exchange_4 (here and in other cases)?


Jakub


Re: Fix libgomp semaphores

2011-11-27 Thread Alan Modra
On Fri, Nov 25, 2011 at 08:38:39AM +0100, Jakub Jelinek wrote:
> Furthermore, I'd prefer if the patch could be split into smaller parts,
> e.g. for bisecting purposes.  One patch would do the mutex changes
> to use new atomics, remove extra mutex.h headers and start using 0/1/-1
> instead of 0/1/2.  And another patch would rewrite the semaphores.

As requested, this is the semaphore rewrite.  The code is modelled
on the existing mutex implementation, using a single int to hold 31
bits of count info and a wait flag bit.  Like the mutexes, these
semaphores have the nice property that no syscalls are made unless
threads are waiting on the semaphore, and if we reach the contended
case, there is an orderly transition back to uncontended.  Also, I
believe this implementation uses the bare minimum of synchronization
instructions.

Unlike the old semaphore implementation, I do not try to wake multiple
threads in the sem_post_slow futex_wake call.  That's because it is
necessary to wake further threads if possible on exiting from
sem_wait_slow, and given that, waking multiple
threads just leads to unnecessary futex_wake syscalls.  You can see
why the wake in sem_wait_slow is necessary by considering a semaphore
set up to allow two threads to work on some job.  If there were four
threads trying to get the semaphore, two would succeed, A and B, and
two, C and D, end up waiting in gomp_sem_wait_slow.  When A finishes
its job, it will call gomp_sem_post to increment the semaphore, see
and clear the wait flag, and call futex_wake on one thread.  If B also
finishes while this is happening, but before C has woken, then B's
call to gomp_sem_post may not see the wait flag set, leaving D asleep.

Incidentally, the regression I mentioned in
http://gcc.gnu.org/ml/gcc-patches/2011-11/msg02393.html wasn't real.
(I'd been testing for failures with a script that repeatedly hammered
selected libgomp tests with
while LD_LIBRARY_PATH=blahblah ./sometest; do :; done
because a single gcc testsuite run often doesn't catch locking
failures.  The regression was due to a typo in LD_LIBRARY_PATH.)

Bootstrapped and regression tested powerpc-linux.  OK to apply?

PR libgomp/51249
* config/linux/sem.h: Rewrite.
* config/linux/sem.c: Rewrite.

Index: libgomp/config/linux/sem.h
===
--- libgomp/config/linux/sem.h  (revision 181718)
+++ libgomp/config/linux/sem.h  (working copy)
@@ -24,34 +24,74 @@
 
 /* This is a Linux specific implementation of a semaphore synchronization
mechanism for libgomp.  This type is private to the library.  This 
-   implementation uses atomic instructions and the futex syscall.  */
+   counting semaphore implementation uses atomic instructions and the
+   futex syscall, and a single 32-bit int to store semaphore state.
+   The low 31 bits are the count, the top bit is a flag set when some
+   threads may be waiting.  */
 
 #ifndef GOMP_SEM_H
 #define GOMP_SEM_H 1
 
 typedef int gomp_sem_t;
 
-static inline void gomp_sem_init (gomp_sem_t *sem, int value)
+enum memmodel
+{
+  MEMMODEL_RELAXED = 0,
+  MEMMODEL_CONSUME = 1,
+  MEMMODEL_ACQUIRE = 2,
+  MEMMODEL_RELEASE = 3,
+  MEMMODEL_ACQ_REL = 4,
+  MEMMODEL_SEQ_CST = 5,
+  MEMMODEL_LAST = 6
+};
+
+extern void gomp_sem_wait_slow (gomp_sem_t *, int);
+extern void gomp_sem_post_slow (gomp_sem_t *);
+
+static inline void
+gomp_sem_init (gomp_sem_t *sem, int value)
 {
   *sem = value;
 }
 
-extern void gomp_sem_wait_slow (gomp_sem_t *);
-static inline void gomp_sem_wait (gomp_sem_t *sem)
+static inline void
+gomp_sem_destroy (gomp_sem_t *sem)
 {
-  if (!__sync_bool_compare_and_swap (sem, 1, 0))
-gomp_sem_wait_slow (sem);
 }
 
-extern void gomp_sem_post_slow (gomp_sem_t *);
-static inline void gomp_sem_post (gomp_sem_t *sem)
+static inline void
+gomp_sem_wait (gomp_sem_t *sem)
 {
-  if (!__sync_bool_compare_and_swap (sem, 0, 1))
-gomp_sem_post_slow (sem);
+  int count = *sem;
+
+  while ((count & 0x7fffffff) != 0)
+{
+  int oldval = count;
+  __atomic_compare_exchange_4 (sem, &oldval, count - 1,
+  false, MEMMODEL_ACQUIRE, MEMMODEL_RELAXED);
+  if (__builtin_expect (oldval == count, 1))
+   return;
+  count = oldval;
+}
+  gomp_sem_wait_slow (sem, count);
 }
 
-static inline void gomp_sem_destroy (gomp_sem_t *sem)
+static inline void
+gomp_sem_post (gomp_sem_t *sem)
 {
-}
+  int count = *sem;
+
+  while (1)
+{
+  int oldval = count;
+  __atomic_compare_exchange_4 (sem, &oldval, (count + 1) & 0x7fffffff,
+  false, MEMMODEL_RELEASE, MEMMODEL_RELAXED);
+  if (__builtin_expect (oldval == count, 1))
+   break;
+  count = oldval;
+}
 
+  if (__builtin_expect (count & 0x80000000, 0))
+gomp_sem_post_slow (sem);
+}
 #endif /* GOMP_SEM_H */
Index: libgomp/config/linux/sem.c
===
--- 

Re: Fix libgomp semaphores

2011-11-25 Thread Alan Modra
On Fri, Nov 25, 2011 at 08:38:39AM +0100, Jakub Jelinek wrote:
> My preference would be to avoid the abstraction changes though, both
> because it is additional clutter in the changeset and because omp_lock
> and nested lock are part of public ABIs, so if struct is layed out
> differently on some weird architecture, it would be an ABI change.

OK, fair enough.  I didn't consider that structs may be laid out
differently.

> So, if you could keep gomp_mutex_t, omp_lock_t and gomp_sem_t as integers,
> it would be appreciated.
> 
> Furthermore, I'd prefer if the patch could be split into smaller parts,
> e.g. for bisecting purposes.  One patch would do the mutex changes
> to use new atomics, remove extra mutex.h headers and start using 0/1/-1
> instead of 0/1/2.  And another patch would rewrite the semaphores.

OK.  I need to do this anyway as I just discovered a regression when
looping on one of the tests.  I suspect the acquire/release mutex
locking may have exposed bugs elsewhere in libgomp that were covered
by the heavyweight locking used by the __sync builtins.

-- 
Alan Modra
Australia Development Lab, IBM


Re: Fix libgomp semaphores

2011-11-24 Thread Jakub Jelinek
On Fri, Nov 25, 2011 at 10:33:28AM +1030, Alan Modra wrote:
> This fixes PR51249, a failure caused by insufficient care in waking
> threads on sem_post.  It's quite a tricky business to get right, but I
> believe this rewrite is now correct.  I've also converted over lock.c,
> mutex.h and mutex.c to use the new atomic builtins.  This means no
> target should need target-specific versions of mutex.h.  mutex-lock.h
> came about because at one stage I was trying out semaphore and lock
> implementations that used a lock/count int and an nwaiters int to
> track number of threads waiting.  That turned out to be a really bad
> idea.  You need barriers all over the place to ensure you don't see an
> inconsistent lock/semaphore state, and extra barriers are costly.
> It's much better to pay the price of an extra futex_wake system call
> when transitioning from contended to uncontended.
> Anyway, I left the lock abstraction as a good thing.  I also changed
> the mutex implementation to use -1 to mean "locked and (possibly) some
> thread waiting on the lock" rather than using 2.  The very minor
> benefit is that some targets may use one less instruction testing < 0
> compared to testing > 1.  It's also just that little bit more like the
> semaphore implementation.
> 
> Bootstrapped and regression tested powerpc64-linux.  I do still see
> the occasional testsuite failures on power7 (see pr51298), but the
> semaphore failures are gone.

Thanks for working on it.
My preference would be to avoid the abstraction changes though, both
because it is additional clutter in the changeset and because omp_lock
and nested lock are part of public ABIs, so if the struct is laid out
differently on some weird architecture, it would be an ABI change.
So, if you could keep gomp_mutex_t, omp_lock_t and gomp_sem_t as integers,
it would be appreciated.

Furthermore, I'd prefer if the patch could be split into smaller parts,
e.g. for bisecting purposes.  One patch would do the mutex changes
to use new atomics, remove extra mutex.h headers and start using 0/1/-1
instead of 0/1/2.  And another patch would rewrite the semaphores.

Jakub