Re: Fix PR51298, libgomp barrier failure
On Tue, Nov 29, 2011 at 04:15:36PM +1030, Alan Modra wrote:
> On Mon, Nov 28, 2011 at 05:42:15PM -0800, Richard Henderson wrote:
> > On 11/28/2011 06:02 AM, Alan Modra wrote:
> > > -  unsigned int ret = bar->generation & ~3;
> > > -  /* Do we need any barrier here or is __sync_add_and_fetch acting
> > > -     as the needed LoadLoad barrier already?  */
> > > -  ret += __sync_add_and_fetch (&bar->awaited, -1) == 0;
> > > +  unsigned int ret = __atomic_load_4 (&bar->generation, MEMMODEL_ACQUIRE) & ~3;
> > > +  ret += __atomic_add_fetch (&bar->awaited, -1, MEMMODEL_ACQ_REL) == 0;
> >
> > Given that the read from bar->generation is ACQ, we don't need a duplicate
> > barrier from the REL on the atomic add.  I believe both can be
> > MEMMODEL_ACQUIRE both in order to force the ordering of these two memops,
> > as well as force these to happen before anything subsequent.
>
> I tried with MEMMODEL_ACQUIRE and ran force-parallel-6.exe, the test
> that seems most sensitive to barrier problems, many times, and it hangs
> occasionally in futex_wait called via gomp_team_barrier_wait_end.
>
> I believe that threads can't be allowed to exit from
> gomp_{,team_}barrier_wait without hitting a release barrier, and
> perhaps from gomp_barrier_wait_last too.  gomp_barrier_wait_start is a
> convenient point to insert the barrier, and a minimal change from the
> old code using __sync_add_and_fetch.  I can add a comment.  ;-)

Committed rev 181833.

--
Alan Modra
Australia Development Lab, IBM
Re: Fix PR51298, libgomp barrier failure
On Mon, Nov 28, 2011 at 05:42:15PM -0800, Richard Henderson wrote:
> On 11/28/2011 06:02 AM, Alan Modra wrote:
> > -  unsigned int ret = bar->generation & ~3;
> > -  /* Do we need any barrier here or is __sync_add_and_fetch acting
> > -     as the needed LoadLoad barrier already?  */
> > -  ret += __sync_add_and_fetch (&bar->awaited, -1) == 0;
> > +  unsigned int ret = __atomic_load_4 (&bar->generation, MEMMODEL_ACQUIRE) & ~3;
> > +  ret += __atomic_add_fetch (&bar->awaited, -1, MEMMODEL_ACQ_REL) == 0;
>
> Given that the read from bar->generation is ACQ, we don't need a duplicate
> barrier from the REL on the atomic add.  I believe both can be
> MEMMODEL_ACQUIRE both in order to force the ordering of these two memops,
> as well as force these to happen before anything subsequent.

I tried with MEMMODEL_ACQUIRE and ran force-parallel-6.exe, the test
that seems most sensitive to barrier problems, many times, and it hangs
occasionally in futex_wait called via gomp_team_barrier_wait_end.

I believe that threads can't be allowed to exit from
gomp_{,team_}barrier_wait without hitting a release barrier, and
perhaps from gomp_barrier_wait_last too.  gomp_barrier_wait_start is a
convenient point to insert the barrier, and a minimal change from the
old code using __sync_add_and_fetch.  I can add a comment.  ;-)

--
Alan Modra
Australia Development Lab, IBM
Re: Fix PR51298, libgomp barrier failure
On 11/28/2011 06:02 AM, Alan Modra wrote:
> -  unsigned int ret = bar->generation & ~3;
> -  /* Do we need any barrier here or is __sync_add_and_fetch acting
> -     as the needed LoadLoad barrier already?  */
> -  ret += __sync_add_and_fetch (&bar->awaited, -1) == 0;
> +  unsigned int ret = __atomic_load_4 (&bar->generation, MEMMODEL_ACQUIRE) & ~3;
> +  ret += __atomic_add_fetch (&bar->awaited, -1, MEMMODEL_ACQ_REL) == 0;

Given that the read from bar->generation is ACQ, we don't need a duplicate
barrier from the REL on the atomic add.  I believe both can be
MEMMODEL_ACQUIRE both in order to force the ordering of these two memops,
as well as force these to happen before anything subsequent.

The s/_4/_n/ change needs doing.  Otherwise ok.


r~
Re: Fix PR51298, libgomp barrier failure
On Mon, Nov 28, 2011 at 08:09:01PM +1030, Alan Modra wrote:
> On Mon, Nov 28, 2011 at 09:13:27AM +0100, Jakub Jelinek wrote:
> > On Mon, Nov 28, 2011 at 10:12:09AM +1030, Alan Modra wrote:
> > > --- libgomp/config/linux/bar.c	(revision 181718)
> > > +++ libgomp/config/linux/bar.c	(working copy)
> > > @@ -36,18 +36,15 @@ gomp_barrier_wait_end (gomp_barrier_t *b
> > >    if (__builtin_expect ((state & 1) != 0, 0))
> > >      {
> > >        /* Next time we'll be awaiting TOTAL threads again.  */
> > > -      bar->awaited = bar->total;
> > > -      atomic_write_barrier ();
> > > -      bar->generation += 4;
> > > +      __atomic_store_4 (&bar->awaited, bar->total, MEMMODEL_RELEASE);
> > > +      __atomic_add_fetch (&bar->generation, 4, MEMMODEL_ACQ_REL);
> > >        futex_wake ((int *) &bar->generation, INT_MAX);
> > >      }
> >
> > The above is unfortunate, bar->generation should be only modified from a
> > single thread at this point, but the __atomic_add_fetch will enforce there
> > an atomic instruction on it rather than just some kind of barrier.
>
> I will try without the atomic add and see what happens.

Seems to be fine.  I took out the extra write barriers too, so we now
just have two MEMMODEL_RELEASE atomic store barriers replacing the two
old atomic_write_barriers, and a number of new acquire barriers.

	PR libgomp/51298
	* config/linux/bar.h: Use atomic rather than sync builtins.
	* config/linux/bar.c: Likewise.  Add missing acquire
	synchronisation on generation field.
	* task.c (gomp_barrier_handle_tasks): Regain lock so as to not
	double unlock.
Index: libgomp/task.c
===================================================================
--- libgomp/task.c	(revision 181718)
+++ libgomp/task.c	(working copy)
@@ -273,6 +273,7 @@ gomp_barrier_handle_tasks (gomp_barrier_
 	      gomp_team_barrier_done (&team->barrier, state);
 	      gomp_mutex_unlock (&team->task_lock);
 	      gomp_team_barrier_wake (&team->barrier, 0);
+	      gomp_mutex_lock (&team->task_lock);
 	    }
 	}
     }
Index: libgomp/config/linux/bar.h
===================================================================
--- libgomp/config/linux/bar.h	(revision 181770)
+++ libgomp/config/linux/bar.h	(working copy)
@@ -50,7 +50,7 @@ static inline void gomp_barrier_init (go
 static inline void
 gomp_barrier_reinit (gomp_barrier_t *bar, unsigned count)
 {
-  __sync_fetch_and_add (&bar->awaited, count - bar->total);
+  __atomic_add_fetch (&bar->awaited, count - bar->total, MEMMODEL_ACQ_REL);
   bar->total = count;
 }
 
@@ -69,10 +69,8 @@ extern void gomp_team_barrier_wake (gomp
 static inline gomp_barrier_state_t
 gomp_barrier_wait_start (gomp_barrier_t *bar)
 {
-  unsigned int ret = bar->generation & ~3;
-  /* Do we need any barrier here or is __sync_add_and_fetch acting
-     as the needed LoadLoad barrier already?  */
-  ret += __sync_add_and_fetch (&bar->awaited, -1) == 0;
+  unsigned int ret = __atomic_load_4 (&bar->generation, MEMMODEL_ACQUIRE) & ~3;
+  ret += __atomic_add_fetch (&bar->awaited, -1, MEMMODEL_ACQ_REL) == 0;
   return ret;
 }
Index: libgomp/config/linux/bar.c
===================================================================
--- libgomp/config/linux/bar.c	(revision 181770)
+++ libgomp/config/linux/bar.c	(working copy)
@@ -37,17 +37,15 @@ gomp_barrier_wait_end (gomp_barrier_t *b
     {
       /* Next time we'll be awaiting TOTAL threads again.  */
       bar->awaited = bar->total;
-      atomic_write_barrier ();
-      bar->generation += 4;
+      __atomic_store_4 (&bar->generation, bar->generation + 4,
+			MEMMODEL_RELEASE);
       futex_wake ((int *) &bar->generation, INT_MAX);
     }
   else
     {
-      unsigned int generation = state;
-      do
-	do_wait ((int *) &bar->generation, generation);
-      while (bar->generation == generation);
+      do
+	do_wait ((int *) &bar->generation, state);
+      while (__atomic_load_4 (&bar->generation, MEMMODEL_ACQUIRE) == state);
     }
 }
 
@@ -81,15 +79,15 @@ gomp_team_barrier_wake (gomp_barrier_t *
 void
 gomp_team_barrier_wait_end (gomp_barrier_t *bar, gomp_barrier_state_t state)
 {
-  unsigned int generation;
+  unsigned int generation, gen;
 
   if (__builtin_expect ((state & 1) != 0, 0))
     {
       /* Next time we'll be awaiting TOTAL threads again.  */
       struct gomp_thread *thr = gomp_thread ();
       struct gomp_team *team = thr->ts.team;
+
       bar->awaited = bar->total;
-      atomic_write_barrier ();
       if (__builtin_expect (team->task_count, 0))
 	{
 	  gomp_barrier_handle_tasks (state);
@@ -97,7 +95,7 @@ gomp_team_barrier_wait_end (gomp_barrier
 	}
       else
 	{
-	  bar->generation = state + 3;
+	  __atomic_store_4 (&bar->generation, state + 3, MEMMODEL_RELEASE);
 	  futex_wake ((int *) &bar->generation, INT_MAX);
 	  return;
 	}
@@ -107,12 +105,16 @@ gomp_team_barrier
Re: Fix PR51298, libgomp barrier failure
On Mon, Nov 28, 2011 at 09:13:27AM +0100, Jakub Jelinek wrote:
> On Mon, Nov 28, 2011 at 10:12:09AM +1030, Alan Modra wrote:
> > --- libgomp/config/linux/bar.c	(revision 181718)
> > +++ libgomp/config/linux/bar.c	(working copy)
> > @@ -36,18 +36,15 @@ gomp_barrier_wait_end (gomp_barrier_t *b
> >    if (__builtin_expect ((state & 1) != 0, 0))
> >      {
> >        /* Next time we'll be awaiting TOTAL threads again.  */
> > -      bar->awaited = bar->total;
> > -      atomic_write_barrier ();
> > -      bar->generation += 4;
> > +      __atomic_store_4 (&bar->awaited, bar->total, MEMMODEL_RELEASE);
> > +      __atomic_add_fetch (&bar->generation, 4, MEMMODEL_ACQ_REL);
> >        futex_wake ((int *) &bar->generation, INT_MAX);
> >      }
>
> The above is unfortunate, bar->generation should be only modified from a
> single thread at this point, but the __atomic_add_fetch will enforce there
> an atomic instruction on it rather than just some kind of barrier.

I will try without the atomic add and see what happens.

--
Alan Modra
Australia Development Lab, IBM
Re: Fix PR51298, libgomp barrier failure
On Mon, Nov 28, 2011 at 10:12:09AM +1030, Alan Modra wrote:
> --- libgomp/config/linux/bar.c	(revision 181718)
> +++ libgomp/config/linux/bar.c	(working copy)
> @@ -36,18 +36,15 @@ gomp_barrier_wait_end (gomp_barrier_t *b
>    if (__builtin_expect ((state & 1) != 0, 0))
>      {
>        /* Next time we'll be awaiting TOTAL threads again.  */
> -      bar->awaited = bar->total;
> -      atomic_write_barrier ();
> -      bar->generation += 4;
> +      __atomic_store_4 (&bar->awaited, bar->total, MEMMODEL_RELEASE);
> +      __atomic_add_fetch (&bar->generation, 4, MEMMODEL_ACQ_REL);
>        futex_wake ((int *) &bar->generation, INT_MAX);
>      }

The above is unfortunate, bar->generation should be only modified from a
single thread at this point, but the __atomic_add_fetch will enforce there
an atomic instruction on it rather than just some kind of barrier.

	Jakub
Fix PR51298, libgomp barrier failure
This patch cures the remaining libgomp failures I see on power7.  I'll
admit to approaching this fix with a big hammer at first, liberally
using MEMMODEL_SEQ_CST barriers all over bar.h and bar.c, then
gradually reducing the number of places and strictness of the barriers.
This may still have a few unnecessary barriers, but I'm not confident
in removing any more.

What led me to believe that barriers were missing in the current code
was the nature of the test failures in the first place, and that the
existing code uses atomic_write_barrier in two places between writing
"awaited" and "generation", but does not have corresponding read
barriers.  Without read barriers processors are allowed to
speculatively read "generation" ahead of time, leading to use of a
stale value.

	PR libgomp/51298
	* config/linux/bar.h: Use atomic rather than sync builtins.
	* config/linux/bar.c: Likewise.  Add missing acquire and release
	synchronisation on generation field.
	* task.c (gomp_barrier_handle_tasks): Regain lock so as to not
	double unlock.
Index: libgomp/task.c
===================================================================
--- libgomp/task.c	(revision 181718)
+++ libgomp/task.c	(working copy)
@@ -273,6 +273,7 @@ gomp_barrier_handle_tasks (gomp_barrier_
 	      gomp_team_barrier_done (&team->barrier, state);
 	      gomp_mutex_unlock (&team->task_lock);
 	      gomp_team_barrier_wake (&team->barrier, 0);
+	      gomp_mutex_lock (&team->task_lock);
 	    }
 	}
     }
Index: libgomp/config/linux/bar.h
===================================================================
--- libgomp/config/linux/bar.h	(revision 181718)
+++ libgomp/config/linux/bar.h	(working copy)
@@ -50,7 +50,7 @@ static inline void gomp_barrier_init (go
 static inline void
 gomp_barrier_reinit (gomp_barrier_t *bar, unsigned count)
 {
-  __sync_fetch_and_add (&bar->awaited, count - bar->total);
+  __atomic_add_fetch (&bar->awaited, count - bar->total, MEMMODEL_ACQ_REL);
   bar->total = count;
 }
 
@@ -69,10 +69,8 @@ extern void gomp_team_barrier_wake (gomp
 static inline gomp_barrier_state_t
 gomp_barrier_wait_start (gomp_barrier_t *bar)
 {
-  unsigned int ret = bar->generation & ~3;
-  /* Do we need any barrier here or is __sync_add_and_fetch acting
-     as the needed LoadLoad barrier already?  */
-  ret += __sync_add_and_fetch (&bar->awaited, -1) == 0;
+  unsigned int ret = __atomic_load_4 (&bar->generation, MEMMODEL_ACQUIRE) & ~3;
+  ret += __atomic_add_fetch (&bar->awaited, -1, MEMMODEL_ACQ_REL) == 0;
   return ret;
 }
Index: libgomp/config/linux/bar.c
===================================================================
--- libgomp/config/linux/bar.c	(revision 181718)
+++ libgomp/config/linux/bar.c	(working copy)
@@ -36,18 +36,15 @@ gomp_barrier_wait_end (gomp_barrier_t *b
   if (__builtin_expect ((state & 1) != 0, 0))
     {
       /* Next time we'll be awaiting TOTAL threads again.  */
-      bar->awaited = bar->total;
-      atomic_write_barrier ();
-      bar->generation += 4;
+      __atomic_store_4 (&bar->awaited, bar->total, MEMMODEL_RELEASE);
+      __atomic_add_fetch (&bar->generation, 4, MEMMODEL_ACQ_REL);
       futex_wake ((int *) &bar->generation, INT_MAX);
     }
   else
     {
-      unsigned int generation = state;
-      do
-	do_wait ((int *) &bar->generation, generation);
-      while (bar->generation == generation);
+      do
+	do_wait ((int *) &bar->generation, state);
+      while (__atomic_load_4 (&bar->generation, MEMMODEL_ACQUIRE) == state);
     }
 }
 
@@ -81,15 +78,15 @@ gomp_team_barrier_wake (gomp_barrier_t *
 void
 gomp_team_barrier_wait_end (gomp_barrier_t *bar, gomp_barrier_state_t state)
 {
-  unsigned int generation;
+  unsigned int generation, gen;
 
   if (__builtin_expect ((state & 1) != 0, 0))
     {
       /* Next time we'll be awaiting TOTAL threads again.  */
       struct gomp_thread *thr = gomp_thread ();
       struct gomp_team *team = thr->ts.team;
-      bar->awaited = bar->total;
-      atomic_write_barrier ();
+
+      __atomic_store_4 (&bar->awaited, bar->total, MEMMODEL_RELEASE);
       if (__builtin_expect (team->task_count, 0))
 	{
 	  gomp_barrier_handle_tasks (state);
@@ -97,7 +94,7 @@ gomp_team_barrier_wait_end (gomp_barrier
 	}
       else
 	{
-	  bar->generation = state + 3;
+	  __atomic_store_4 (&bar->generation, state + 3, MEMMODEL_RELEASE);
 	  futex_wake ((int *) &bar->generation, INT_MAX);
 	  return;
 	}
@@ -107,12 +104,14 @@ gomp_team_barrier_wait_end (gomp_barrier
   do
     {
       do_wait ((int *) &bar->generation, generation);
-      if (__builtin_expect (bar->generation & 1, 0))
+      gen = __atomic_load_4 (&bar->generation, MEMMODEL_ACQUIRE);
+      if (__builtin_expect (gen & 1, 0))
 	gomp_barrier_handle_tasks (state);
-      if ((bar->generation & 2))
+      gen = __atomic_load_4 (&bar->generation, MEMMODEL_ACQUIRE);