https://gcc.gnu.org/g:304d08fea9e68c5b4806532816b58d33a15a917d

commit r16-6913-g304d08fea9e68c5b4806532816b58d33a15a917d
Author: Matthew Malcolmson <[email protected]>
Date:   Tue Jan 20 03:54:51 2026 +0000

    libgomp: Ensure memory sync after performing tasks
    
    As described in PR 122356 there is a theoretical bug around not
    "publishing" user data written in a task when that task has been
    executed by a thread after entry to a barrier.
    
    Key points of the C memory model that are relevant:
    1) Memory writes can be seen in a different order in different threads.
    2) When one thread (A) reads a value with acquire memory ordering that
       another thread (B) has written with release memory ordering, then all
       data written in thread (B) before the write that set this value will
       be visible to thread (A) after that read.
    3) This point requires that the read and write operate on the same
       value.  The guarantee is one-way:  It specifies that thread (A) will
       see the writes that thread (B) has performed before the specified
       write.  It does not specify that thread (B) will see writes that
       thread (A) has performed before reading this value.
    
    Outline of the issue:
    1) While there is a memory sync at entry to the barrier, user code can
       be ran after threads have all entered the barrier.
    2) There are various points where a memory sync can occur after entry to
       the barrier:
       - One thread getting the `task_lock` mutex that another thread has
         released.
       - Last thread incrementing `bar->generation` with `MEMMODEL_RELEASE`
         and some other thread reading it with `MEMMODEL_ACQUIRE`.
       However there are code paths that can avoid these points.
    3) On the code-paths that can avoid these points we could have no memory
       synchronisation between a write to user data that happened in a task
       executed after entry to the barrier, and some other thread running
       the implicit task after the barrier.  Hence that "other thread" may
       read a stale value that should have been overwritten in the explicit
       task.
    
    There are two code-paths that I believe I've identified:
    1) The last thread sees `task_count == 0` and increments the generation
       with `MEMMODEL_RELEASE` before continuing on to the next implicit
       task.
       If some other thread had executed a task that wrote user data I
       don't see any way in which an acquire-release ordering *from* the
       thread writing user data *to* the last thread would have been formed.
    2) After all threads have entered the barrier.  Some thread (A) is
       waiting in `do_wait`.  Some other thread (B) completes a task writing
       user data.  Thread (B) increments the generation using
       `gomp_team_barrier_done` (non atomically -- hence not allowing the
       formation of any acquire-release ordering with this write).  Thread
       (A) reads that data with `MEMMODEL_ACQUIRE`, but since the write was
       not atomic that does not form an ordering.
    
    This patch makes two changes:
    1) The write of `task_count == 0` in `gomp_barrier_handle_tasks` is done
       atomically while the read of `task_count` in
       `gomp_team_barrier_wait_end` is also made atomic.  This addresses the
       first case by forming an acquire-release ordering *from* the thread
       executing tasks *to* the thread that will increment the generation
       and continue.
    2) The write of `bar->generation` via `gomp_team_barrier_done` called
       from `gomp_barrier_handle_tasks` is done atomically.  This means that
       it will form an acquire-release synchronisation with the existing
       atomic read of `bar->generation` in the main loop of
       `gomp_team_barrier_wait_end`.
    
    Testing done:
    - Bootstrap & regtest on aarch64 and x86_64.
      - With & without _LIBGOMP_CHECKING_.
      - Testsuite with & without OMP_WAIT_POLICY=passive
    - Cross compilation & regtest on arm.
    - TSAN done on this as part of all my upstream patches.
    
    libgomp/ChangeLog:
            PR libgomp/122356
            * config/gcn/bar.c (gomp_team_barrier_wait_end): Atomically read
            team->task_count.
            (gomp_team_barrier_wait_cancel_end): Likewise.
            * config/gcn/bar.h (gomp_team_barrier_done): Atomically write
            bar->generation.
            * config/linux/bar.c (gomp_team_barrier_wait_end): Atomically
            read team->task_count.
            (gomp_team_barrier_wait_cancel_end): Likewise.
            * config/linux/bar.h (gomp_team_barrier_done): Atomically write
            bar->generation.
            * config/posix/bar.c (gomp_team_barrier_wait_end): Atomically
            read team->task_count.
            (gomp_team_barrier_wait_cancel_end): Likewise.
            * config/posix/bar.h (gomp_team_barrier_done): Atomically write
            bar->generation.
            * config/rtems/bar.h (gomp_team_barrier_done): Atomically write
            bar->generation.
            * task.c (gomp_barrier_handle_tasks): Atomically write
            team->task_count when decrementing to zero.
            * testsuite/libgomp.c/pr122356.c: New test.
    
    Signed-off-by: Matthew Malcomson <[email protected]>

Diff:
---
 libgomp/config/gcn/bar.c               |  8 +++++--
 libgomp/config/gcn/bar.h               |  5 ++++-
 libgomp/config/linux/bar.c             |  8 +++++--
 libgomp/config/linux/bar.h             |  5 ++++-
 libgomp/config/posix/bar.c             |  8 +++++--
 libgomp/config/posix/bar.h             |  5 ++++-
 libgomp/config/rtems/bar.h             |  5 ++++-
 libgomp/task.c                         |  8 ++++++-
 libgomp/testsuite/libgomp.c/pr122356.c | 40 ++++++++++++++++++++++++++++++++++
 9 files changed, 81 insertions(+), 11 deletions(-)

diff --git a/libgomp/config/gcn/bar.c b/libgomp/config/gcn/bar.c
index 10c3f5d13623..3045587f0f3c 100644
--- a/libgomp/config/gcn/bar.c
+++ b/libgomp/config/gcn/bar.c
@@ -89,7 +89,9 @@ gomp_team_barrier_wait_end (gomp_barrier_t *bar, 
gomp_barrier_state_t state)
 
       bar->awaited = bar->total;
       team->work_share_cancelled = 0;
-      if (__builtin_expect (team->task_count, 0))
+      unsigned task_count
+       = __atomic_load_n (&team->task_count, MEMMODEL_ACQUIRE);
+      if (__builtin_expect (task_count, 0))
        {
          gomp_barrier_handle_tasks (state);
          state &= ~BAR_WAS_LAST;
@@ -164,7 +166,9 @@ gomp_team_barrier_wait_cancel_end (gomp_barrier_t *bar,
 
       bar->awaited = bar->total;
       team->work_share_cancelled = 0;
-      if (__builtin_expect (team->task_count, 0))
+      unsigned task_count
+       = __atomic_load_n (&team->task_count, MEMMODEL_ACQUIRE);
+      if (__builtin_expect (task_count, 0))
        {
          gomp_barrier_handle_tasks (state);
          state &= ~BAR_WAS_LAST;
diff --git a/libgomp/config/gcn/bar.h b/libgomp/config/gcn/bar.h
index 0507efb7d2d1..6e838ff54a89 100644
--- a/libgomp/config/gcn/bar.h
+++ b/libgomp/config/gcn/bar.h
@@ -162,7 +162,10 @@ gomp_team_barrier_cancelled (gomp_barrier_t *bar)
 static inline void
 gomp_team_barrier_done (gomp_barrier_t *bar, gomp_barrier_state_t state)
 {
-  bar->generation = (state & -BAR_INCR) + BAR_INCR;
+  /* Need the atomic store for acquire-release synchronisation with the
+     load in `gomp_team_barrier_wait_{cancel_,}end`.  See PR112356  */
+  __atomic_store_n (&bar->generation, (state & -BAR_INCR) + BAR_INCR,
+                   MEMMODEL_RELEASE);
 }
 
 static inline bool
diff --git a/libgomp/config/linux/bar.c b/libgomp/config/linux/bar.c
index 2a1b052b11e6..bbdfc8963918 100644
--- a/libgomp/config/linux/bar.c
+++ b/libgomp/config/linux/bar.c
@@ -90,7 +90,9 @@ gomp_team_barrier_wait_end (gomp_barrier_t *bar, 
gomp_barrier_state_t state)
 
       bar->awaited = bar->total;
       team->work_share_cancelled = 0;
-      if (__builtin_expect (team->task_count, 0))
+      unsigned task_count
+       = __atomic_load_n (&team->task_count, MEMMODEL_ACQUIRE);
+      if (__builtin_expect (task_count, 0))
        {
          gomp_barrier_handle_tasks (state);
          state &= ~BAR_WAS_LAST;
@@ -154,7 +156,9 @@ gomp_team_barrier_wait_cancel_end (gomp_barrier_t *bar,
 
       bar->awaited = bar->total;
       team->work_share_cancelled = 0;
-      if (__builtin_expect (team->task_count, 0))
+      unsigned task_count
+       = __atomic_load_n (&team->task_count, MEMMODEL_ACQUIRE);
+      if (__builtin_expect (task_count, 0))
        {
          gomp_barrier_handle_tasks (state);
          state &= ~BAR_WAS_LAST;
diff --git a/libgomp/config/linux/bar.h b/libgomp/config/linux/bar.h
index b1fff01105a7..4dc0d3cca994 100644
--- a/libgomp/config/linux/bar.h
+++ b/libgomp/config/linux/bar.h
@@ -162,7 +162,10 @@ gomp_team_barrier_cancelled (gomp_barrier_t *bar)
 static inline void
 gomp_team_barrier_done (gomp_barrier_t *bar, gomp_barrier_state_t state)
 {
-  bar->generation = (state & -BAR_INCR) + BAR_INCR;
+  /* Need the atomic store for acquire-release synchronisation with the
+     load in `gomp_team_barrier_wait_{cancel_,}end`.  See PR112356  */
+  __atomic_store_n (&bar->generation, (state & -BAR_INCR) + BAR_INCR,
+                   MEMMODEL_RELEASE);
 }
 
 static inline bool
diff --git a/libgomp/config/posix/bar.c b/libgomp/config/posix/bar.c
index ce69905ba674..c46659bd2645 100644
--- a/libgomp/config/posix/bar.c
+++ b/libgomp/config/posix/bar.c
@@ -123,7 +123,9 @@ gomp_team_barrier_wait_end (gomp_barrier_t *bar, 
gomp_barrier_state_t state)
       struct gomp_team *team = thr->ts.team;
 
       team->work_share_cancelled = 0;
-      if (team->task_count)
+      unsigned task_count
+       = __atomic_load_n (&team->task_count, MEMMODEL_ACQUIRE);
+      if (task_count)
        {
          gomp_barrier_handle_tasks (state);
          if (n > 0)
@@ -185,7 +187,9 @@ gomp_team_barrier_wait_cancel_end (gomp_barrier_t *bar,
       struct gomp_team *team = thr->ts.team;
 
       team->work_share_cancelled = 0;
-      if (team->task_count)
+      unsigned task_count
+       = __atomic_load_n (&team->task_count, MEMMODEL_ACQUIRE);
+      if (task_count)
        {
          gomp_barrier_handle_tasks (state);
          if (n > 0)
diff --git a/libgomp/config/posix/bar.h b/libgomp/config/posix/bar.h
index 5a175c228c26..026daca793d5 100644
--- a/libgomp/config/posix/bar.h
+++ b/libgomp/config/posix/bar.h
@@ -152,7 +152,10 @@ gomp_team_barrier_cancelled (gomp_barrier_t *bar)
 static inline void
 gomp_team_barrier_done (gomp_barrier_t *bar, gomp_barrier_state_t state)
 {
-  bar->generation = (state & -BAR_INCR) + BAR_INCR;
+  /* Need the atomic store for acquire-release synchronisation with the
+     load in `gomp_team_barrier_wait_{cancel_,}end`.  See PR112356  */
+  __atomic_store_n (&bar->generation, (state & -BAR_INCR) + BAR_INCR,
+                   MEMMODEL_RELEASE);
 }
 
 static inline bool
diff --git a/libgomp/config/rtems/bar.h b/libgomp/config/rtems/bar.h
index 61fa91f300f2..80fb1cd3be87 100644
--- a/libgomp/config/rtems/bar.h
+++ b/libgomp/config/rtems/bar.h
@@ -164,7 +164,10 @@ gomp_team_barrier_cancelled (gomp_barrier_t *bar)
 static inline void
 gomp_team_barrier_done (gomp_barrier_t *bar, gomp_barrier_state_t state)
 {
-  bar->generation = (state & -BAR_INCR) + BAR_INCR;
+  /* Need the atomic store for acquire-release synchronisation with the
+     load in `gomp_team_barrier_wait_{cancel_,}end`.  See PR112356  */
+  __atomic_store_n (&bar->generation, (state & -BAR_INCR) + BAR_INCR,
+                   MEMMODEL_RELEASE);
 }
 
 static inline bool
diff --git a/libgomp/task.c b/libgomp/task.c
index 554636aadd5c..cbba28516e3f 100644
--- a/libgomp/task.c
+++ b/libgomp/task.c
@@ -1702,7 +1702,13 @@ gomp_barrier_handle_tasks (gomp_barrier_state_t state)
              if (do_wake > new_tasks)
                do_wake = new_tasks;
            }
-         --team->task_count;
+         /* Need to use RELEASE to sync with barrier read outside of the
+            tasking code (See PR122356).  Only care when decrementing to zero
+            because that's what the barrier cares about.  */
+         if (team->task_count == 1)
+           __atomic_store_n (&team->task_count, 0, MEMMODEL_RELEASE);
+         else
+           team->task_count--;
        }
     }
 }
diff --git a/libgomp/testsuite/libgomp.c/pr122356.c 
b/libgomp/testsuite/libgomp.c/pr122356.c
new file mode 100644
index 000000000000..76879511ff2f
--- /dev/null
+++ b/libgomp/testsuite/libgomp.c/pr122356.c
@@ -0,0 +1,40 @@
+#include <omp.h>
+
+void abort ();
+
+#define NUM_THREADS 8
+unsigned full_data[NUM_THREADS] = {0};
+#pragma omp declare target enter(full_data)
+
+void
+test ()
+{
+#pragma omp parallel num_threads(8)
+  {
+#pragma omp for
+    for (int i = 0; i < 10; i++)
+#pragma omp task
+      {
+       full_data[omp_get_thread_num ()] += 1;
+      }
+#pragma omp barrier
+
+    unsigned total = 0;
+    for (int i = 0; i < NUM_THREADS; i++)
+      total += full_data[i];
+
+    if (total != 10)
+      abort ();
+  }
+}
+#pragma omp declare target enter(test)
+
+
+int
+main ()
+{
+  test ();
+
+  #pragma omp target
+    test ();
+}

Reply via email to