Ping

On 1/29/26 01:59, Prathamesh Kulkarni wrote:


-----Original Message-----
From: [email protected] <[email protected]>
Sent: 26 November 2025 18:42
To: [email protected]; Jakub Jelinek <[email protected]>; Tobias
Burnus <[email protected]>
Cc: Julian Brown <[email protected]>; Thomas Schwinge
<[email protected]>; Andrew Stubbs <[email protected]>; Tom de
Vries <[email protected]>; Sebastian Huber <sebastian.huber@embedded-
brains.de>; Matthew Malcomson <[email protected]>
Subject: [PATCH 3/5] libgomp: Implement "flat" barrier for linux/
target



From: Matthew Malcomson <[email protected]>

Working on PR119588.  From that PR:
We've seen on some internal workloads (NVPL BLAS running GEMM routine
on a small matrix) that the overhead of a `#pragma omp parallel`
statement when running with a high number of cores (72 or 144) is much
higher with the libgomp implementation than with LLVM's libomp.

In a program which has both some work that can be handled with high
parallelism (so OpenMP is running with many threads) and a large number
of small pieces of work that need to be performed with low overhead,
this overhead has been seen to accumulate into a significant slowdown.

------------------------------
Points that I would like feedback on:
- Is this extra complexity OK, especially for the futex_waitv
   fallback?
   - I have thought about creating yet another target (in the same way
     that the linux/ target is for kernels where `futex` is available).
     Would this approach be favoured?
- Are the changes in `gomp_team_start` significant enough for other
   targets that it's worth bringing that function into the linux/ target
   (similar to what I see the nvptx/ target has done for
   `gomp_team_start`)?  We could have some `ifdef` for the alternate
   barrier API rather than trying to modify all the other interfaces to
   match a slightly modified API.
   - Note that the extra changes required here "disappear" in the
     second patch in this patch series.  This is because the second
     patch in this series removes the "simple-barrier" from the main
     loop in gomp_thread_start for non-nested threads.  Since that
     simple-barrier is not in the main loop it can be a
     slower-but-simpler version of the barrier that can very easily be
     changed to handle a different number of threads.  Hence the extra
     considerations around adjusting the number of threads needed in
     this patch are much smaller.
- As it stands I've not adjusted other targets except posix/ to work
   with this alternate approach as I hope that the second patch in the
   series will also be accepted (removing the need to implement some
   slightly tricky functionality in each backend).
   - Please do mention if this seems acceptable.
   - I adjusted the posix/ target since I could test it and I wanted to
     ensure that the new API could work for other targets.

------------------------------
This patch introduces a barrier that performs better on the
microbenchmark provided in PR119588.  I have run various higher-level
benchmarks and have not seen any difference I can claim is outside
noise.

The outline of the barrier is:
1) Each thread associated with a barrier has its own ID (currently the
    team_id of the team that the current thread is a part of).
2) When a thread arrives at the barrier it marks its "arrived flag" as
    having arrived.
3) ID 0 (which is always the thread that does the bookkeeping for
    this team; at the top level it's the primary thread) waits on each
    of the other threads' "arrived" flags.  It executes tasks while
    waiting for these threads to arrive.
4) Once ID 0 sees all other threads have arrived it signals to all
    secondary threads that the barrier is complete via incrementing the
    global generation (as usual).

The fundamental difference that affects performance is that we have
thread-local "arrived" flags instead of a single counter.  That means
there is less contention, which appears to speed up the barrier
significantly when threads are hitting it very hard.
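
To make that concrete, here is a rough self-contained model of the
scheme above (illustrative only: the names, the fixed thread count and
the plain spinning are inventions of this sketch; the real barrier
futex-waits and, for thread 0, executes tasks instead of spinning):

#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define NTHREADS 8
#define BAR_INCR 1

struct flat_bar
{
  _Atomic unsigned generation;          /* global generation word         */
  _Atomic unsigned arrived[NTHREADS];   /* one "arrived" flag per thread  */
};

static struct flat_bar model_bar;       /* zero-initialised  */

static void
flat_barrier (struct flat_bar *b, unsigned id)
{
  unsigned gen = atomic_load_explicit (&b->generation, memory_order_relaxed);
  if (id == 0)
    {
      /* Coordinating thread, step (3): gather every other thread's flag.  */
      for (unsigned i = 1; i < NTHREADS; i++)
        while (atomic_load_explicit (&b->arrived[i], memory_order_acquire)
               != gen + BAR_INCR)
          ;
      /* Step (4): release all secondary threads at once.  */
      atomic_store_explicit (&b->generation, gen + BAR_INCR,
                             memory_order_release);
    }
  else
    {
      /* Step (2): flag our own arrival, then wait for the global release.  */
      atomic_store_explicit (&b->arrived[id], gen + BAR_INCR,
                             memory_order_release);
      while (atomic_load_explicit (&b->generation, memory_order_acquire)
             == gen)
        ;
    }
}

static void *
worker (void *arg)
{
  unsigned id = (unsigned) (uintptr_t) arg;
  for (int round = 0; round < 3; round++)
    flat_barrier (&model_bar, id);
  return NULL;
}

int
main (void)
{
  pthread_t threads[NTHREADS];
  for (uintptr_t i = 0; i < NTHREADS; i++)
    pthread_create (&threads[i], NULL, worker, (void *) i);
  for (unsigned i = 0; i < NTHREADS; i++)
    pthread_join (threads[i], NULL);
  puts ("all rounds completed");
  return 0;
}

Only thread 0 ever looks at another thread's flag (and only reads it),
which is what removes the contention on a single shared counter.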

Other interesting differences are:
1) The "coordinating" thread is pre-determined rather than "whichever
    thread hits the barrier last".
    - This has some knock-on effects w.r.t. how the barrier is used.
      (See e.g. the changes in work.c).
    - This also means that the behaviour of the pre-determined
      "coordinating" thread must be more complicated while it's waiting
      on other threads to arrive.  Especially see the task handling in
      `gomp_team_barrier_ensure_last`.
    - Because we assign the "coordinating" thread to be the primary
      thread, between points (3) and (4) above the primary thread can
      see the results of the operations of all secondary threads.  This
      is important for another optimisation I hope to make later, where
      we only go through one barrier per iteration in the main
      execution loop of `gomp_thread_start` for top-level threads.
2) When a barrier needs to be adjusted so that different threads have
    different IDs we must serialise all threads before this adjustment.
    - The previous design had no assignment of threads to some ID,
      which meant the `gomp_simple_barrier_reinit` function in
      `gomp_team_start` simply needed to handle increasing or
      decreasing the number of threads a barrier is taking care of.

The two largest pieces of complexity in this work relate to those
differences.

The work in `gomp_team_barrier_ensure_last` (and the cancel variant)
needs to watch for either a secondary thread arriving or a task being
enqueued.  When neither of these events occurs for a while we need
`futex_waitv` (or the fallback approach) to sleep until one of the two
events happens.
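
In outline, that wait loop has the following shape (an illustrative
skeleton only; the helper names are stubs invented so the snippet
compiles, standing in for the real flag check, for
gomp_barrier_handle_tasks, and for the futex_waitv/fallback sleep):

#include <stdbool.h>

/* Stubs standing in for the real checks and actions.  */
static bool thread_has_arrived (unsigned i) { (void) i; return true; }
static bool task_is_pending (void) { return false; }
static void run_one_task (void) {}
static void sleep_until_flag_or_task_changes (unsigned i) { (void) i; }

static void
coordinator_wait (unsigned nthreads)
{
  for (unsigned i = 1; i < nthreads; i++)
    while (!thread_has_arrived (i))
      {
        if (task_is_pending ())
          run_one_task ();                        /* useful work while waiting */
        else
          sleep_until_flag_or_task_changes (i);   /* futex_waitv or fallback   */
      }
}
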
- N.b. this fallback implementation for futex_waitv is particularly
   complex.  It is needed for Linux kernels older than 5.16, which
   don't have the `futex_waitv` syscall.  It is the only time where any
   thread except the "primary" (for this team) adjusts
   `bar->generation` while in the barrier, and it is the only time
   where any thread modifies the "arrived" flag of some other thread.
   The logic is that, after deciding to stop spinning on a given
   secondary thread's "arrived" flag, the primary thread sets a bit on
   that secondary thread's "arrived" flag to indicate that the primary
   thread is waiting on this thread.  Then the primary thread does a
   `futex_wait` on `bar->generation`.  If the secondary thread sees
   this bit set on its "arrived" flag it sets another bit,
   `BAR_SECONDARY_ARRIVED`, on `bar->generation` to wake up the primary
   thread.  Other threads performing tasking ignore that bit.  (A
   self-contained model of this handshake follows this list.)
- The possibility of some secondary thread adjusting `bar->generation`
   without taking any lock means that the `gomp_team_barrier_*`
   functions used in general code that modify this variable now all
   need to be atomic.  These memory modifications don't need to
   synchronise with each other -- they are sending signals to
   completely different places -- but we need to make sure that the RMW
   is atomic and doesn't overwrite a bit set elsewhere.
- Since we are now adjusting bar->generation atomically, we can do the
   same in `gomp_team_barrier_cancel` and avoid taking the task_lock
   mutex there.
   - N.b. I believe we're also avoiding some existing UB, since there
     is an atomic store to `bar->generation` in the
     gomp_barrier_wait_end variations done outside of the `task_lock`
     mutex.  In C I believe that any write & read of a single variable
     that are not ordered by "happens-before" must be atomic to avoid
     UB, even if on the architectures that libgomp supports the reads
     and writes actually emitted would be atomic and any data race
     would be benign.
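
For concreteness, here is a self-contained model of that fallback
handshake between one primary and one secondary thread (illustrative
only: the names, the bit values and the pthread condition variable
standing in for futex_wait/futex_wake are inventions of this sketch,
not the patch's code):

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define GEN_INCR            8u  /* added to the per-thread flag on arrival  */
#define PRIMARY_WAITING_TG  1u  /* "primary went to sleep on this flag"     */
#define SECONDARY_ARRIVED   2u  /* bit in the global generation word        */

static _Atomic unsigned global_gen;   /* stands in for bar->generation      */
static _Atomic unsigned thread_flag;  /* one secondary's thread-local flag  */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t wakeup = PTHREAD_COND_INITIALIZER;  /* futex stand-in */

static void *
secondary (void *arg)
{
  /* Arrival: bump our own flag; if the primary announced it is sleeping
     on us, set the bit on the global word and wake it.  */
  unsigned prev = atomic_fetch_add (&thread_flag, GEN_INCR);
  if (prev & PRIMARY_WAITING_TG)
    {
      atomic_fetch_or (&global_gen, SECONDARY_ARRIVED);
      pthread_mutex_lock (&lock);
      pthread_cond_broadcast (&wakeup);   /* futex_wake on the global word  */
      pthread_mutex_unlock (&lock);
    }
  return arg;
}

int
main (void)
{
  pthread_t t;
  pthread_create (&t, NULL, secondary, NULL);

  /* Primary: stop spinning on thread_flag, announce that we will sleep.  */
  unsigned seen = atomic_fetch_add (&thread_flag, PRIMARY_WAITING_TG);
  if (seen & GEN_INCR)
    /* The secondary arrived while we were deciding to sleep: just undo.  */
    atomic_fetch_and (&thread_flag, ~PRIMARY_WAITING_TG);
  else
    {
      pthread_mutex_lock (&lock);
      while (!(atomic_load (&global_gen) & SECONDARY_ARRIVED))
        pthread_cond_wait (&wakeup, &lock);   /* futex_wait on the global  */
      pthread_mutex_unlock (&lock);
      /* Consume the wakeup and clear our announcement.  */
      atomic_fetch_and (&global_gen, ~SECONDARY_ARRIVED);
      atomic_fetch_and (&thread_flag, ~PRIMARY_WAITING_TG);
    }

  pthread_join (t, NULL);
  puts ("handshake complete");
  return 0;
}

Note that both sides only ever use atomic read-modify-write operations
on the shared words, which is the point above about not overwriting a
bit that some other thread may be setting concurrently.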

The adjustment of how barriers are initialised in `gomp_team_start` is
related to the second point above.  If we are removing some threads
and adding others to the thread pool then we need to serialise
execution before adjusting the barrier state so that it is ready for
new threads to take the original team IDs.
This is a clear cost to pay for this different barrier, but it does
not seem to have cost much in the benchmarks I've run.
- To note again, this cost is not in the final version of libgomp after
   the entirety of the patch series I'm posting.  That is because in
   the second patch we no longer need to adjust the size of one of
   these alternate barrier implementations.

------------------------------
I've had to adjust the interface to the other barriers.  Most of the
cases simply add an extra unused parameter.  There are also some
dummied-out functions to add.

The biggest difference from the other barrier interfaces is in the
reinitialisation steps.  The posix/ barrier has a reinit function that
adjusts the number of threads it will wait on.  There is no
serialisation required in order to perform this re-init.  Instead,
when finishing some threads and starting others we increase the
"total" to cover both the outgoing and incoming threads, then once all
threads have arrived at the barrier decrease the "total" to be correct
for "the next time around" (a toy illustration follows the next point).
- Again worth remembering that these extra reinit functions disappear
   when the second patch of this series is applied.
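
As a toy illustration of that "grow, then trim once everyone has
arrived" idea (a plain counting barrier written for this sketch, not
libgomp's posix/ code; all names are invented):

#include <pthread.h>

struct counting_bar
{
  pthread_mutex_t lock;
  pthread_cond_t cond;
  unsigned total;        /* threads expected in the current round        */
  unsigned next_total;   /* threads expected once the transition is done */
  unsigned arrived;
  unsigned generation;
};

static void
counting_bar_init (struct counting_bar *b, unsigned nthreads)
{
  pthread_mutex_init (&b->lock, NULL);
  pthread_cond_init (&b->cond, NULL);
  b->total = b->next_total = nthreads;
  b->arrived = 0;
  b->generation = 0;
}

/* Called without serialising the team: the departing threads and their
   replacements all pass through one round together.  */
static void
counting_bar_reinit (struct counting_bar *b, unsigned leaving, unsigned joining)
{
  pthread_mutex_lock (&b->lock);
  b->next_total = b->total - leaving + joining;
  b->total += joining;   /* this round waits for both old and new threads  */
  pthread_mutex_unlock (&b->lock);
}

static void
counting_bar_wait (struct counting_bar *b)
{
  pthread_mutex_lock (&b->lock);
  unsigned gen = b->generation;
  if (++b->arrived == b->total)
    {
      /* Last arrival: trim to the post-transition size and release.  */
      b->total = b->next_total;
      b->arrived = 0;
      b->generation++;
      pthread_cond_broadcast (&b->cond);
    }
  else
    while (gen == b->generation)
      pthread_cond_wait (&b->cond, &b->lock);
  pthread_mutex_unlock (&b->lock);
}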

Testing done:
- Bootstrap & regtest on aarch64 and x86_64.
   - With & without _LIBGOMP_CHECKING_.
   - Testsuite with & without OMP_WAIT_POLICY=passive
   - With and without configure `--enable-linux-futex=no` for posix
     target.
   - Forcing use of futex_waitv syscall, and of fallback.
- Cross compilation & regtest on arm.
- TSAN done on this as part of all my upstream patches.
Hi,
ping: https://gcc.gnu.org/pipermail/gcc-patches/2025-November/702031.html

Thanks,
Prathamesh

libgomp/ChangeLog:

         * barrier.c (GOMP_barrier, GOMP_barrier_cancel): Pass team_id
         to barrier functions.
         * config/linux/bar.c (gomp_barrier_ensure_last): New.
         (gomp_assert_and_increment_flag): New.
         (gomp_team_barrier_ensure_last): New.
         (gomp_assert_and_increment_cancel_flag): New.
         (gomp_team_barrier_ensure_cancel_last): New.
         (gomp_barrier_wait_end): Extra assertions.
         (gomp_barrier_wait, gomp_team_barrier_wait,
         gomp_team_barrier_wait_cancel): Rewrite to use corresponding
         gomp_barrier_ensure_last variant.
         (gomp_barrier_wait_last): Rewrite for new parameter.
         (gomp_team_barrier_wait_end): Update to handle new barrier
         flags & new parameter.
         (gomp_team_barrier_wait_cancel_end): Likewise.
         (gomp_team_barrier_wait_final): Delegate to
         gomp_team_barrier_wait.
         (gomp_team_barrier_cancel): Rewrite to account for futex_waitv
         fallback.
         * config/linux/bar.h (struct gomp_barrier_t): Add
         `threadgens`, remove `awaited` and `awaited_final`.
         (BAR_SECONDARY_ARRIVED): New.
         (BAR_SECONDARY_CANCELLABLE_ARRIVED): New.
         (BAR_CANCEL_INCR): New.
         (BAR_TASK_PENDING): Shift to accommodate new bits.
         (BAR_INCR): Shift to accommodate new bits.
         (BAR_FLAGS_MASK): New.
         (BAR_GEN_MASK): New.
         (BAR_BOTH_GENS_MASK): New.
         (BAR_CANCEL_GEN_MASK): New.
         (BAR_INCREMENT_CANCEL): New.
         (PRIMARY_WAITING_TG): New.
         (gomp_assert_seenflags): New.
         (gomp_barrier_has_space): New.
         (gomp_barrier_minimal_reinit): New.
         (gomp_barrier_init): Account for threadgens, remove awaited
         and awaited_final.
         (gomp_barrier_reinit): Remove.
         (gomp_barrier_reinit_1): New.
         (gomp_barrier_destroy): Free new `threadgens` member.
         (gomp_barrier_wait): Add new `id` parameter.
         (gomp_barrier_wait_last): Likewise.
         (gomp_barrier_wait_end): Likewise.
         (gomp_team_barrier_wait): Likewise.
         (gomp_team_barrier_wait_final): Likewise.
         (gomp_team_barrier_wait_cancel): Likewise.
         (gomp_team_barrier_wait_cancel_end): Likewise.
         (gomp_barrier_reinit_2): New (dummy) implementation.
         (gomp_team_barrier_ensure_last): New declaration.
         (gomp_barrier_ensure_last): New declaration.
         (gomp_team_barrier_ensure_cancel_last): New declaration.
         (gomp_assert_and_increment_flag): New declaration.
         (gomp_assert_and_increment_cancel_flag): New declaration.
         (gomp_barrier_wait_start): Rewrite.
         (gomp_barrier_wait_cancel_start): Rewrite.
         (gomp_barrier_wait_final_start): Rewrite.
         (gomp_reset_cancellable_primary_threadgen): New.
         (gomp_team_barrier_set_task_pending): Make atomic.
         (gomp_team_barrier_clear_task_pending): Make atomic.
         (gomp_team_barrier_set_waiting_for_tasks): Make atomic.
         (gomp_team_barrier_waiting_for_tasks): Make atomic.
         (gomp_increment_gen): New to account for cancel/non-cancel
         distinction when incrementing generation.
         (gomp_team_barrier_done): Use gomp_increment_gen.
         (gomp_barrier_state_is_incremented): Use gomp_increment_gen.
         (gomp_barrier_has_completed): Load generation atomically.
         (gomp_barrier_prepare_reinit): New.
         * config/linux/wait.h (FUTEX_32): New.
         (spin_count): Extracted from do_spin.
         (do_spin): Use spin_count.
         * config/posix/bar.c (gomp_barrier_wait,
         gomp_team_barrier_wait_end, gomp_team_barrier_wait_cancel_end,
         gomp_team_barrier_wait, gomp_team_barrier_wait_cancel): Add
         and pass through new parameters.
         * config/posix/bar.h (gomp_barrier_wait, gomp_team_barrier_wait,
         gomp_team_barrier_wait_cancel,
         gomp_team_barrier_wait_cancel_end, gomp_barrier_wait_start,
         gomp_barrier_wait_cancel_start, gomp_team_barrier_wait_final,
         gomp_barrier_wait_last, gomp_team_barrier_done): Add and pass
         through new id parameter.
         (gomp_barrier_prepare_reinit): New.
         (gomp_barrier_minimal_reinit): New.
         (gomp_barrier_reinit_1): New.
         (gomp_barrier_reinit_2): New.
         (gomp_barrier_has_space): New.
         (gomp_team_barrier_ensure_last): New.
         (gomp_team_barrier_ensure_cancel_last): New.
         (gomp_reset_cancellable_primary_threadgen): New.
         * config/posix/simple-bar.h (gomp_simple_barrier_reinit):
         (gomp_simple_barrier_minimal_reinit): New.
         (gomp_simple_barrier_reinit_1): New.
         (gomp_simple_barrier_wait): Add new parameter.
         (gomp_simple_barrier_wait_last): Add new parameter.
         (gomp_simple_barrier_prepare_reinit): New.
         (gomp_simple_barrier_reinit_2): New.
         (gomp_simple_barrier_has_space): New.
         * libgomp.h (gomp_assert): New.
         (gomp_barrier_handle_tasks): Add new parameter.
         * single.c (GOMP_single_copy_start): Pass new parameter.
         (GOMP_single_copy_end): Pass new parameter.
         * task.c (gomp_barrier_handle_tasks): Add and pass new
         parameter through.
         (GOMP_taskgroup_end): Pass new parameter.
         (GOMP_workshare_task_reduction_unregister): Pass new
         parameter.
         * team.c (gomp_thread_start): Pass new parameter.
         (gomp_free_pool_helper): Likewise.
         (gomp_free_thread): Likewise.
         (gomp_team_start): Use new API for reinitialising a barrier.
         This requires passing information on which threads are being
         reused and which are not.
         (gomp_team_end): Pass new parameter.
         (gomp_pause_pool_helper): Likewise.
         (gomp_pause_host): Likewise.
         * work.c (gomp_work_share_end): Directly use
         `gomp_team_barrier_ensure_last`.
         (gomp_work_share_end_cancel): Similar for
         gomp_team_barrier_ensure_cancel_last but also ensure that
         thread local generations are reset if barrier is cancelled.
         * config/linux/futex_waitv.h: New file.
         * testsuite/libgomp.c/primary-thread-tasking.c: New test.

Signed-off-by: Matthew Malcomson <[email protected]>
---
  libgomp/barrier.c                             |   4 +-
  libgomp/config/linux/bar.c                    | 660 ++++++++++++++++--
  libgomp/config/linux/bar.h                    | 403 +++++++++--
  libgomp/config/linux/futex_waitv.h            | 129 ++++
  libgomp/config/linux/wait.h                   |  15 +-
  libgomp/config/posix/bar.c                    |  29 +-
  libgomp/config/posix/bar.h                    | 101 ++-
  libgomp/config/posix/simple-bar.h             |  39 +-
  libgomp/libgomp.h                             |  11 +-
  libgomp/single.c                              |   4 +-
  libgomp/task.c                                |  12 +-
  libgomp/team.c                                | 127 +++-
  .../libgomp.c/primary-thread-tasking.c        |  80 +++
  libgomp/work.c                                |  26 +-
  14 files changed, 1451 insertions(+), 189 deletions(-)
  create mode 100644 libgomp/config/linux/futex_waitv.h
  create mode 100644 libgomp/testsuite/libgomp.c/primary-thread-tasking.c

diff --git a/libgomp/barrier.c b/libgomp/barrier.c
index 244dadd1adb..12e54a1b2b6 100644
--- a/libgomp/barrier.c
+++ b/libgomp/barrier.c
@@ -38,7 +38,7 @@ GOMP_barrier (void)
    if (team == NULL)
      return;

-  gomp_team_barrier_wait (&team->barrier);
+  gomp_team_barrier_wait (&team->barrier, thr->ts.team_id);
  }

  bool
@@ -50,5 +50,5 @@ GOMP_barrier_cancel (void)
    /* The compiler transforms to barrier_cancel when it sees that the
       barrier is within a construct that can cancel.  Thus we should
       never have an orphaned cancellable barrier.  */
-  return gomp_team_barrier_wait_cancel (&team->barrier);
+  return gomp_team_barrier_wait_cancel (&team->barrier, thr->ts.team_id);
  }
diff --git a/libgomp/config/linux/bar.c b/libgomp/config/linux/bar.c
index 9eaec0e5f23..25f7e04dd16 100644
--- a/libgomp/config/linux/bar.c
+++ b/libgomp/config/linux/bar.c
@@ -29,15 +29,76 @@

  #include <limits.h>
  #include "wait.h"
+#include "futex_waitv.h"

+void
+gomp_barrier_ensure_last (gomp_barrier_t *bar, unsigned id,
+                         gomp_barrier_state_t state)
+{
+  gomp_assert (id == 0, "Calling ensure_last in thread %u", id);
+  gomp_assert (!(state & BAR_CANCELLED),
+              "BAR_CANCELLED set when using plain barrier: %u",
state);
+  unsigned tstate = state & BAR_GEN_MASK;
+  struct thread_lock_data *arr = bar->threadgens;
+  for (unsigned i = 1; i < bar->total; i++)
+    {
+      unsigned tmp;
+      do
+       {
+         do_wait ((int *) &arr[i].gen, tstate);
+         tmp = __atomic_load_n (&arr[i].gen, MEMMODEL_ACQUIRE);
+         gomp_assert (tmp == tstate || tmp == (tstate + BAR_INCR),
+                      "Thread-local state seen to be %u"
+                      " when global gens are %u",
+                      tmp, tstate);
+       }
+      while (tmp != (tstate + BAR_INCR));
+    }
+  gomp_assert_seenflags (bar, false);
+}

  void
-gomp_barrier_wait_end (gomp_barrier_t *bar, gomp_barrier_state_t
state)
+gomp_assert_and_increment_flag (gomp_barrier_t *bar, unsigned id,
unsigned gens)
  {
+  /* Our own personal flag, so don't need to atomically read it
except in the
+     case of the fallback code for when running on an older Linux
kernel
+     (without futex_waitv).  Do need to atomically update it for the
primary
+     thread to read and form an acquire-release synchronisation from
our thread
+     to the primary thread.  */
+  struct thread_lock_data *arr = bar->threadgens;
+  unsigned orig = __atomic_fetch_add (&arr[id].gen, BAR_INCR,
MEMMODEL_RELEASE);
+  futex_wake ((int *) &arr[id].gen, INT_MAX);
+  /* This clause is only to handle the fallback when `futex_waitv` is
not
+     available on the kernel we're running on.  For the logic of this
+     particular synchronisation see the comment in
+     `gomp_team_barrier_ensure_last`.  */
+  unsigned gen = gens & BAR_GEN_MASK;
+  if (__builtin_expect (orig == (gen | PRIMARY_WAITING_TG), 0))
+    {
+      __atomic_fetch_or (&bar->generation, BAR_SECONDARY_ARRIVED,
+                        MEMMODEL_RELEASE);
+      futex_wake ((int *) &bar->generation, INT_MAX);
+    }
+  else
+    {
+      gomp_assert (orig == gen, "Original flag %u != generation of
%u\n", orig,
+                  gen);
+    }
+}
+
+void
+gomp_barrier_wait_end (gomp_barrier_t *bar, gomp_barrier_state_t
state,
+                      unsigned id)
+{
+  gomp_assert (!(state & BAR_CANCELLED),
+              "BAR_CANCELLED set when using plain barrier: %u",
state);
    if (__builtin_expect (state & BAR_WAS_LAST, 0))
      {
-      /* Next time we'll be awaiting TOTAL threads again.  */
-      bar->awaited = bar->total;
+      gomp_assert (id == 0, "Id %u believes it is last\n", id);
+      /* Shouldn't have anything modifying bar->generation at this
point.  */
+      gomp_assert ((bar->generation & ~BAR_BOTH_GENS_MASK) == 0,
+                  "flags set in gomp_barrier_wait_end: %u",
+                  bar->generation & ~BAR_BOTH_GENS_MASK);
        __atomic_store_n (&bar->generation, bar->generation + BAR_INCR,
                         MEMMODEL_RELEASE);
        futex_wake ((int *) &bar->generation, INT_MAX);
@@ -51,9 +112,12 @@ gomp_barrier_wait_end (gomp_barrier_t *bar,
gomp_barrier_state_t state)
  }

  void
-gomp_barrier_wait (gomp_barrier_t *bar)
+gomp_barrier_wait (gomp_barrier_t *bar, unsigned id)
  {
-  gomp_barrier_wait_end (bar, gomp_barrier_wait_start (bar));
+  gomp_barrier_state_t state = gomp_barrier_wait_start (bar, id);
+  if (__builtin_expect (state & BAR_WAS_LAST, 0))
+    gomp_barrier_ensure_last (bar, id, state);
+  gomp_barrier_wait_end (bar, state, id);
  }

  /* Like gomp_barrier_wait, except that if the encountering thread
@@ -64,11 +128,29 @@ gomp_barrier_wait (gomp_barrier_t *bar)
     the barrier can be safely destroyed.  */

  void
-gomp_barrier_wait_last (gomp_barrier_t *bar)
+gomp_barrier_wait_last (gomp_barrier_t *bar, unsigned id)
  {
-  gomp_barrier_state_t state = gomp_barrier_wait_start (bar);
+  gomp_assert (id != 0, "Thread with id %u called
gomp_barrier_wait_last", id);
+  gomp_barrier_wait_start (bar, id);
+  /* N.b. For the intended use of this function we don't actually
need to do
+     the below.  `state & BAR_WAS_LAST` should never be non-zero.
The
+     assertion above checks that `id != 0` and that's the only case
when this
+     could be true.  However we write down what the code should be
if/when
+     `gomp_barrier_wait_last` gets used in a different way.
+
+     If we do want to provide the possibility for non-primary threads
to know
+     that this barrier can be destroyed we would need that non-
primary thread
+     to wait on the barrier (as it would having called
gomp_barrier_wait`, and
+     the primary thread to signal to it that all other threads have
arrived.
+     That would be done with the below (having gotten `state` from
the return
+     value of `gomp_barrier_wait_start` call above):
+
    if (state & BAR_WAS_LAST)
-    gomp_barrier_wait_end (bar, state);
+    {
+      gomp_barrier_ensure_last (bar, id, state);
+      gomp_barrier_wait_end (bar, state, id);
+    }
+     */
  }

  void
@@ -77,24 +159,231 @@ gomp_team_barrier_wake (gomp_barrier_t *bar, int
count)
    futex_wake ((int *) &bar->generation, count == 0 ? INT_MAX :
count);
  }

+/* We need to have some "ensure we're last, but perform useful work
+   in the meantime" function.  This allows the primary thread to
perform useful
+   work while it is waiting on all secondary threads.
+   - Again related to the fundamental difference between this barrier
and the
+     "centralized" barrier where the thread doing the bookkeeping is
+     pre-determined as some "primary" thread and that is not
necessarily the
+     last thread to enter the barrier.
+
+   To do that we loop through checking each of the other threads
flags.  If any
+   are not set then we take that opportunity to check the global
generation
+   number, if there's some task to handle then do so before going
back to
+   checking the remaining thread-local flags.
+
+   This approach means that the cancellable barriers work reasonably
naturally.
+   If we're checking the global generation flag then we can see when
it is
+   cancelled.  Hence `gomp_team_barrier_ensure_cancel_last` below
does
+   something very similar to this function here.
+
+   The "wait for wakeup" is a little tricky.  When there is nothing
for a
+   thread to do we usually call `futex_wait` on an address.  In this
case we
+   want to wait on one of two addresses being changed.  In Linux
kernels >=
+   5.16 there is the `futex_waitv` syscall which provides us exactly
that
+   functionality, but in older kernels we have to implement some
mechanism to
+   emulate the functionality.
+
+   On these older kernels (as a fallback implementation) we do the
following:
+   - Primary fetch_add's the PRIMARY_WAITING_TG bit to the thread-
local
+     generation number of the secondary it's waiting on.
+     - If primary sees that the BAR_INCR was added then secondary
reached
+       barrier as we decided to go to sleep waiting for it.
+       Clear the flag and continue.
+     - Otherwise primary futex_wait's on the generation number.
+       - Past this point the only state that this primary thread will
use
+        to indicate that this particular thread has arrived will be
+        BAR_SECONDARY_ARRIVED set on the barrier-global generation
number.
+        (the increment of the thread-local generation number no
longer means
+        anything to the primary thread, though we do assert that it's
done
+        in a checking build since that is still an invariant that
must hold
+        for later uses of this barrier).
+       - Can get woken up by a new task being added
(BAR_TASK_PENDING).
+       - Can get woken up by BAR_SECONDARY_ARRIVED bit flag in the
global
+        generation number saying "secondary you were waiting on has
+        arrived".
+        - In this case it clears that bit and returns to looking to
see if
+          all secondary threads have arrived.
+   Meanwhile secondary thread:
+   - Checks the value it gets in `gomp_assert_and_increment_flag`.
+     - If it has the PRIMARY_WAITING_TG flag then set
+       BAR_SECONDARY_ARRIVED on global generation number and
futex_wake
+       it.
+   Each secondary thread has to ignore BAR_SECONDARY_ARRIVED in their
loop
+   (will still get woken up by it, just continue around the loop
until it's
+   cleared).
+
+   N.b. using this fallback mechanism is the only place where any thread
+   modifies a thread-local generation number that is not its own.  This is
+   also the only place where a secondary thread would modify bar->generation
+   without a lock held.  Hence these modifications need a reasonable amount
+   of care w.r.t.  atomicity.  */
  void
-gomp_team_barrier_wait_end (gomp_barrier_t *bar, gomp_barrier_state_t
state)
+gomp_team_barrier_ensure_last (gomp_barrier_t *bar, unsigned id,
+                              gomp_barrier_state_t state)
+{
+  /* Thought process here:
+     - Until we are the last thread, we are "some thread waiting on
the
+       barrier".  Hence we should doing only a variant of what is
done in
+       `gomp_team_barrier_wait_end` for the other threads.
+     - The loop done in that function is:
+       1) Wait on `generation` having some flag.
+       2) When it changes, enforce the acquire-release semantics and
+         re-load.
+       3) If added `BAR_TASK_PENDING`, then handle some tasks.
+       4) Ignore `BAR_WAITING_FOR_TASK` flag.
+     - Hence the loop done in this function is:
+       1) Look through each thread-local generation number.
+         - When this is not incremented, then:
+           1. wait on `generation` having some flag.
+           2. When it does have such a flag, enforce the acquire-
release
+              semantics and load.
+           3. If added `BAR_TASK_PENDING` then handle some tasks.
+           4. Ignore `BAR_WAITING_FOR_TASK` not necessary because
it's us that
+              would set that flag.
+       Differences otherwise are that we put pausing in different
positions.  */
+  gomp_assert (id == 0, "Calling ensure_last in thread %u\n", id);
+  unsigned gstate = state & (BAR_BOTH_GENS_MASK | BAR_CANCELLED);
+  unsigned tstate = state & BAR_GEN_MASK;
+  struct thread_lock_data *arr = bar->threadgens;
+  for (unsigned i = 1; i < bar->total; i++)
+    {
+      unsigned long long j, count = spin_count ();
+
+    wait_on_this_thread:
+      /* Use `j <= count` here just in case `gomp_spin_count_var ==
0` (which
+        can happen with OMP_WAIT_POLICY=passive).  We need to go
around this
+        loop at least once to check and handle any changes.  */
+      for (j = 0; j <= count; j++)
+       {
+         /* Thought about using MEMMODEL_ACQUIRE or MEMMODEL_RELAXED
until we
+            see a difference and *then* MEMMODEL_ACQUIRE for the
+            acquire-release semantics.  Idea was that the more
relaxed memory
+            model might provide a performance boost.  Did not see any
+            improvement on a micro-benchmark and decided not to do
that.
+
+            TODO Question whether to run one loop checking both
variables, or
+                 two loops each checking one variable multiple times.
Suspect
+                 that one loop checking both variables is going to be
more
+                 responsive and more understandable in terms of
performance
+                 characteristics when specifying GOMP_SPINCOUNT.
+                 However there's the chance that checking each
variable
+                 something like 10000 times and going around this
loop
+                 count/10000 times could give better throughput?  */
+         unsigned int threadgen
+           = __atomic_load_n (&arr[i].gen, MEMMODEL_ACQUIRE);
+         /* If this thread local generation number has
PRIMARY_WAITING_TG set,
+            then we must have set it (primary thread is the only one
that sets
+            this flag).  The mechanism by which that would be set is:
+            1) We dropped into `futex_waitv` while waiting on this
thread
+            2) Some secondary thread woke us (either by setting
+               `BAR_TASK_PENDING` or the thread we were waiting on
setting
+               `BAR_SECONDARY_ARRIVED`).
+            3) We came out of `futex_wake` and started waiting on
this thread
+               again.
+            Go into the `bar->generation` clause below to reset state
around
+            here and we'll go back to this thing later.  N.b. having
+            `PRIMARY_WAITING_TG` set on a given thread local
generation flag
+            means that the "continue" flag for this thread is now
seeing
+            `BAR_SECONDARY_ARRIVED` set on the global generation
flag.  */
+         if (__builtin_expect (threadgen != tstate, 0)
+             && __builtin_expect (!(threadgen & PRIMARY_WAITING_TG),
1))
+           {
+             gomp_assert (threadgen == (tstate + BAR_INCR),
+                          "Thread-local state seen to be %u"
+                          " when global state was %u.\n",
+                          threadgen, tstate);
+             goto wait_on_next_thread;
+           }
+         /* Only need for MEMMODEL_ACQUIRE below is when using the
futex_waitv
+            fallback for older kernels.  When this happens we use the
+            BAR_SECONDARY_ARRIVED flag for synchronisation need to
ensure the
+            acquire-release synchronisation is formed from secondary
thread to
+            primary thread as per OpenMP flush requirements on a
barrier.  */
+         unsigned int gen
+           = __atomic_load_n (&bar->generation, MEMMODEL_ACQUIRE);
+         if (__builtin_expect (gen != gstate, 0))
+           {
+             /* If there are some tasks to perform then perform them.
*/
+             if (gen & BAR_TASK_PENDING)
+               gomp_barrier_handle_tasks (gstate, false);
+             if (gen & BAR_SECONDARY_ARRIVED)
+               {
+                 /* Secondary thread has arrived, clear the flag on
the
+                    thread local generation and the global
generation.
+                    Invariants are:
+                    - BAR_SECONDARY_ARRIVED is only ever set by the
secondary
+                      threads and only when their own thread-local
generation
+                      has the flag PRIMARY_WAITING_TG.
+                    - PRIMARY_WAITING_TG flag is only ever set by the
primary
+                      thread.
+                    - Release-Acquire ordering on the generation
number means
+                      that as far as this thread is concerned the
relevant
+                      secondary thread must have incremented its
generation.
+                    - When using the fallback the required "flush"
release
+                      semantics are provided by the Release-Acquire
+                      syncronisation on the generation number (with
+                      BAR_SECONDARY_ARRIVED). Otherwise it's provided
by the
+                      Release-Acquire ordering on the thread-local
generation.
+                 */
+                 __atomic_fetch_and (&bar->generation,
~BAR_SECONDARY_ARRIVED,
+                                     MEMMODEL_RELAXED);
+                 gomp_assert (
+                   (threadgen == ((tstate + BAR_INCR) |
PRIMARY_WAITING_TG))
+                     || (threadgen == (tstate | PRIMARY_WAITING_TG)),
+                   "Thread %d local generation is %d but expected"
+                   " PRIMARY_WAITING_TG set because bar->generation"
+                   " marked with SECONDARY (%d)",
+                   i, threadgen, gen);
+                 __atomic_fetch_and (&arr[i].gen,
~PRIMARY_WAITING_TG,
+                                     MEMMODEL_RELAXED);
+               }
+             /* There should be no other way in which this can be
different. */
+             gomp_assert (
+               (gen & BAR_CANCELLED) == (gstate & BAR_CANCELLED),
+               "Unnecessary looping due to mismatching BAR_CANCELLED"
+               " bar->generation: %u   state: %u",
+               gen, gstate);
+             gomp_assert (
+               !(gen & BAR_WAITING_FOR_TASK),
+               "BAR_WAITING_FOR_TASK set by non-primary thread
gen=%d", gen);
+             gomp_assert (
+               !gomp_barrier_state_is_incremented (gen, gstate,
BAR_INCR),
+               "Global state incremented by non-primary thread
gen=%d", gen);
+             goto wait_on_this_thread;
+           }
+       }
+      /* Neither thread-local generation number nor global generation
seem to
+        be changing.  Wait for one of them to change.  */
+      futex_waitv ((int *) &arr[i].gen, tstate, (int *) &bar->generation,
+                  gstate);
+      /* One of the above values has changed, go back to the start of
this loop
+       * and we can find out what it was and deal with it
accordingly.  */
+      goto wait_on_this_thread;
+
+    wait_on_next_thread:
+      continue;
+    }
+  gomp_assert_seenflags (bar, false);
+}
+
+void
+gomp_team_barrier_wait_end (gomp_barrier_t *bar, gomp_barrier_state_t
state,
+                           unsigned id)
  {
    unsigned int generation, gen;

    if (__builtin_expect (state & BAR_WAS_LAST, 0))
      {
-      /* Next time we'll be awaiting TOTAL threads again.  */
+      gomp_assert (id == 0, "Id %u believes it is last\n", id);
        struct gomp_thread *thr = gomp_thread ();
        struct gomp_team *team = thr->ts.team;
-
-      bar->awaited = bar->total;
        team->work_share_cancelled = 0;
        unsigned task_count
         = __atomic_load_n (&team->task_count, MEMMODEL_ACQUIRE);
        if (__builtin_expect (task_count, 0))
         {
-         gomp_barrier_handle_tasks (state);
+         gomp_barrier_handle_tasks (state, false);
           state &= ~BAR_WAS_LAST;
         }
        else
@@ -113,103 +402,376 @@ gomp_team_barrier_wait_end (gomp_barrier_t
*bar, gomp_barrier_state_t state)
      {
        do_wait ((int *) &bar->generation, generation);
        gen = __atomic_load_n (&bar->generation, MEMMODEL_ACQUIRE);
+      gomp_assert ((gen == state + BAR_INCR)
+                    || (gen & BAR_CANCELLED) == (generation &
BAR_CANCELLED)
+                    /* Can cancel a barrier when already gotten into
final
+                       implicit barrier at the end of a parallel
loop.
+                       This happens in `cancel-parallel-3.c`.
+                       In this case the above assertion does not hold
because
+                       We are waiting on the implicit barrier at the
end of a
+                       parallel region while some other thread is
performing
+                       work in that parallel region, hits a
+                       `#pragma omp cancel parallel`, and sets said
flag.  */
+                    || !(generation & BAR_CANCELLED),
+                  "Unnecessary looping due to BAR_CANCELLED diff"
+                  " gen: %u  generation: %u  id: %u",
+                  gen, generation, id);
        if (__builtin_expect (gen & BAR_TASK_PENDING, 0))
         {
-         gomp_barrier_handle_tasks (state);
+         gomp_barrier_handle_tasks (state, false);
           gen = __atomic_load_n (&bar->generation, MEMMODEL_ACQUIRE);
         }
+      /* These flags will not change until this barrier is completed.
+        Going forward we don't want to be continually waking up
checking for
+        whether this barrier has completed yet.
+
+        If the barrier is cancelled but there are tasks yet to
perform then
+        some thread will have used `gomp_barrier_handle_tasks` to go
through
+        all tasks and drop them.  */
        generation |= gen & BAR_WAITING_FOR_TASK;
+      generation |= gen & BAR_CANCELLED;
+      /* Other flags that may be set in `bar->generation` are:
+        1) BAR_SECONDARY_ARRIVED
+        2) BAR_SECONDARY_CANCELLABLE_ARRIVED
+        While we want to ignore these, they should be transient and
quickly
+        removed, hence we don't adjust our expected `generation`
accordingly.
+        TODO Would be good to benchmark both approaches.  */
      }
-  while (!gomp_barrier_state_is_incremented (gen, state));
+  while (!gomp_barrier_state_is_incremented (gen, state, false));
  }

  void
-gomp_team_barrier_wait (gomp_barrier_t *bar)
+gomp_team_barrier_wait (gomp_barrier_t *bar, unsigned id)
  {
-  gomp_team_barrier_wait_end (bar, gomp_barrier_wait_start (bar));
+  gomp_barrier_state_t state = gomp_barrier_wait_start (bar, id);
+  if (__builtin_expect (state & BAR_WAS_LAST, 0))
+    gomp_team_barrier_ensure_last (bar, id, state);
+  gomp_team_barrier_wait_end (bar, state, id);
  }

  void
-gomp_team_barrier_wait_final (gomp_barrier_t *bar)
+gomp_team_barrier_wait_final (gomp_barrier_t *bar, unsigned id)
  {
-  gomp_barrier_state_t state = gomp_barrier_wait_final_start (bar);
-  if (__builtin_expect (state & BAR_WAS_LAST, 0))
-    bar->awaited_final = bar->total;
-  gomp_team_barrier_wait_end (bar, state);
+  return gomp_team_barrier_wait (bar, id);
+}
+
+void
+gomp_assert_and_increment_cancel_flag (gomp_barrier_t *bar, unsigned
id,
+                                      unsigned gens)
+{
+  struct thread_lock_data *arr = bar->threadgens;
+  /* Because we have separate thread local generation numbers for
cancellable
+     barriers and non-cancellable barriers, we can use fetch_add in
the
+     cancellable thread-local generation number and just let the bits
roll over
+     into the BAR_INCR area.  This means we don't have to have a CAS
loop to
+     handle the futex_waitv_fallback possibility that the primary
thread
+     updates our thread-local variable at the same time as we do.  */
+  unsigned orig
+    = __atomic_fetch_add (&arr[id].cgen, BAR_CANCEL_INCR,
MEMMODEL_RELEASE);
+  futex_wake ((int *) &arr[id].cgen, INT_MAX);
+
+  /* However, using that fetch_add means that we need to mask out the
values
+     we're comparing against.  Since this masking is not a memory
operation we
+     believe that trade-off is good.  */
+  unsigned orig_cgen = orig & (BAR_CANCEL_GEN_MASK | BAR_FLAGS_MASK);
+  unsigned global_cgen = gens & BAR_CANCEL_GEN_MASK;
+  if (__builtin_expect (orig_cgen == (global_cgen |
PRIMARY_WAITING_TG), 0))
+    {
+      unsigned prev = __atomic_fetch_or (&bar->generation,
+
BAR_SECONDARY_CANCELLABLE_ARRIVED,
+                                        MEMMODEL_RELAXED);
+      /* Wait! The barrier got cancelled, this flag is nothing but
annoying
+        state to ignore in the non-cancellable barrier that will be
coming up
+        soon and is state that needs to be reset before any following
+        cancellable barrier is called.  */
+      if (prev & BAR_CANCELLED)
+       __atomic_fetch_and (&bar->generation,
+                           ~BAR_SECONDARY_CANCELLABLE_ARRIVED,
+                           MEMMODEL_RELAXED);
+      futex_wake ((int *) &bar->generation, INT_MAX);
+    }
+  else
+    {
+      gomp_assert (orig_cgen == global_cgen,
+                  "Id %u:  Original flag %u != generation of %u\n",
id,
+                  orig_cgen, global_cgen);
+    }
+}
+
+bool
+gomp_team_barrier_ensure_cancel_last (gomp_barrier_t *bar, unsigned
id,
+                                     gomp_barrier_state_t state)
+{
+  gomp_assert (id == 0, "Calling ensure_cancel_last in thread %u\n",
id);
+  unsigned gstate = state & BAR_BOTH_GENS_MASK;
+  unsigned tstate = state & BAR_CANCEL_GEN_MASK;
+  struct thread_lock_data *arr = bar->threadgens;
+  for (unsigned i = 1; i < bar->total; i++)
+    {
+      unsigned long long j, count = spin_count ();
+
+    wait_on_this_thread:
+      for (j = 0; j <= count; j++)
+       {
+         unsigned int threadgen
+           = __atomic_load_n (&arr[i].cgen, MEMMODEL_ACQUIRE);
+         /* Clear "overrun" bits -- spillover into the non-
cancellable
+            generation numbers that we are leaving around in order to
avoid
+            having to perform extra memory operations in this
barrier.   */
+         threadgen &= (BAR_CANCEL_GEN_MASK | BAR_FLAGS_MASK);
+
+         if (__builtin_expect (threadgen != tstate, 0)
+             && __builtin_expect (!(threadgen & PRIMARY_WAITING_TG),
1))
+           {
+             gomp_assert (threadgen == BAR_INCREMENT_CANCEL (tstate),
+                          "Thread-local state seen to be %u"
+                          " when global state was %u.\n",
+                          threadgen, tstate);
+             goto wait_on_next_thread;
+           }
+         unsigned int gen
+           = __atomic_load_n (&bar->generation, MEMMODEL_ACQUIRE);
+         if (__builtin_expect (gen != gstate, 0))
+           {
+             gomp_assert (
+               !(gen & BAR_WAITING_FOR_TASK),
+               "BAR_WAITING_FOR_TASK set in non-primary thread
gen=%d", gen);
+             gomp_assert (!gomp_barrier_state_is_incremented (gen,
gstate,
+
BAR_CANCEL_INCR),
+                          "Global state incremented by non-primary
thread "
+                          "gstate=%d gen=%d",
+                          gstate, gen);
+             if (gen & BAR_CANCELLED)
+               {
+                 if (__builtin_expect (threadgen &
PRIMARY_WAITING_TG, 0))
+                   __atomic_fetch_and (&arr[i].cgen,
~PRIMARY_WAITING_TG,
+                                       MEMMODEL_RELAXED);
+                 /* Don't check for BAR_SECONDARY_CANCELLABLE_ARRIVED
here.
+                    There are too many windows for race conditions if
+                    resetting here (essentially can't guarantee that
we'll
+                    catch it so appearing like we might is just
confusing).
+                    Instead the primary thread has to ignore that
flag in the
+                    next barrier (which will be a non-cancellable
barrier),
+                    and we'll eventually reset it either in the
thread that
+                    set `BAR_CANCELLED` or in the thread that set
+                    `BAR_SECONDARY_CANCELLABLE_ARRIVED`.  */
+                 return false;
+               }
+             if (gen & BAR_SECONDARY_CANCELLABLE_ARRIVED)
+               {
+                 __atomic_fetch_and (&bar->generation,
+
~BAR_SECONDARY_CANCELLABLE_ARRIVED,
+                                     MEMMODEL_RELAXED);
+                 gomp_assert (
+                   (threadgen
+                    == (BAR_INCREMENT_CANCEL (tstate) |
PRIMARY_WAITING_TG))
+                     || (threadgen == (tstate | PRIMARY_WAITING_TG)),
+                   "Thread %d local generation is %d but expected"
+                   " PRIMARY_WAITING_TG set because bar->generation"
+                   " marked with SECONDARY (%d)",
+                   i, threadgen, gen);
+                 __atomic_fetch_and (&arr[i].cgen,
~PRIMARY_WAITING_TG,
+                                     MEMMODEL_RELAXED);
+               }
+
+             if (gen & BAR_TASK_PENDING)
+               gomp_barrier_handle_tasks (gstate, false);
+             goto wait_on_this_thread;
+           }
+       }
+      futex_waitv ((int *) &arr[i].cgen, tstate, (int *) &bar->generation,
+                  gstate);
+      goto wait_on_this_thread;
+
+    wait_on_next_thread:
+      continue;
+    }
+  gomp_assert_seenflags (bar, true);
+  return true;
  }

  bool
  gomp_team_barrier_wait_cancel_end (gomp_barrier_t *bar,
-                                  gomp_barrier_state_t state)
+                                  gomp_barrier_state_t state,
unsigned id)
  {
    unsigned int generation, gen;
+  gomp_assert (
+    !(state & BAR_CANCELLED),
+    "gomp_team_barrier_wait_cancel_end called when barrier cancelled
state: %u",
+    state);

    if (__builtin_expect (state & BAR_WAS_LAST, 0))
      {
-      /* Next time we'll be awaiting TOTAL threads again.  */
-      /* BAR_CANCELLED should never be set in state here, because
-        cancellation means that at least one of the threads has been
-        cancelled, thus on a cancellable barrier we should never see
-        all threads to arrive.  */
        struct gomp_thread *thr = gomp_thread ();
        struct gomp_team *team = thr->ts.team;
-
-      bar->awaited = bar->total;
        team->work_share_cancelled = 0;
        unsigned task_count
         = __atomic_load_n (&team->task_count, MEMMODEL_ACQUIRE);
        if (__builtin_expect (task_count, 0))
         {
-         gomp_barrier_handle_tasks (state);
+         gomp_barrier_handle_tasks (state, true);
           state &= ~BAR_WAS_LAST;
         }
        else
         {
-         state += BAR_INCR - BAR_WAS_LAST;
+         state &= ~BAR_WAS_LAST;
+         state = BAR_INCREMENT_CANCEL (state);
           __atomic_store_n (&bar->generation, state,
MEMMODEL_RELEASE);
           futex_wake ((int *) &bar->generation, INT_MAX);
           return false;
         }
      }

-  if (__builtin_expect (state & BAR_CANCELLED, 0))
-    return true;
-
    generation = state;
    do
      {
        do_wait ((int *) &bar->generation, generation);
        gen = __atomic_load_n (&bar->generation, MEMMODEL_ACQUIRE);
        if (__builtin_expect (gen & BAR_CANCELLED, 0))
-       return true;
+       {
+         if (__builtin_expect ((gen & BAR_CANCEL_GEN_MASK)
+                                 != (state & BAR_CANCEL_GEN_MASK),
+                               0))
+           {
+             /* Have cancelled a barrier just after completing the
current
+                one.  We must not reset our local state.  */
+             gomp_assert (
+               (gen & BAR_CANCEL_GEN_MASK)
+                 == (BAR_INCREMENT_CANCEL (state) &
BAR_CANCEL_GEN_MASK),
+               "Incremented global generation (cancellable) more than
one"
+               " gen: %u  original state: %u",
+               gen, state);
+             /* Other threads could have continued on in between the
time that
+                the primary thread signalled all other threads should
wake up
+                and the time that we actually read `bar->generation`
above.
+                Any one of them could be performing tasks, also they
could be
+                waiting on the next barrier and could have set
+                BAR_SECONDARY{_CANCELLABLE,}_ARRIVED.
+
+                W.r.t. the task handling that could have been set
(primary
+                thread increments generation telling us to go, then
it
+                continues itself and finds an `omp task` directive
and
+                schedules a task) we need to move on with it
(PR122314).  */
+             gomp_assert (!(gen & BAR_WAITING_FOR_TASK),
+                          "Generation incremented while "
+                          " main thread is still waiting for tasks:
gen: %u  "
+                          "original state: %u",
+                          gen, state);
+             /* *This* barrier wasn't cancelled -- the next barrier
is
+                cancelled.  Returning `false` gives that information
back up
+                to calling functions.  */
+             return false;
+           }
+         /* Need to reset our thread-local generation.  Don't want
state to be
+            messed up the next time we hit a cancellable barrier.
+            Must do this atomically because the cancellation signal
could
+            happen from "somewhere else" while the primary thread has
decided
+            that it wants to wait on us -- if we're using the
fallback for
+            pre-5.16 Linux kernels.
+
+            We are helped by the invariant that when this barrier is
+            cancelled, the next barrier that will be entered is a
+            non-cancellable barrier.  That means we don't have to
worry about
+            the primary thread getting confused by this generation
being
+            incremented to say it's reached a cancellable barrier.
+
+            We do not reset the PRIMARY_WAITING_TG bit.  That is left
to the
+            primary thread.  We only subtract the BAR_CANCEL_INCR
that we
+            added before getting here.
+
+            N.b. Another approach to avoid problems with multiple
threads
+            modifying this thread-local generation could be for the
primary
+            thread to reset the thread-local generations once it's
"gathered"
+            all threads during the next non-cancellable barrier.
Such a reset
+            would not need to be atomic because we would know that
all threads
+            have already acted on their thread-local generation
number.
+
+            That approach mirrors the previous approach where the
primary
+            thread would reset `awaited`.  The problem with this is
that now
+            we have *many* generations to reset, the primary thread
can be the
+            bottleneck in a barrier (it performs the most work) and
putting
+            more work into the primary thread while all secondary
threads are
+            waiting on it seems problematic.  Moreover outside of the
+            futex_waitv_fallback the primary thread does not adjust
the
+            thread-local generations.  Maintaining that property
where
+            possible seems very worthwhile.  */
+         unsigned orig __attribute__ ((unused))
+         = __atomic_fetch_sub (&bar->threadgens[id].cgen,
BAR_CANCEL_INCR,
+                               MEMMODEL_RELAXED);
+#if _LIBGOMP_CHECKING_
+         unsigned orig_gen = (orig & BAR_CANCEL_GEN_MASK);
+         unsigned global_gen_plus1
+           = ((gen + BAR_CANCEL_INCR) & BAR_CANCEL_GEN_MASK);
+         gomp_assert (orig_gen == global_gen_plus1
+                        || orig_gen == (global_gen_plus1 |
PRIMARY_WAITING_TG),
+                      "Thread-local generation %u unknown
modification:"
+                      " expected %u (with possible PRIMARY* flags)
seen %u",
+                      id, (gen + BAR_CANCEL_INCR) &
BAR_CANCEL_GEN_MASK, orig);
+#endif
+         return true;
+       }
        if (__builtin_expect (gen & BAR_TASK_PENDING, 0))
         {
-         gomp_barrier_handle_tasks (state);
+         gomp_barrier_handle_tasks (state, true);
           gen = __atomic_load_n (&bar->generation, MEMMODEL_ACQUIRE);
         }
        generation |= gen & BAR_WAITING_FOR_TASK;
      }
-  while (!gomp_barrier_state_is_incremented (gen, state));
+  while (!gomp_barrier_state_is_incremented (gen, state, true));

    return false;
  }

  bool
-gomp_team_barrier_wait_cancel (gomp_barrier_t *bar)
+gomp_team_barrier_wait_cancel (gomp_barrier_t *bar, unsigned id)
  {
-  return gomp_team_barrier_wait_cancel_end (bar,
gomp_barrier_wait_start (bar));
+  gomp_barrier_state_t state = gomp_barrier_wait_cancel_start (bar,
id);
+
+  if (__builtin_expect (state & BAR_CANCELLED, 0))
+    return true;
+
+  if (__builtin_expect (state & BAR_WAS_LAST, 0)
+      && !gomp_team_barrier_ensure_cancel_last (bar, id, state))
+    {
+      gomp_reset_cancellable_primary_threadgen (bar, id);
+      return true;
+    }
+  return gomp_team_barrier_wait_cancel_end (bar, state, id);
  }

  void
  gomp_team_barrier_cancel (struct gomp_team *team)
  {
-  gomp_mutex_lock (&team->task_lock);
-  if (team->barrier.generation & BAR_CANCELLED)
-    {
-      gomp_mutex_unlock (&team->task_lock);
-      return;
-    }
-  team->barrier.generation |= BAR_CANCELLED;
-  gomp_mutex_unlock (&team->task_lock);
-  futex_wake ((int *) &team->barrier.generation, INT_MAX);
+  /* Always set CANCEL on the barrier.  Means we have one simple
atomic
+     operation.  Need an atomic operation because the barrier now
uses the
+     `generation` to communicate.
+
+     Don't need to add any memory-ordering here.  The only thing that
+     BAR_CANCELLED signifies is that the barrier is cancelled and the
only
+     thing that BAR_SECONDARY_CANCELLABLE_ARRIVED signifies is that
the
+     secondary thread has arrived at the barrier.  No thread infers
anything
+     about any other data having been set based on these flags.  */
+  unsigned orig = __atomic_fetch_or (&team->barrier.generation,
BAR_CANCELLED,
+                                    MEMMODEL_RELAXED);
+  /* We're cancelling a barrier and it's currently using the fallback
mechanism
+     instead of the futex_waitv syscall.  We need to ensure that
state gets
+     reset before the next cancellable barrier.
+
+     Any cancelled barrier is followed by a non-cancellable barrier
so just
+     ensuring this reset happens in one thread allows us to ensure
that it
+     happens before any thread reaches the next cancellable barrier.
+
+     This and the check in `gomp_assert_and_increment_cancel_flag`
are the two
+     ways in which we ensure `BAR_SECONDARY_CANCELLABLE_ARRIVED` is
reset by
+     the time that any thread arrives at the next cancellable
barrier.
+
+     We need both since that `BAR_SECONDARY_CANCELLABLE_ARRIVED` flag
could be
+     set *just* after this `BAR_CANCELLED` bit gets set (in which
case the
+     other case handles resetting) or before (in which case this
clause handles
+     resetting).  */
+  if (orig & BAR_SECONDARY_CANCELLABLE_ARRIVED)
+    __atomic_fetch_and (&team->barrier.generation,
+                       ~BAR_SECONDARY_CANCELLABLE_ARRIVED,
MEMMODEL_RELAXED);
+  if (!(orig & BAR_CANCELLED))
+    futex_wake ((int *) &team->barrier.generation, INT_MAX);
  }
diff --git a/libgomp/config/linux/bar.h b/libgomp/config/linux/bar.h
index faa03746d8f..dbd4b868418 100644
--- a/libgomp/config/linux/bar.h
+++ b/libgomp/config/linux/bar.h
@@ -32,14 +32,31 @@

  #include "mutex.h"

+/* Handy to have `cgen` and `gen` separate, since then we can use
+   `__atomic_fetch_add` directly on the `cgen` instead of having to
use a CAS
+   loop to increment the cancel generation bits.  If it weren't for
the
+   fallback for when `futex_waitv` is not available it wouldn't
matter but with
+   that fallback more than one thread can adjust these thread-local
generation
+   numbers and hence we have to be concerned about synchronisation
issues.  */
+struct __attribute__ ((aligned (64))) thread_lock_data
+{
+  unsigned gen;
+  unsigned cgen;
+};
+
  typedef struct
  {
    /* Make sure total/generation is in a mostly read cacheline, while
-     awaited in a separate cacheline.  */
+     awaited in a separate cacheline.  Each generation structure is in a
+     separate cache line too.  Put both cancellable and non-cancellable
+     generation numbers in the same cache line because they should both be
+     only ever modified by their corresponding thread (except when the
+     primary thread wants to wait on a given thread arriving at the barrier
+     and we're on an old Linux kernel).  */
    unsigned total __attribute__((aligned (64)));
+  unsigned allocated;
    unsigned generation;
-  unsigned awaited __attribute__((aligned (64)));
-  unsigned awaited_final;
+  struct thread_lock_data *threadgens;
  } gomp_barrier_t;

  typedef unsigned int gomp_barrier_state_t;
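As a reading aid for the structure above, here is a tiny stand-alone sketch
(illustrative only, not part of the patch: it spins instead of using
futex/futex_waitv, ignores tasks and cancellation, and bumps the generation
by 1 rather than BAR_INCR; build with -pthread) of the arrive/release
protocol that `threadgens` and `generation` implement:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NTHREADS 4

static _Atomic unsigned threadgens[NTHREADS]; /* per-thread arrive counters */
static _Atomic unsigned generation;           /* shared generation word */

static void
flat_barrier (unsigned id)
{
  unsigned gen = atomic_load_explicit (&generation, memory_order_acquire);
  if (id != 0)
    {
      /* Mark arrival in our own slot, then wait for the release.  */
      atomic_store_explicit (&threadgens[id], gen + 1, memory_order_release);
      while (atomic_load_explicit (&generation, memory_order_acquire) == gen)
        ; /* spin; the real code would futex_wait here */
    }
  else
    {
      /* Primary: wait for every secondary to bump its slot, then publish
         the next generation to release them all.  */
      for (unsigned i = 1; i < NTHREADS; i++)
        while (atomic_load_explicit (&threadgens[i], memory_order_acquire)
               != gen + 1)
          ; /* spin; the real code runs tasks / futex_waitv's here */
      atomic_store_explicit (&generation, gen + 1, memory_order_release);
    }
}

static void *
worker (void *arg)
{
  unsigned id = (unsigned) (size_t) arg;
  for (int round = 0; round < 3; round++)
    flat_barrier (id);
  return NULL;
}

int
main (void)
{
  pthread_t tids[NTHREADS];
  for (size_t i = 1; i < NTHREADS; i++)
    pthread_create (&tids[i], NULL, worker, (void *) i);
  worker ((void *) 0);
  for (size_t i = 1; i < NTHREADS; i++)
    pthread_join (tids[i], NULL);
  printf ("final generation = %u\n",
          atomic_load_explicit (&generation, memory_order_relaxed));
  return 0;
}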
@@ -48,140 +65,416 @@ typedef unsigned int gomp_barrier_state_t;
    low bits dedicated to flags.  Note that TASK_PENDING and WAS_LAST can
    share space because WAS_LAST is never stored back to generation.  */
  #define BAR_TASK_PENDING       1
+/* In this particular target BAR_WAS_LAST indicates something more like
+   "chosen by design to be last", but I like keeping the macro name the same
+   as it is in other targets.  */
  #define BAR_WAS_LAST           1
  #define BAR_WAITING_FOR_TASK   2
  #define BAR_CANCELLED          4
-#define BAR_INCR               8
+/* The BAR_SECONDARY_ARRIVED and PRIMARY_WAITING_TG flags are only used for
+   the fallback approach when `futex_waitv` is not available.  That syscall
+   should be available on all kernels newer than Linux 5.16.  */
+#define BAR_SECONDARY_ARRIVED 8
+#define BAR_SECONDARY_CANCELLABLE_ARRIVED 16
+/* Using bits 5 -> 10 for the generation number of cancellable barriers and
+   the remaining bits for the generation number of non-cancellable
+   barriers.  */
+#define BAR_CANCEL_INCR 32
+#define BAR_INCR 2048
+#define BAR_FLAGS_MASK (~(-BAR_CANCEL_INCR))
+#define BAR_GEN_MASK (-BAR_INCR)
+#define BAR_BOTH_GENS_MASK (-BAR_CANCEL_INCR)
+#define BAR_CANCEL_GEN_MASK (-BAR_CANCEL_INCR & (~(-BAR_INCR)))
+/* Increment by BAR_CANCEL_INCR, with wrapping arithmetic within the bits
+   assigned to this generation number.  I.e. increment, then set the bits
+   above BAR_INCR to what they were before.  */
+#define BAR_INCREMENT_CANCEL(X)                                              \
+  ({                                                                         \
+    __typeof__ (X) _X = (X);                                                 \
+    (((_X + BAR_CANCEL_INCR) & BAR_CANCEL_GEN_MASK)                          \
+     | (_X & ~BAR_CANCEL_GEN_MASK));                                         \
+  })
+/* The thread-local generation field similarly contains a counter in the high
+   bits and has a few low bits dedicated to flags.  None of the flags above
+   are used in the thread-local generation field.  Hence we can have a
+   different set of bits for a protocol between the primary thread and the
+   secondary threads.  */
+#define PRIMARY_WAITING_TG 1
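To make the bit assignments concrete, this small stand-alone program (not
part of the patch; it re-states the macros above, so it needs GCC's
statement-expression extension) checks that incrementing the cancellable
generation wraps inside bits 5-10 without disturbing the flags or the
non-cancellable generation:

#include <assert.h>
#include <stdio.h>

#define BAR_CANCEL_INCR 32
#define BAR_INCR 2048
#define BAR_CANCEL_GEN_MASK (-BAR_CANCEL_INCR & (~(-BAR_INCR)))
#define BAR_INCREMENT_CANCEL(X)                                              \
  ({                                                                         \
    __typeof__ (X) _X = (X);                                                 \
    (((_X + BAR_CANCEL_INCR) & BAR_CANCEL_GEN_MASK)                          \
     | (_X & ~BAR_CANCEL_GEN_MASK));                                         \
  })

int
main (void)
{
  /* bits 0-4: flags, bits 5-10: cancellable generation (0-63),
     bits 11-31: non-cancellable generation.  */
  unsigned gen = 5 * BAR_INCR + 63 * BAR_CANCEL_INCR + 4 /* BAR_CANCELLED */;
  unsigned next = BAR_INCREMENT_CANCEL (gen);
  /* The cancellable generation wrapped from 63 back to 0...  */
  assert ((next & BAR_CANCEL_GEN_MASK) == 0);
  /* ...while the flags and the non-cancellable generation are untouched.  */
  assert ((next & ~BAR_CANCEL_GEN_MASK) == (gen & ~BAR_CANCEL_GEN_MASK));
  printf ("0x%08x -> 0x%08x\n", gen, next);
  return 0;
}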

-static inline void gomp_barrier_init (gomp_barrier_t *bar, unsigned count)
+static inline void
+gomp_assert_seenflags (gomp_barrier_t *bar, bool cancellable)
  {
+#if _LIBGOMP_CHECKING_
+  unsigned gen = __atomic_load_n (&bar->generation, MEMMODEL_RELAXED);
+  struct thread_lock_data *arr = bar->threadgens;
+  unsigned cancel_incr = cancellable ? BAR_CANCEL_INCR : 0;
+  unsigned incr = cancellable ? 0 : BAR_INCR;
+  /* Assert that all threads have been seen.  */
+  for (unsigned i = 0; i < bar->total; i++)
+    {
+      gomp_assert (arr[i].gen == (gen & BAR_GEN_MASK) + incr,
+                  "Index %u generation is %u (global is %u)\n", i, arr[i].gen,
+                  gen);
+      gomp_assert ((arr[i].cgen & BAR_CANCEL_GEN_MASK)
+                    == ((gen + cancel_incr) & BAR_CANCEL_GEN_MASK),
+                  "Index %u cancel generation is %u (global is %u)\n", i,
+                  arr[i].cgen, gen);
+    }
+
+  /* Assert that generation numbers not corresponding to any thread are
+     cleared.  This helps us test code-paths.  */
+  for (unsigned i = bar->total; i < bar->allocated; i++)
+    {
+      gomp_assert (arr[i].gen == 0,
+                  "Index %u gen should be 0.  Is %u (global gen is %u)\n", i,
+                  arr[i].gen, gen);
+      gomp_assert (arr[i].cgen == 0,
+                  "Index %u gen should be 0.  Is %u (global gen is %u)\n", i,
+                  arr[i].cgen, gen);
+    }
+#endif
+}
+
+static inline void
+gomp_barrier_init (gomp_barrier_t *bar, unsigned count)
+{
+  bar->threadgens
+    = gomp_aligned_alloc (64, sizeof (bar->threadgens[0]) * count);
+  for (unsigned i = 0; i < count; ++i)
+    {
+      bar->threadgens[i].gen = 0;
+      bar->threadgens[i].cgen = 0;
+    }
    bar->total = count;
-  bar->awaited = count;
-  bar->awaited_final = count;
+  bar->allocated = count;
    bar->generation = 0;
  }

-static inline void gomp_barrier_reinit (gomp_barrier_t *bar, unsigned count)
+static inline bool
+gomp_barrier_has_space (gomp_barrier_t *bar, unsigned nthreads)
  {
-  __atomic_add_fetch (&bar->awaited, count - bar->total, MEMMODEL_ACQ_REL);
-  bar->total = count;
+  return nthreads <= bar->allocated;
+}
+
+static inline void
+gomp_barrier_minimal_reinit (gomp_barrier_t *bar, unsigned nthreads,
+                            unsigned num_new_threads)
+{
+  /* We are just increasing the number of threads by appending logically used
+     threads at "the end" of the team.  That essentially means we need more of
+     the `bar->threadgens` array to be logically used.  We set them all to the
+     current `generation` (marking that they are yet to hit this generation).
+
+     This function is only called after we have checked that there is enough
+     space in this barrier for the number of threads we want using it.  Hence
+     there's no serialisation needed.  */
+  gomp_assert (nthreads <= bar->allocated,
+              "minimal reinit on barrier with not enough space: "
+              "%u > %u",
+              nthreads, bar->allocated);
+  unsigned gen = bar->generation & BAR_GEN_MASK;
+  unsigned cancel_gen = bar->generation & BAR_CANCEL_GEN_MASK;
+  gomp_assert (bar->total == nthreads - num_new_threads,
+              "minimal_reinit called with incorrect state: %u != %u - %u\n",
+              bar->total, nthreads, num_new_threads);
+  for (unsigned i = bar->total; i < nthreads; i++)
+    {
+      bar->threadgens[i].gen = gen;
+      bar->threadgens[i].cgen = cancel_gen;
+    }
+  bar->total = nthreads;
+}
+
+/* When re-initialising a barrier we know the following:
+   1) We are waiting on a non-cancellable barrier.
+   2) The cancel generation bits are known consistent (having been tidied up
+      by each individual thread if the barrier got cancelled).  */
+static inline void
+gomp_barrier_reinit_1 (gomp_barrier_t *bar, unsigned nthreads,
+                      unsigned num_new_threads, unsigned long long *new_ids)
+{
+#if _LIBGOMP_CHECKING_
+  /* Assertions that this barrier is in a sensible state.
+     Everything is waiting on the standard barrier.
+     The current thread has not registered itself as arrived, but we tweak for
+     the current assertions.  */
+  bar->threadgens[0].gen += BAR_INCR;
+  gomp_assert_seenflags (bar, false);
+  bar->threadgens[0].gen -= BAR_INCR;
+  struct thread_lock_data threadgen_zero = bar->threadgens[0];
+#endif
+  if (!gomp_barrier_has_space (bar, nthreads))
+    {
+      /* Using `realloc` was not chosen.  Pros/cons below.
+        Pros of using `realloc`:
+        - May not actually have to move memory.
+        Cons of using `realloc`:
+        - If it does have to move memory, then it *also* copies data, and we
+          are going to overwrite that data in this function.  That copy would
+          be a waste.
+        - If it does have to move memory then the pointer may no longer be
+          aligned.  Would need bookkeeping for "pointer to free" and "pointer
+          to hold data".
+        Seems like the "bad" case of `realloc` is made even worse by what we
+        need here.  Would have to benchmark to figure out whether using
+        `realloc` or not is best.  Since we shouldn't be re-allocating very
+        often I'm choosing the simplest to code rather than the most optimal.
+
+        It does not matter that we have existing threads waiting on this
+        barrier.  They are all waiting on bar->generation and their
+        thread-local generation will not be looked at.  */
+      gomp_aligned_free (bar->threadgens);
+      bar->threadgens
+       = gomp_aligned_alloc (64, sizeof (bar->threadgens[0]) * nthreads);
+      bar->allocated = nthreads;
+    }
+
+  /* Re-initialise the existing values.  */
+  unsigned curgen = bar->generation & BAR_GEN_MASK;
+  unsigned cancel_curgen = bar->generation & BAR_CANCEL_GEN_MASK;
+  unsigned iter_len = nthreads;
+  unsigned bits_per_ull = sizeof (unsigned long long) * CHAR_BIT;
+#if _LIBGOMP_CHECKING_
+  /* If checking, zero out everything that's not going to be used in this
+     team.  This is only helpful for debugging (other assertions later can
+     ensure that we've gone through this path for adjusting the number of
+     threads, and when viewing the data structure in GDB we can easily
+     identify which generation numbers are in use).  When not running
+     assertions or running in the debugger these extra numbers are simply not
+     used.  */
+  iter_len = bar->allocated;
+  /* In the checking build just unconditionally reinitialise.  This handles
+     when the memory has moved and is harmless (except in performance, which
+     the checking build doesn't care about) otherwise.  */
+  bar->threadgens[0] = threadgen_zero;
+#endif
+  for (unsigned i = 1; i < iter_len; i++)
+    {
+      /* Re-initialisation.  Zero out the "remaining" elements in our wake
+        flag array when _LIBGOMP_CHECKING_ as a helper for our assertions to
+        check validity.  Set thread-specific generations to "seen" for `i's
+        corresponding to re-used threads, set thread-specific generations to
+        "not yet seen" for `i's corresponding to threads about to be
+        spawned.  */
+      unsigned newthr_val = i < nthreads ? curgen : 0;
+      unsigned newthr_cancel_val = i < nthreads ? cancel_curgen : 0;
+      unsigned index = i / bits_per_ull;
+      unsigned long long bitmask = (1ULL << (i % bits_per_ull));
+      bool bit_is_set = ((new_ids[index] & bitmask) != 0);
+      bar->threadgens[i].gen = bit_is_set ? curgen + BAR_INCR : newthr_val;
+      /* This is different because we only ever call this function while
+        threads are waiting on a non-cancellable barrier.  Hence "which
+        threads have arrived and which will be newly spawned" is not a
+        question.  */
+      bar->threadgens[i].cgen = newthr_cancel_val;
+    }
+  bar->total = nthreads;
  }

  static inline void gomp_barrier_destroy (gomp_barrier_t *bar)
  {
+  gomp_aligned_free (bar->threadgens);
  }

-extern void gomp_barrier_wait (gomp_barrier_t *);
-extern void gomp_barrier_wait_last (gomp_barrier_t *);
-extern void gomp_barrier_wait_end (gomp_barrier_t *,
gomp_barrier_state_t);
-extern void gomp_team_barrier_wait (gomp_barrier_t *);
-extern void gomp_team_barrier_wait_final (gomp_barrier_t *);
-extern void gomp_team_barrier_wait_end (gomp_barrier_t *,
-                                       gomp_barrier_state_t);
-extern bool gomp_team_barrier_wait_cancel (gomp_barrier_t *);
+static inline void
+gomp_barrier_reinit_2 (gomp_barrier_t __attribute__ ((unused)) * bar,
+                      unsigned __attribute__ ((unused)) nthreads) {};
+extern void gomp_barrier_wait (gomp_barrier_t *, unsigned);
+extern void gomp_barrier_wait_last (gomp_barrier_t *, unsigned);
+extern void gomp_barrier_wait_end (gomp_barrier_t *,
gomp_barrier_state_t,
+                                  unsigned);
+extern void gomp_team_barrier_wait (gomp_barrier_t *, unsigned);
+extern void gomp_team_barrier_wait_final (gomp_barrier_t *,
unsigned);
+extern void gomp_team_barrier_wait_end (gomp_barrier_t *,
gomp_barrier_state_t,
+                                       unsigned);
+extern bool gomp_team_barrier_wait_cancel (gomp_barrier_t *,
unsigned);
  extern bool gomp_team_barrier_wait_cancel_end (gomp_barrier_t *,
-                                              gomp_barrier_state_t);
+                                              gomp_barrier_state_t,
unsigned);
  extern void gomp_team_barrier_wake (gomp_barrier_t *, int);
  struct gomp_team;
  extern void gomp_team_barrier_cancel (struct gomp_team *);
+extern void gomp_team_barrier_ensure_last (gomp_barrier_t *,
unsigned,
+                                          gomp_barrier_state_t);
+extern void gomp_barrier_ensure_last (gomp_barrier_t *, unsigned,
+                                     gomp_barrier_state_t);
+extern bool gomp_team_barrier_ensure_cancel_last (gomp_barrier_t *,
unsigned,
+
gomp_barrier_state_t);
+extern void gomp_assert_and_increment_flag (gomp_barrier_t *,
unsigned,
+                                           unsigned);
+extern void gomp_assert_and_increment_cancel_flag (gomp_barrier_t *,
unsigned,
+                                                  unsigned);

  static inline gomp_barrier_state_t
-gomp_barrier_wait_start (gomp_barrier_t *bar)
+gomp_barrier_wait_start (gomp_barrier_t *bar, unsigned id)
  {
+  /* TODO I don't believe this MEMMODEL_ACQUIRE is needed.
+     Look into it later.  The point being that this should only ever read a
+     value from the last barrier or from tasks/cancellation/etc.  There was
+     already an acquire-release ordering at exit of the last barrier, and all
+     setting of tasks/cancellation etc. is done with the RELAXED memory model,
+     so using ACQUIRE doesn't help.
+
+     See the corresponding comment in `gomp_team_barrier_cancel` when thinking
+     about this.  */
    unsigned int ret = __atomic_load_n (&bar->generation, MEMMODEL_ACQUIRE);
-  ret &= -BAR_INCR | BAR_CANCELLED;
-  /* A memory barrier is needed before exiting from the various forms
-     of gomp_barrier_wait, to satisfy OpenMP API version 3.1 section
-     2.8.6 flush Construct, which says there is an implicit flush during
-     a barrier region.  This is a convenient place to add the barrier,
-     so we use MEMMODEL_ACQ_REL here rather than MEMMODEL_ACQUIRE.  */
-  if (__atomic_add_fetch (&bar->awaited, -1, MEMMODEL_ACQ_REL) == 0)
+  ret &= (BAR_BOTH_GENS_MASK | BAR_CANCELLED);
+#if !_LIBGOMP_CHECKING_
+  if (id != 0)
+#endif
+    /* Increment the local flag.  For thread id 0 this doesn't communicate
+       anything to *other* threads, but it is useful for debugging
+       purposes.  */
+    gomp_assert_and_increment_flag (bar, id, ret);
+
+  if (id == 0)
      ret |= BAR_WAS_LAST;
    return ret;
  }

  static inline gomp_barrier_state_t
-gomp_barrier_wait_cancel_start (gomp_barrier_t *bar)
-{
-  return gomp_barrier_wait_start (bar);
-}
-
-/* This is like gomp_barrier_wait_start, except it decrements
-   bar->awaited_final rather than bar->awaited and should be used
-   for the gomp_team_end barrier only.  */
-static inline gomp_barrier_state_t
-gomp_barrier_wait_final_start (gomp_barrier_t *bar)
+gomp_barrier_wait_cancel_start (gomp_barrier_t *bar, unsigned id)
  {
    unsigned int ret = __atomic_load_n (&bar->generation,
MEMMODEL_ACQUIRE);
-  ret &= -BAR_INCR | BAR_CANCELLED;
-  /* See above gomp_barrier_wait_start comment.  */
-  if (__atomic_add_fetch (&bar->awaited_final, -1, MEMMODEL_ACQ_REL)
== 0)
+  ret &= BAR_BOTH_GENS_MASK | BAR_CANCELLED;
+  if (!(ret & BAR_CANCELLED)
+#if !_LIBGOMP_CHECKING_
+      && id != 0
+#endif
+  )
+    gomp_assert_and_increment_cancel_flag (bar, id, ret);
+  if (id == 0)
      ret |= BAR_WAS_LAST;
    return ret;
  }

+static inline void
+gomp_reset_cancellable_primary_threadgen (gomp_barrier_t *bar,
unsigned id)
+{
+#if _LIBGOMP_CHECKING_
+  gomp_assert (id == 0,
+              "gomp_reset_cancellable_primary_threadgen called with "
+              "non-primary thread id: %u",
+              id);
+  unsigned orig = __atomic_fetch_sub (&bar->threadgens[id].cgen,
+                                     BAR_CANCEL_INCR,
MEMMODEL_RELAXED);
+  unsigned orig_gen = (orig & BAR_CANCEL_GEN_MASK);
+  unsigned gen = __atomic_load_n (&bar->generation,
MEMMODEL_RELAXED);
+  unsigned global_gen_plus1 = ((gen + BAR_CANCEL_INCR) &
BAR_CANCEL_GEN_MASK);
+  gomp_assert (orig_gen == global_gen_plus1,
+              "Thread-local generation %u unknown modification:"
+              " expected %u seen %u",
+              id, (gen + BAR_CANCEL_GEN_MASK) & BAR_CANCEL_GEN_MASK,
orig);
+#endif
+}
+
  static inline bool
  gomp_barrier_last_thread (gomp_barrier_state_t state)
  {
    return state & BAR_WAS_LAST;
  }

-/* All the inlines below must be called with team->task_lock
-   held.  */
+/* All the inlines below must be called with team->task_lock held.  However
+   with the `futex_waitv` fallback there can still be contention on
+   `bar->generation`.  For the RMW operations it is obvious that we need to
+   perform these atomically.  For the load in `gomp_barrier_has_completed` and
+   `gomp_team_barrier_cancelled` the need is tied to the C specification.
+
+   On the architectures I have a reasonable grasp of (x86_64 and AArch64)
+   there would be no real downside to setting a bit in one thread while
+   reading in another.  However the C definition of a data race doesn't have
+   any such leeway, and to avoid UB we need to load atomically.  */

  static inline void
  gomp_team_barrier_set_task_pending (gomp_barrier_t *bar)
  {
-  bar->generation |= BAR_TASK_PENDING;
+  __atomic_fetch_or (&bar->generation, BAR_TASK_PENDING,
MEMMODEL_RELAXED);
  }

  static inline void
  gomp_team_barrier_clear_task_pending (gomp_barrier_t *bar)
  {
-  bar->generation &= ~BAR_TASK_PENDING;
+  __atomic_fetch_and (&bar->generation, ~BAR_TASK_PENDING,
MEMMODEL_RELAXED);
  }

  static inline void
  gomp_team_barrier_set_waiting_for_tasks (gomp_barrier_t *bar)
  {
-  bar->generation |= BAR_WAITING_FOR_TASK;
+  __atomic_fetch_or (&bar->generation, BAR_WAITING_FOR_TASK,
MEMMODEL_RELAXED);
  }

  static inline bool
  gomp_team_barrier_waiting_for_tasks (gomp_barrier_t *bar)
  {
-  return (bar->generation & BAR_WAITING_FOR_TASK) != 0;
+  unsigned gen = __atomic_load_n (&bar->generation,
MEMMODEL_RELAXED);
+  return (gen & BAR_WAITING_FOR_TASK) != 0;
  }

  static inline bool
  gomp_team_barrier_cancelled (gomp_barrier_t *bar)
  {
-  return __builtin_expect ((bar->generation & BAR_CANCELLED) != 0,
0);
+  unsigned gen = __atomic_load_n (&bar->generation,
MEMMODEL_RELAXED);
+  return __builtin_expect ((gen & BAR_CANCELLED) != 0, 0);
+}
+
+static inline unsigned
+gomp_increment_gen (gomp_barrier_state_t state, bool use_cancel)
+{
+  unsigned gens = (state & BAR_BOTH_GENS_MASK);
+  return use_cancel ? BAR_INCREMENT_CANCEL (gens) : gens + BAR_INCR;
  }

  static inline void
-gomp_team_barrier_done (gomp_barrier_t *bar, gomp_barrier_state_t
state)
+gomp_team_barrier_done (gomp_barrier_t *bar, gomp_barrier_state_t
state,
+                       bool use_cancel)
  {
    /* Need the atomic store for acquire-release synchronisation with
the
       load in `gomp_team_barrier_wait_{cancel_,}end`.  See PR112356
*/
-  __atomic_store_n (&bar->generation, (state & -BAR_INCR) + BAR_INCR,
-                   MEMMODEL_RELEASE);
+  unsigned next = gomp_increment_gen (state, use_cancel);
+  __atomic_store_n (&bar->generation, next, MEMMODEL_RELEASE);
  }

  static inline bool
  gomp_barrier_state_is_incremented (gomp_barrier_state_t gen,
-                                  gomp_barrier_state_t state)
+                                  gomp_barrier_state_t state, bool use_cancel)
  {
-  unsigned next_state = (state & -BAR_INCR) + BAR_INCR;
-  return next_state > state ? gen >= next_state : gen < state;
+  unsigned next = gomp_increment_gen (state, use_cancel);
+  return next > state ? gen >= next : gen < state;
  }
 
  static inline bool
-gomp_barrier_has_completed (gomp_barrier_state_t state, gomp_barrier_t *bar)
+gomp_barrier_has_completed (gomp_barrier_state_t state, gomp_barrier_t *bar,
+                           bool use_cancel)
  {
    /* Handling overflow in the generation.  The "next" state could be less
       than or greater than the current one.  */
-  return gomp_barrier_state_is_incremented (bar->generation, state);
+  unsigned curgen = __atomic_load_n (&bar->generation, MEMMODEL_RELAXED);
+  return gomp_barrier_state_is_incremented (curgen, state, use_cancel);
+}
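A quick worked example of the wrap-around test above (stand-alone and not
part of the patch; it only exercises the non-cancellable generation):

#include <assert.h>
#include <limits.h>

#define BAR_INCR 2048u

static int
is_incremented (unsigned gen, unsigned state)
{
  /* Same shape as gomp_barrier_state_is_incremented: if state + BAR_INCR
     wrapped past UINT_MAX, "completed" means the observed generation is now
     *smaller* than state.  */
  unsigned next = state + BAR_INCR;
  return next > state ? gen >= next : gen < state;
}

int
main (void)
{
  /* Normal case: the generation moved forward one step.  */
  assert (is_incremented (3 * BAR_INCR, 2 * BAR_INCR));
  assert (!is_incremented (2 * BAR_INCR, 2 * BAR_INCR));
  /* Wrap-around case: state is the top generation, so next wraps to 0.  */
  unsigned state = -BAR_INCR;
  assert (is_incremented (0, state));
  assert (!is_incremented (state, state));
  return 0;
}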
+
+static inline void
+gomp_barrier_prepare_reinit (gomp_barrier_t *bar, unsigned id)
+{
+  gomp_assert (id == 0,
+              "gomp_barrier_prepare_reinit called in non-primary
thread: %u",
+              id);
+  /* This use of `gomp_barrier_wait_start` is worth note.
+     1) We're running in `id == 0`, which means that without checking
we'll
+       essentially just load `bar->generation`.
+     2) In this case there's no need to form any release-acquire
ordering.  The
+       `gomp_barrier_ensure_last` call below will form a release-
acquire
+       ordering between each secondary thread and this one, and that
will be
+       from some point after all uses of the barrier that we care
about.
+     3) However, in the checking builds, it's very useful to call
+       `gomp_assert_and_increment_flag` in order to provide extra
guarantees
+       about what we're doing.  */
+  gomp_barrier_state_t state = gomp_barrier_wait_start (bar, id);
+  gomp_barrier_ensure_last (bar, id, state);
+#if _LIBGOMP_CHECKING_
+  /* When checking, `gomp_assert_and_increment_flag` will have
incremented the
+     generation flag.  However later on down the line we'll be
calling the full
+     barrier again and we need to decrement that flag ready for that.
We still
+     *want* the flag to have been incremented above so that the
assertions in
+     `gomp_barrier_ensure_last` all work.
+
+     When not checking, this increment/decrement/increment again
cycle is not
+     performed.  */
+  bar->threadgens[0].gen -= BAR_INCR;
+#endif
  }

  #endif /* GOMP_BARRIER_H */
diff --git a/libgomp/config/linux/futex_waitv.h
b/libgomp/config/linux/futex_waitv.h
new file mode 100644
index 00000000000..c9780b5e9f0
--- /dev/null
+++ b/libgomp/config/linux/futex_waitv.h
@@ -0,0 +1,129 @@
+/* Copyright The GNU Toolchain Authors.
+
+   This file is part of the GNU Offloading and Multi Processing
Library
+   (libgomp).
+
+   Libgomp is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3, or (at your
option)
+   any later version.
+
+   Libgomp is distributed in the hope that it will be useful, but
WITHOUT ANY
+   WARRANTY; without even the implied warranty of MERCHANTABILITY or
FITNESS
+   FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+   more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception,
version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License
and
+   a copy of the GCC Runtime Library Exception along with this
program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not,
see
+   <http://www.gnu.org/licenses/>.  */
+
+/* Only defining the interface that we need.  `waitv` can take many more
+   addresses but we only use two.  We define a fallback for when
+   `futex_waitv` is not available.  This is rather than defining another
+   target in the config/ directory that looks much the same as the linux one
+   except for the parts handling `futex_waitv`, `PRIMARY_WAITING_TG`, and
+   `BAR_SECONDARY_{CANCELLABLE_,}ARRIVED`.  */
+
+#define _GNU_SOURCE
+#include <sys/syscall.h>
+
+#ifdef SYS_futex_waitv
+#pragma GCC visibility push(default)
+
+#include <unistd.h>
+#include <stdint.h>
+#include <string.h>
+#include <errno.h>
+#include <time.h>
+
+#pragma GCC visibility pop
+
+struct futex_waitv
+{
+  uint64_t val;
+  uint64_t uaddr;
+  uint32_t flags;
+  uint32_t __reserved;
+};
+
+static inline void
+futex_waitv (int *addr, int val, int *addr2, int val2)
+{
+  gomp_assert (gomp_thread ()->ts.team_id == 0,
+              "Called futex_waitv from secondary thread %d\n",
+              gomp_thread ()->ts.team_id);
+  struct futex_waitv addrs[2];
+  addrs[0].val = val;
+  addrs[0].uaddr = (uint64_t) (uintptr_t) addr;
+  /* Using same internally-defined flags as futex.h does.  These are
defined in
+     wait.h.  */
+  addrs[0].flags = FUTEX_PRIVATE_FLAG | FUTEX_32;
+  addrs[0].__reserved = 0;
+  addrs[1].val = val2;
+  addrs[1].uaddr = (uint64_t) (uintptr_t) addr2;
+  addrs[1].flags = FUTEX_PRIVATE_FLAG | FUTEX_32;
+  addrs[1].__reserved = 0;
+  int err __attribute__ ((unused))
+  = syscall (SYS_futex_waitv, addrs, 2, 0, NULL, CLOCK_MONOTONIC);
+  /* If a signal woke us then we simply leave and let the loop
outside of us
+     handle it.  We never require knowledge about whether anything
changed or
+     not.  */
+  gomp_assert (err >= 0 || errno == EAGAIN || errno == EINTR,
+              "Failed with futex_waitv err = %d, message: %s", err,
+              strerror (errno));
+}
+
+#else
+
+static inline void
+futex_waitv (int *addr, int val, int *addr2, int val2)
+{
+  int threadlocal
+    = __atomic_fetch_or (addr, PRIMARY_WAITING_TG, MEMMODEL_RELAXED);
+  /* futex_wait can be woken up because of a BAR_TASK_PENDING being set or
+     the like.  In that case we might come back here after checking
+     variables again.
+     If in between checking variables and coming back here the other thread
+     arrived, this variable could have been incremented.
+     Hence the possible values are:
+     - val
+     - val + BAR_INCR
+     - val | PRIMARY_WAITING_TG
+     - val + BAR_INCR | PRIMARY_WAITING_TG
+
+     If the `PRIMARY_WAITING_TG` flag is set, then the trigger for "we can
+     proceed" is now `BAR_SECONDARY_ARRIVED` being set on the generation
+     number.  */
+  if (__builtin_expect (threadlocal != val, 0)
+      && !(threadlocal & PRIMARY_WAITING_TG))
+    {
+      /* The secondary thread reached this point before us.  We know that the
+        secondary will not modify this variable again until we've "released"
+        it from this barrier.  Hence we can simply reset the thread-local
+        variable and continue.
+
+        It's worth mentioning that this implementation interacts directly
+        with what is handled in bar.c.  That's not a great separation of
+        concerns.  I believe I need things that way, but it would be nice if
+        I could make the separation neater.  ??? That also might allow
+        passing some information down about whether we're working on the
+        cancellable or non-cancellable generation numbers.  Then we would be
+        able to restrict the below assertion to the only value that is valid
+        (for whichever set of generation numbers we have).  */
+      gomp_assert (threadlocal == (val + BAR_INCR)
+                    || ((threadlocal & BAR_CANCEL_GEN_MASK)
+                        == (BAR_INCREMENT_CANCEL (val) & BAR_CANCEL_GEN_MASK)),
+                  "threadlocal generation number odd:  %d  (expected %d)",
+                  threadlocal, val);
+      __atomic_store_n (addr, threadlocal, MEMMODEL_RELAXED);
+      return;
+    }
+  futex_wait (addr2, val2);
+}
+
+#endif
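For anyone unfamiliar with the syscall, the sketch below (stand-alone, not
part of the patch; it assumes a 5.16+ kernel and a glibc that defines
SYS_futex_waitv, and renames the struct so it cannot clash with a definition
in linux/futex.h) shows the two-address wait the primary thread relies on:
return as soon as either the per-thread generation or the shared generation
word stops holding its expected value.

#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>
#include <stdint.h>
#include <time.h>
#include <linux/futex.h>

#ifndef FUTEX_32
#define FUTEX_32 2
#endif

struct waitv_entry
{
  uint64_t val;
  uint64_t uaddr;
  uint32_t flags;
  uint32_t __reserved;
};

/* Block until *threadgen != tg_val or *gen != gen_val (or a wake/signal).
   EAGAIN/EINTR are left for the caller's outer loop to deal with, exactly as
   in the wrapper above.  */
static void
wait_for_either (int *threadgen, int tg_val, int *gen, int gen_val)
{
  struct waitv_entry w[2] = {
    { .val = (uint32_t) tg_val, .uaddr = (uintptr_t) threadgen,
      .flags = FUTEX_PRIVATE_FLAG | FUTEX_32, .__reserved = 0 },
    { .val = (uint32_t) gen_val, .uaddr = (uintptr_t) gen,
      .flags = FUTEX_PRIVATE_FLAG | FUTEX_32, .__reserved = 0 },
  };
  syscall (SYS_futex_waitv, w, 2, 0, NULL, CLOCK_MONOTONIC);
}

int
main (void)
{
  int tg = 1, g = 0;
  /* tg already differs from the expected 0, so this returns immediately.  */
  wait_for_either (&tg, 0, &g, 0);
  return 0;
}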
diff --git a/libgomp/config/linux/wait.h b/libgomp/config/linux/wait.h
index c15e035dd56..c197da5ea24 100644
--- a/libgomp/config/linux/wait.h
+++ b/libgomp/config/linux/wait.h
@@ -35,6 +35,8 @@

  #define FUTEX_WAIT     0
  #define FUTEX_WAKE     1
+
+#define FUTEX_32 2
  #define FUTEX_PRIVATE_FLAG     128

  #ifdef HAVE_ATTRIBUTE_VISIBILITY
@@ -45,14 +47,21 @@ extern int gomp_futex_wait, gomp_futex_wake;

  #include <futex.h>

-static inline int do_spin (int *addr, int val)
+static inline unsigned long long
+spin_count ()
  {
-  unsigned long long i, count = gomp_spin_count_var;
-
+  unsigned long long count = gomp_spin_count_var;
    if (__builtin_expect (__atomic_load_n (&gomp_managed_threads,
                                           MEMMODEL_RELAXED)
                          > gomp_available_cpus, 0))
      count = gomp_throttled_spin_count_var;
+  return count;
+}
+
+static inline int
+do_spin (int *addr, int val)
+{
+  unsigned long long i, count = spin_count ();
    for (i = 0; i < count; i++)
      if (__builtin_expect (__atomic_load_n (addr, MEMMODEL_RELAXED) !=
val, 0))
        return 0;
diff --git a/libgomp/config/posix/bar.c b/libgomp/config/posix/bar.c
index a86b2f38c2d..6c8a4c6d7d2 100644
--- a/libgomp/config/posix/bar.c
+++ b/libgomp/config/posix/bar.c
@@ -105,13 +105,14 @@ gomp_barrier_wait_end (gomp_barrier_t *bar,
gomp_barrier_state_t state)
  }

  void
-gomp_barrier_wait (gomp_barrier_t *barrier)
+gomp_barrier_wait (gomp_barrier_t *barrier, unsigned id)
  {
-  gomp_barrier_wait_end (barrier, gomp_barrier_wait_start (barrier));
+  gomp_barrier_wait_end (barrier, gomp_barrier_wait_start (barrier,
id));
  }

  void
-gomp_team_barrier_wait_end (gomp_barrier_t *bar, gomp_barrier_state_t
state)
+gomp_team_barrier_wait_end (gomp_barrier_t *bar, gomp_barrier_state_t
state,
+                           unsigned id __attribute__ ((unused)))
  {
    unsigned int n;

@@ -127,7 +128,7 @@ gomp_team_barrier_wait_end (gomp_barrier_t *bar,
gomp_barrier_state_t state)
         = __atomic_load_n (&team->task_count, MEMMODEL_ACQUIRE);
        if (task_count)
         {
-         gomp_barrier_handle_tasks (state);
+         gomp_barrier_handle_tasks (state, false);
           if (n > 0)
             gomp_sem_wait (&bar->sem2);
           gomp_mutex_unlock (&bar->mutex1);
@@ -154,7 +155,7 @@ gomp_team_barrier_wait_end (gomp_barrier_t *bar,
gomp_barrier_state_t state)
           gen = __atomic_load_n (&bar->generation, MEMMODEL_ACQUIRE);
           if (gen & BAR_TASK_PENDING)
             {
-             gomp_barrier_handle_tasks (state);
+             gomp_barrier_handle_tasks (state, false);
               gen = __atomic_load_n (&bar->generation,
MEMMODEL_ACQUIRE);
             }
         }
@@ -175,7 +176,8 @@ gomp_team_barrier_wait_end (gomp_barrier_t *bar,
gomp_barrier_state_t state)

  bool
  gomp_team_barrier_wait_cancel_end (gomp_barrier_t *bar,
-                                  gomp_barrier_state_t state)
+                                  gomp_barrier_state_t state,
+                                  unsigned id __attribute__
((unused)))
  {
    unsigned int n;

@@ -191,7 +193,7 @@ gomp_team_barrier_wait_cancel_end (gomp_barrier_t
*bar,
         = __atomic_load_n (&team->task_count, MEMMODEL_ACQUIRE);
        if (task_count)
         {
-         gomp_barrier_handle_tasks (state);
+         gomp_barrier_handle_tasks (state, true);
           if (n > 0)
             gomp_sem_wait (&bar->sem2);
           gomp_mutex_unlock (&bar->mutex1);
@@ -226,7 +228,7 @@ gomp_team_barrier_wait_cancel_end (gomp_barrier_t
*bar,
             break;
           if (gen & BAR_TASK_PENDING)
             {
-             gomp_barrier_handle_tasks (state);
+             gomp_barrier_handle_tasks (state, true);
               gen = __atomic_load_n (&bar->generation,
MEMMODEL_ACQUIRE);
               if (gen & BAR_CANCELLED)
                 break;
@@ -251,9 +253,10 @@ gomp_team_barrier_wait_cancel_end (gomp_barrier_t
*bar,
  }

  void
-gomp_team_barrier_wait (gomp_barrier_t *barrier)
+gomp_team_barrier_wait (gomp_barrier_t *barrier, unsigned id)
  {
-  gomp_team_barrier_wait_end (barrier, gomp_barrier_wait_start
(barrier));
+  gomp_team_barrier_wait_end (barrier, gomp_barrier_wait_start
(barrier, id),
+                             id);
  }

  void
@@ -266,10 +269,10 @@ gomp_team_barrier_wake (gomp_barrier_t *bar, int
count)
  }

  bool
-gomp_team_barrier_wait_cancel (gomp_barrier_t *bar)
+gomp_team_barrier_wait_cancel (gomp_barrier_t *bar, unsigned id)
  {
-  gomp_barrier_state_t state = gomp_barrier_wait_cancel_start (bar);
-  return gomp_team_barrier_wait_cancel_end (bar, state);
+  gomp_barrier_state_t state = gomp_barrier_wait_cancel_start (bar,
id);
+  return gomp_team_barrier_wait_cancel_end (bar, state, id);
  }

  void
diff --git a/libgomp/config/posix/bar.h b/libgomp/config/posix/bar.h
index 35b94e43ce2..928b12a14ff 100644
--- a/libgomp/config/posix/bar.h
+++ b/libgomp/config/posix/bar.h
@@ -62,20 +62,21 @@ extern void gomp_barrier_init (gomp_barrier_t *,
unsigned);
  extern void gomp_barrier_reinit (gomp_barrier_t *, unsigned);
  extern void gomp_barrier_destroy (gomp_barrier_t *);

-extern void gomp_barrier_wait (gomp_barrier_t *);
+extern void gomp_barrier_wait (gomp_barrier_t *, unsigned);
  extern void gomp_barrier_wait_end (gomp_barrier_t *,
gomp_barrier_state_t);
-extern void gomp_team_barrier_wait (gomp_barrier_t *);
-extern void gomp_team_barrier_wait_end (gomp_barrier_t *,
-                                       gomp_barrier_state_t);
-extern bool gomp_team_barrier_wait_cancel (gomp_barrier_t *);
+extern void gomp_team_barrier_wait (gomp_barrier_t *, unsigned);
+extern void gomp_team_barrier_wait_end (gomp_barrier_t *,
gomp_barrier_state_t,
+                                       unsigned);
+extern bool gomp_team_barrier_wait_cancel (gomp_barrier_t *,
unsigned);
  extern bool gomp_team_barrier_wait_cancel_end (gomp_barrier_t *,
-                                              gomp_barrier_state_t);
+                                              gomp_barrier_state_t,
unsigned);
  extern void gomp_team_barrier_wake (gomp_barrier_t *, int);
  struct gomp_team;
  extern void gomp_team_barrier_cancel (struct gomp_team *);

  static inline gomp_barrier_state_t
-gomp_barrier_wait_start (gomp_barrier_t *bar)
+gomp_barrier_wait_start (gomp_barrier_t *bar,
+                        unsigned id __attribute__ ((unused)))
  {
    unsigned int ret;
    gomp_mutex_lock (&bar->mutex1);
@@ -86,7 +87,8 @@ gomp_barrier_wait_start (gomp_barrier_t *bar)
  }

  static inline gomp_barrier_state_t
-gomp_barrier_wait_cancel_start (gomp_barrier_t *bar)
+gomp_barrier_wait_cancel_start (gomp_barrier_t *bar,
+                               unsigned id __attribute__ ((unused)))
  {
    unsigned int ret;
    gomp_mutex_lock (&bar->mutex1);
@@ -99,9 +101,9 @@ gomp_barrier_wait_cancel_start (gomp_barrier_t
*bar)
  }

  static inline void
-gomp_team_barrier_wait_final (gomp_barrier_t *bar)
+gomp_team_barrier_wait_final (gomp_barrier_t *bar, unsigned id)
  {
-  gomp_team_barrier_wait (bar);
+  gomp_team_barrier_wait (bar, id);
  }

  static inline bool
@@ -111,9 +113,9 @@ gomp_barrier_last_thread (gomp_barrier_state_t
state)
  }

  static inline void
-gomp_barrier_wait_last (gomp_barrier_t *bar)
+gomp_barrier_wait_last (gomp_barrier_t *bar, unsigned id)
  {
-  gomp_barrier_wait (bar);
+  gomp_barrier_wait (bar, id);
  }

  /* All the inlines below must be called with team->task_lock
@@ -150,7 +152,8 @@ gomp_team_barrier_cancelled (gomp_barrier_t *bar)
  }

  static inline void
-gomp_team_barrier_done (gomp_barrier_t *bar, gomp_barrier_state_t
state)
+gomp_team_barrier_done (gomp_barrier_t *bar, gomp_barrier_state_t
state,
+                       unsigned use_cancel __attribute__ ((unused)))
  {
    /* Need the atomic store for acquire-release synchronisation with
the
       load in `gomp_team_barrier_wait_{cancel_,}end`.  See PR112356
*/
@@ -167,11 +170,81 @@ gomp_barrier_state_is_incremented
(gomp_barrier_state_t gen,
  }

  static inline bool
-gomp_barrier_has_completed (gomp_barrier_state_t state,
gomp_barrier_t *bar)
+gomp_barrier_has_completed (gomp_barrier_state_t state,
gomp_barrier_t *bar,
+                           bool use_cancel __attribute__ ((unused)))
  {
    /* Handling overflow in the generation.  The "next" state could be
less than
       or greater than the current one.  */
    return gomp_barrier_state_is_incremented (bar->generation, state);
  }

+/* Functions dummied out for this implementation.  */
+static inline void
+gomp_barrier_prepare_reinit (gomp_barrier_t *bar __attribute__
((unused)),
+                            unsigned id __attribute__ ((unused)))
+{}
+
+static inline void
+gomp_barrier_minimal_reinit (gomp_barrier_t *bar, unsigned nthreads,
+                            unsigned num_new_threads __attribute__
((unused)))
+{
+  gomp_barrier_reinit (bar, nthreads);
+}
+
+static inline void
+gomp_barrier_reinit_1 (gomp_barrier_t *bar, unsigned nthreads,
+                      unsigned num_new_threads,
+                      unsigned long long *new_ids __attribute__
((unused)))
+{
+  if (num_new_threads)
+    {
+      gomp_mutex_lock (&bar->mutex1);
+      bar->total += num_new_threads;
+      gomp_mutex_unlock (&bar->mutex1);
+    }
+}
+
+static inline void
+gomp_barrier_reinit_2 (gomp_barrier_t *bar, unsigned nthreads)
+{
+  gomp_barrier_reinit (bar, nthreads);
+}
+
+static inline bool
+gomp_barrier_has_space (gomp_barrier_t *bar __attribute__ ((unused)),
+                       unsigned nthreads __attribute__ ((unused)))
+{
+  /* Space to handle `nthreads`.  Only thing that we need is to set
bar->total
+     to `nthreads`.  Can always do that.  */
+  return true;
+}
+
+static inline void
+gomp_team_barrier_ensure_last (gomp_barrier_t *bar __attribute__ ((unused)),
+                              unsigned id __attribute__ ((unused)),
+                              gomp_barrier_state_t state
+                              __attribute__ ((unused)))
+{}
+
+static inline bool
+gomp_team_barrier_ensure_cancel_last (gomp_barrier_t *bar
+                                     __attribute__ ((unused)),
+                                     unsigned id __attribute__ ((unused)),
+                                     gomp_barrier_state_t state
+                                     __attribute__ ((unused)))
+{
+  /* After returning BAR_WAS_LAST, actually ensure that this thread is last.
+     Return `true` if this thread is known to be last into the barrier and
+     `false` if the barrier got cancelled such that not all threads entered
+     the barrier.
+
+     Since BAR_WAS_LAST is only set for a thread when that thread decremented
+     the `awaited` counter to zero, we know that all threads must have entered
+     the barrier.  Hence always return `true`.  */
+  return true;
+}
+
+static inline void
+gomp_reset_cancellable_primary_threadgen (gomp_barrier_t *bar,
unsigned id)
+{}
+
  #endif /* GOMP_BARRIER_H */
diff --git a/libgomp/config/posix/simple-bar.h
b/libgomp/config/posix/simple-bar.h
index 7b4b7e43ea6..12abd0512e8 100644
--- a/libgomp/config/posix/simple-bar.h
+++ b/libgomp/config/posix/simple-bar.h
@@ -43,9 +43,18 @@ gomp_simple_barrier_init (gomp_simple_barrier_t
*bar, unsigned count)
  }

  static inline void
-gomp_simple_barrier_reinit (gomp_simple_barrier_t *bar, unsigned
count)
+gomp_simple_barrier_minimal_reinit (gomp_simple_barrier_t *bar,
+                                   unsigned nthreads, unsigned
num_new_threads)
  {
-  gomp_barrier_reinit (&bar->bar, count);
+  gomp_barrier_minimal_reinit (&bar->bar, nthreads, num_new_threads);
+}
+
+static inline void
+gomp_simple_barrier_reinit_1 (gomp_simple_barrier_t *bar, unsigned
nthreads,
+                             unsigned num_new_threads,
+                             unsigned long long *new_ids)
+{
+  gomp_barrier_reinit_1 (&bar->bar, nthreads, num_new_threads,
new_ids);
  }

  static inline void
@@ -55,15 +64,33 @@ gomp_simple_barrier_destroy (gomp_simple_barrier_t
*bar)
  }

  static inline void
-gomp_simple_barrier_wait (gomp_simple_barrier_t *bar)
+gomp_simple_barrier_wait (gomp_simple_barrier_t *bar, unsigned id)
+{
+  gomp_barrier_wait (&bar->bar, id);
+}
+
+static inline void
+gomp_simple_barrier_wait_last (gomp_simple_barrier_t *bar, unsigned
id)
  {
-  gomp_barrier_wait (&bar->bar);
+  gomp_barrier_wait_last (&bar->bar, id);
  }

  static inline void
-gomp_simple_barrier_wait_last (gomp_simple_barrier_t *bar)
+gomp_simple_barrier_prepare_reinit (gomp_simple_barrier_t *sbar,
unsigned id)
+{
+  gomp_barrier_prepare_reinit (&sbar->bar, id);
+}
+
+static inline void
+gomp_simple_barrier_reinit_2 (gomp_simple_barrier_t *sbar, unsigned
nthreads)
+{
+  gomp_barrier_reinit_2 (&sbar->bar, nthreads);
+}
+
+static inline bool
+gomp_simple_barrier_has_space (gomp_simple_barrier_t *sbar, unsigned
nthreads)
  {
-  gomp_barrier_wait_last (&bar->bar);
+  return gomp_barrier_has_space (&sbar->bar, nthreads);
  }

  #endif /* GOMP_SIMPLE_BARRIER_H */
diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
index a43398300a5..e0459046bc9 100644
--- a/libgomp/libgomp.h
+++ b/libgomp/libgomp.h
@@ -59,6 +59,7 @@
  #include <stdbool.h>
  #include <stdlib.h>
  #include <stdarg.h>
+#include <limits.h>

  /* Needed for memset in priority_queue.c.  */
  #if _LIBGOMP_CHECKING_
@@ -200,6 +201,14 @@ extern void gomp_vfatal (const char *, va_list)
  extern void gomp_fatal (const char *, ...)
         __attribute__ ((noreturn, format (printf, 1, 2)));

+#if _LIBGOMP_CHECKING_
+#define gomp_assert(EXPR, MSG, ...)                                          \
+  if (!(EXPR))                                                               \
+  gomp_fatal ("%s:%d " MSG, __FILE__, __LINE__, __VA_ARGS__)
+#else
+#define gomp_assert(...)
+#endif
+
  struct gomp_task;
  struct gomp_taskgroup;
  struct htab;
@@ -1099,7 +1108,7 @@ extern unsigned gomp_dynamic_max_threads (void);
  extern void gomp_init_task (struct gomp_task *, struct gomp_task *,
                             struct gomp_task_icv *);
  extern void gomp_end_task (void);
-extern void gomp_barrier_handle_tasks (gomp_barrier_state_t);
+extern void gomp_barrier_handle_tasks (gomp_barrier_state_t, bool);
  extern void gomp_task_maybe_wait_for_dependencies (void **);
  extern bool gomp_create_target_task (struct gomp_device_descr *,
                                      void (*) (void *), size_t, void
**,
diff --git a/libgomp/single.c b/libgomp/single.c
index 397501daeb3..abba2976586 100644
--- a/libgomp/single.c
+++ b/libgomp/single.c
@@ -77,7 +77,7 @@ GOMP_single_copy_start (void)
      }
    else
      {
-      gomp_team_barrier_wait (&thr->ts.team->barrier);
+      gomp_team_barrier_wait (&thr->ts.team->barrier, thr->ts.team_id);

        ret = thr->ts.work_share->copyprivate;
        gomp_work_share_end_nowait ();
@@ -98,7 +98,7 @@ GOMP_single_copy_end (void *data)
    if (team != NULL)
      {
        thr->ts.work_share->copyprivate = data;
-      gomp_team_barrier_wait (&team->barrier);
+      gomp_team_barrier_wait (&team->barrier, thr->ts.team_id);
      }

    gomp_work_share_end_nowait ();
diff --git a/libgomp/task.c b/libgomp/task.c
index 5965e781f7e..658b51c1fd2 100644
--- a/libgomp/task.c
+++ b/libgomp/task.c
@@ -1549,7 +1549,7 @@ gomp_task_run_post_remove_taskgroup (struct
gomp_task *child_task)
  }

  void
-gomp_barrier_handle_tasks (gomp_barrier_state_t state)
+gomp_barrier_handle_tasks (gomp_barrier_state_t state, bool use_cancel)
  {
    struct gomp_thread *thr = gomp_thread ();
    struct gomp_team *team = thr->ts.team;
@@ -1570,7 +1570,7 @@ gomp_barrier_handle_tasks (gomp_barrier_state_t
state)
       When `task_count == 0` we're not going to perform tasks anyway,
so the
       problem of PR122314 is naturally avoided.  */
    if (team->task_count != 0
-      && gomp_barrier_has_completed (state, &team->barrier))
+      && gomp_barrier_has_completed (state, &team->barrier, use_cancel))
      {
        gomp_mutex_unlock (&team->task_lock);
        return;
@@ -1580,7 +1580,7 @@ gomp_barrier_handle_tasks (gomp_barrier_state_t
state)
      {
        if (team->task_count == 0)
         {
-         gomp_team_barrier_done (&team->barrier, state);
+         gomp_team_barrier_done (&team->barrier, state, use_cancel);
           gomp_mutex_unlock (&team->task_lock);
           gomp_team_barrier_wake (&team->barrier, 0);
           return;
@@ -1617,7 +1617,7 @@ gomp_barrier_handle_tasks (gomp_barrier_state_t
state)
        else if (team->task_count == 0
                && gomp_team_barrier_waiting_for_tasks (&team->barrier))
         {
-         gomp_team_barrier_done (&team->barrier, state);
+         gomp_team_barrier_done (&team->barrier, state, use_cancel);
           gomp_mutex_unlock (&team->task_lock);
           gomp_team_barrier_wake (&team->barrier, 0);
           if (to_free)
@@ -2243,7 +2243,7 @@ GOMP_taskgroup_end (void)
          is #pragma omp target nowait that creates an implicit
          team with a single thread.  In this case, we want to wait
          for all outstanding tasks in this team.  */
-      gomp_team_barrier_wait (&team->barrier);
+      gomp_team_barrier_wait (&team->barrier, thr->ts.team_id);
        return;
      }

@@ -2698,7 +2698,7 @@ GOMP_workshare_task_reduction_unregister (bool
cancelled)
      htab_free ((struct htab *) data[5]);

    if (!cancelled)
-    gomp_team_barrier_wait (&team->barrier);
+    gomp_team_barrier_wait (&team->barrier, thr->ts.team_id);
  }

  int
diff --git a/libgomp/team.c b/libgomp/team.c
index cb1d3235312..512b1368af6 100644
--- a/libgomp/team.c
+++ b/libgomp/team.c
@@ -109,28 +109,28 @@ gomp_thread_start (void *xdata)
        struct gomp_team *team = thr->ts.team;
        struct gomp_task *task = thr->task;

-      gomp_barrier_wait (&team->barrier);
+      gomp_barrier_wait (&team->barrier, thr->ts.team_id);

        local_fn (local_data);
-      gomp_team_barrier_wait_final (&team->barrier);
+      gomp_team_barrier_wait_final (&team->barrier, thr->ts.team_id);
        gomp_finish_task (task);
-      gomp_barrier_wait_last (&team->barrier);
+      gomp_barrier_wait_last (&team->barrier, thr->ts.team_id);
      }
    else
      {
        pool->threads[thr->ts.team_id] = thr;

-      gomp_simple_barrier_wait (&pool->threads_dock);
+      gomp_simple_barrier_wait (&pool->threads_dock, thr->ts.team_id);
        do
         {
           struct gomp_team *team = thr->ts.team;
           struct gomp_task *task = thr->task;

           local_fn (local_data);
-         gomp_team_barrier_wait_final (&team->barrier);
+         gomp_team_barrier_wait_final (&team->barrier, thr->ts.team_id);
           gomp_finish_task (task);

-         gomp_simple_barrier_wait (&pool->threads_dock);
+         gomp_simple_barrier_wait (&pool->threads_dock, thr->ts.team_id);

           local_fn = thr->fn;
           local_data = thr->data;
@@ -243,7 +243,7 @@ gomp_free_pool_helper (void *thread_pool)
    struct gomp_thread *thr = gomp_thread ();
    struct gomp_thread_pool *pool
      = (struct gomp_thread_pool *) thread_pool;
-  gomp_simple_barrier_wait_last (&pool->threads_dock);
+  gomp_simple_barrier_wait_last (&pool->threads_dock, thr->ts.team_id);
    gomp_sem_destroy (&thr->release);
    thr->thread_pool = NULL;
    thr->task = NULL;
@@ -278,10 +278,10 @@ gomp_free_thread (void *arg
__attribute__((unused)))
               nthr->data = pool;
             }
          /* This barrier undocks threads docked on pool->threads_dock.  */
-         gomp_simple_barrier_wait (&pool->threads_dock);
+         gomp_simple_barrier_wait (&pool->threads_dock, thr->ts.team_id);
           /* And this waits till all threads have called
gomp_barrier_wait_last
              in gomp_free_pool_helper.  */
-         gomp_simple_barrier_wait (&pool->threads_dock);
+         gomp_simple_barrier_wait (&pool->threads_dock, thr->ts.team_id);
           /* Now it is safe to destroy the barrier and free the pool.
*/
           gomp_simple_barrier_destroy (&pool->threads_dock);

@@ -457,6 +457,14 @@ gomp_team_start (void (*fn) (void *), void *data, unsigned nthreads,
    else
      bind = omp_proc_bind_false;
 
+  unsigned bits_per_ull = sizeof (unsigned long long) * CHAR_BIT;
+  int id_arr_len = ((nthreads + pool->threads_used) / bits_per_ull) + 1;
+  unsigned long long new_ids[id_arr_len];
+  for (int j = 0; j < id_arr_len; j++)
+    {
+      new_ids[j] = 0;
+    }
+
    /* We only allow the reuse of idle threads for non-nested PARALLEL
       regions.  This appears to be implied by the semantics of
       threadprivate variables, but perhaps that's reading too much
into
@@ -464,6 +472,11 @@ gomp_team_start (void (*fn) (void *), void *data, unsigned nthreads,
      only the initial program thread will modify gomp_threads.  */
    if (!nested)
      {
+      /* The current thread is always re-used in the next team.  */
+      unsigned total_reused = 1;
+      gomp_assert (team->prev_ts.team_id == 0,
+                  "Starting a team from thread with id %u in previous team\n",
+                  team->prev_ts.team_id);
        old_threads_used = pool->threads_used;

        if (nthreads <= old_threads_used)
@@ -474,13 +487,7 @@ gomp_team_start (void (*fn) (void *), void *data,
unsigned nthreads,
           gomp_simple_barrier_init (&pool->threads_dock, nthreads);
         }
        else
-       {
-         n = old_threads_used;
-
-         /* Increase the barrier threshold to make sure all new
-            threads arrive before the team is released.  */
-         gomp_simple_barrier_reinit (&pool->threads_dock, nthreads);
-       }
+       n = old_threads_used;

        /* Not true yet, but soon will be.  We're going to release all
          threads from the dock, and those that aren't part of the
@@ -502,6 +509,7 @@ gomp_team_start (void (*fn) (void *), void *data,
unsigned nthreads,
         }

        /* Release existing idle threads.  */
+      bool have_prepared = false;
        for (; i < n; ++i)
         {
           unsigned int place_partition_off = thr->ts.place_partition_off;
@@ -643,7 +651,23 @@ gomp_team_start (void (*fn) (void *), void *data,
unsigned nthreads,
           nthr->ts.team = team;
           nthr->ts.work_share = &team->work_shares[0];
           nthr->ts.last_work_share = NULL;
+         /* If we're changing any thread's team_id then we need to wait for
+            all other threads to have reached the barrier.  */
+         if (nthr->ts.team_id != i && !have_prepared)
+           {
+             gomp_simple_barrier_prepare_reinit (&pool->threads_dock,
+                                                 thr->ts.team_id);
+             have_prepared = true;
+           }
           nthr->ts.team_id = i;
+         {
+           unsigned idx = (i / bits_per_ull);
+           gomp_assert (!(new_ids[idx] & (1ULL << (i % bits_per_ull))),
+                        "new_ids[%u] == %llu (for `i` %u)", idx, new_ids[idx],
+                        i);
+           new_ids[idx] |= (1ULL << (i % bits_per_ull));
+         }
+         total_reused += 1;
           nthr->ts.level = team->prev_ts.level + 1;
           nthr->ts.active_level = thr->ts.active_level;
           nthr->ts.place_partition_off = place_partition_off;
@@ -714,13 +738,54 @@ gomp_team_start (void (*fn) (void *), void *data, unsigned nthreads,
                     }
                   break;
                 }
+           }
+       }
 
-             /* Increase the barrier threshold to make sure all new
-                threads and all the threads we're going to let die
-                arrive before the team is released.  */
-             if (affinity_count)
-               gomp_simple_barrier_reinit (&pool->threads_dock,
-                                           nthreads + affinity_count);
+      /* If we are changing the number of threads *or* if we are starting new
+        threads for any reason, then update the barrier accordingly.
+
+        The handling of the barrier here is different for the different
+        designs of barrier.
+
+        The `posix/bar.h` design needs to "grow" to accommodate the extra
+        threads that we'll wait on, then "shrink" to the size we want
+        eventually.
+
+        The `linux/bar.h` design needs to assign positions for each thread.
+        Some of the threads getting started will want the position of a thread
+        that is currently running.  Hence we need to (1) serialise existing
+        threads then (2) set up barrier state for the incoming new threads.
+        Once this is done we don't need any equivalent of the "shrink" step
+        later.  This does result in a longer period of serialisation than
+        the posix/bar.h design, but it seems that this is a fair trade-off to
+        make for the design that is faster under contention.  */
+      if (old_threads_used != 0
+         && (nthreads != pool->threads_dock.bar.total || i < nthreads))
+       {
+         /* If all we've done is increase the number of threads that we want,
+            we don't need to serialise anything (wake flags don't need to be
+            adjusted).  */
+         if (nthreads > old_threads_used && affinity_count == 0
+             && total_reused == old_threads_used
+             /* `have_prepared` can be used to detect whether we re-shuffled
+                any threads around.  */
+             && !have_prepared
+             && gomp_simple_barrier_has_space (&pool->threads_dock, nthreads))
+           gomp_simple_barrier_minimal_reinit (&pool->threads_dock, nthreads,
+                                               nthreads - old_threads_used);
+         else
+           {
+             /* Otherwise, we need to ensure that we've paused all existing
+                threads (waiting on us to restart them) before adjusting their
+                wake flags.  */
+             if (!have_prepared)
+               gomp_simple_barrier_prepare_reinit (&pool->threads_dock,
+                                                   thr->ts.team_id);
+             gomp_simple_barrier_reinit_1 (&pool->threads_dock, nthreads,
+                                           nthreads <= total_reused
+                                             ? 0
+                                             : nthreads - total_reused,
+                                           new_ids);
+           }
+       }
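The `new_ids` bookkeeping used above is just a bitmap with one bit per
team_id.  A stand-alone sketch of the set/test pattern that gomp_team_start
and gomp_barrier_reinit_1 share (the helper names are mine, not part of the
patch):

#include <limits.h>
#include <stdbool.h>
#include <stdio.h>

#define BITS_PER_ULL (sizeof (unsigned long long) * CHAR_BIT)

static void
id_bitmap_set (unsigned long long *ids, unsigned i)
{
  ids[i / BITS_PER_ULL] |= 1ULL << (i % BITS_PER_ULL);
}

static bool
id_bitmap_test (const unsigned long long *ids, unsigned i)
{
  return (ids[i / BITS_PER_ULL] >> (i % BITS_PER_ULL)) & 1;
}

int
main (void)
{
  unsigned nthreads = 70;                       /* spans two ULL words */
  unsigned long long ids[(70 / BITS_PER_ULL) + 1] = { 0 };
  id_bitmap_set (ids, 1);                       /* re-used existing thread */
  id_bitmap_set (ids, 65);                      /* re-used existing thread */
  for (unsigned i = 0; i < nthreads; i++)
    if (id_bitmap_test (ids, i))
      printf ("team_id %u re-uses an existing thread\n", i);
  return 0;
}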

@@ -868,9 +933,9 @@ gomp_team_start (void (*fn) (void *), void *data,
unsigned nthreads,

   do_release:
    if (nested)
-    gomp_barrier_wait (&team->barrier);
+    gomp_barrier_wait (&team->barrier, thr->ts.team_id);
    else
-    gomp_simple_barrier_wait (&pool->threads_dock);
+    gomp_simple_barrier_wait (&pool->threads_dock, thr->ts.team_id);

    /* Decrease the barrier threshold to match the number of threads
       that should arrive back at the end of this team.  The extra
@@ -888,8 +953,7 @@ gomp_team_start (void (*fn) (void *), void *data,
unsigned nthreads,
        if (affinity_count)
         diff = -affinity_count;

-      gomp_simple_barrier_reinit (&pool->threads_dock, nthreads);
-
+      gomp_simple_barrier_reinit_2 (&pool->threads_dock, nthreads);
  #ifdef HAVE_SYNC_BUILTINS
        __sync_fetch_and_add (&gomp_managed_threads, diff);
  #else
@@ -949,12 +1013,13 @@ gomp_team_end (void)
  {
    struct gomp_thread *thr = gomp_thread ();
    struct gomp_team *team = thr->ts.team;
+  unsigned team_id = thr->ts.team_id;

    /* This barrier handles all pending explicit threads.
       As #pragma omp cancel parallel might get awaited count in
       team->barrier in a inconsistent state, we need to use a
different
       counter here.  */
-  gomp_team_barrier_wait_final (&team->barrier);
+  gomp_team_barrier_wait_final (&team->barrier, thr->ts.team_id);
    if (__builtin_expect (team->team_cancelled, 0))
      {
        struct gomp_work_share *ws = team->work_shares_to_free;
@@ -985,7 +1050,7 @@ gomp_team_end (void)
  #endif
        /* This barrier has gomp_barrier_wait_last counterparts
          and ensures the team can be safely destroyed.  */
-      gomp_barrier_wait (&team->barrier);
+      gomp_barrier_wait (&team->barrier, team_id);
      }

    if (__builtin_expect (team->work_shares[0].next_alloc != NULL, 0))
@@ -1049,7 +1114,7 @@ gomp_pause_pool_helper (void *thread_pool)
    struct gomp_thread *thr = gomp_thread ();
    struct gomp_thread_pool *pool
      = (struct gomp_thread_pool *) thread_pool;
-  gomp_simple_barrier_wait_last (&pool->threads_dock);
+  gomp_simple_barrier_wait_last (&pool->threads_dock, thr->ts.team_id);
    gomp_sem_destroy (&thr->release);
    thr->thread_pool = NULL;
    thr->task = NULL;
@@ -1081,10 +1146,10 @@ gomp_pause_host (void)
               thrs[i] = gomp_thread_to_pthread_t (nthr);
             }
          /* This barrier undocks threads docked on pool->threads_dock.  */
-         gomp_simple_barrier_wait (&pool->threads_dock);
+         gomp_simple_barrier_wait (&pool->threads_dock, thr->ts.team_id);
           /* And this waits till all threads have called
gomp_barrier_wait_last
              in gomp_pause_pool_helper.  */
-         gomp_simple_barrier_wait (&pool->threads_dock);
+         gomp_simple_barrier_wait (&pool->threads_dock, thr->ts.team_id);
           /* Now it is safe to destroy the barrier and free the pool.
*/
           gomp_simple_barrier_destroy (&pool->threads_dock);

diff --git a/libgomp/testsuite/libgomp.c/primary-thread-tasking.c
b/libgomp/testsuite/libgomp.c/primary-thread-tasking.c
new file mode 100644
index 00000000000..4b70dca30a2
--- /dev/null
+++ b/libgomp/testsuite/libgomp.c/primary-thread-tasking.c
@@ -0,0 +1,80 @@
+/* Test to check that our primary thread can execute some tasks while waiting
+   for other threads.  This checks an edge-case in a recent implementation of
+   the barrier tasking mechanism.
+   I don't believe there's any way to guarantee that a task will be run on a
+   given thread.  Hence I don't know anywhere I can put an `abort` and say we
+   failed.  However I can set things up so that we'll time out if the primary
+   thread is not executing any tasks.  That timeout will at least count as a
+   fail.
+
+   The idea here is that we keep spawning tasks until one is handled by the
+   primary thread.  Meanwhile we give the secondary threads lots of
+   opportunities to sleep and let the primary thread take a task.  */
+/* { dg-do run  { target *-linux-* } } */
+
+#define _GNU_SOURCE
+#include <omp.h>
+#include <unistd.h>
+#include <sys/syscall.h>
+#include <linux/futex.h>
+#include <assert.h>
+#include <stdatomic.h>
+#include <limits.h>
+
+int wake_flag = 0;
+
+void
+continue_until_on_thread0 ()
+{
+  if (omp_get_thread_num () == 0)
+    {
+      __atomic_fetch_add (&wake_flag, 1, memory_order_relaxed);
+      syscall (SYS_futex, &wake_flag, FUTEX_WAKE | FUTEX_PRIVATE_FLAG, INT_MAX);
+    }
+  else
+    {
+      /* If the flag has been set we're done.  Otherwise put another few
+       * tasks on the task queue and wait to be woken.  */
+      if (__atomic_load_n (&wake_flag, memory_order_relaxed))
+       {
+         return;
+       }
+#pragma omp task
+      continue_until_on_thread0 ();
+#pragma omp task
+      continue_until_on_thread0 ();
+#pragma omp task
+      continue_until_on_thread0 ();
+      syscall (SYS_futex, &wake_flag, FUTEX_WAIT | FUTEX_PRIVATE_FLAG, 0, NULL);
+    }
+}
+
+void
+foo ()
+{
+#pragma omp parallel
+  {
+    if (omp_get_thread_num () != 0)
+      {
+#pragma omp task
+       continue_until_on_thread0 ();
+#pragma omp task
+       continue_until_on_thread0 ();
+       /* Wait for the primary thread to have executed one of the tasks.  */
+       int val = __atomic_load_n (&wake_flag, memory_order_acquire);
+       while (val == 0)
+         {
+           syscall (SYS_futex, &wake_flag, FUTEX_WAIT | FUTEX_PRIVATE_FLAG,
+                    val, NULL);
+           val = __atomic_load_n (&wake_flag, memory_order_acquire);
+         }
+      }
+  }
+}
+
+int
+main ()
+{
+  foo ();
+  return 0;
+}
diff --git a/libgomp/work.c b/libgomp/work.c
index eae972ae2d1..b75b97da181 100644
--- a/libgomp/work.c
+++ b/libgomp/work.c
@@ -240,10 +240,14 @@ gomp_work_share_end (void)
        return;
      }

-  bstate = gomp_barrier_wait_start (&team->barrier);
+  bstate = gomp_barrier_wait_start (&team->barrier, thr->ts.team_id);

    if (gomp_barrier_last_thread (bstate))
      {
+      /* For some targets the state returning "last" no longer indicates that
+        we're *already* last; instead it indicates that we *should be* last.
+        Perform the relevant synchronisation.  */
+      gomp_team_barrier_ensure_last (&team->barrier, thr->ts.team_id, bstate);
        if (__builtin_expect (thr->ts.last_work_share != NULL, 1))
         {
           team->work_shares_to_free = thr->ts.work_share;
@@ -251,7 +255,7 @@ gomp_work_share_end (void)
         }
      }

-  gomp_team_barrier_wait_end (&team->barrier, bstate);
+  gomp_team_barrier_wait_end (&team->barrier, bstate, thr->ts.team_id);
    thr->ts.last_work_share = NULL;
  }

@@ -266,19 +270,27 @@ gomp_work_share_end_cancel (void)
    gomp_barrier_state_t bstate;

    /* Cancellable work sharing constructs cannot be orphaned.  */
-  bstate = gomp_barrier_wait_cancel_start (&team->barrier);
+  bstate = gomp_barrier_wait_cancel_start (&team->barrier, thr->ts.team_id);

    if (gomp_barrier_last_thread (bstate))
      {
-      if (__builtin_expect (thr->ts.last_work_share != NULL, 1))
+      if (gomp_team_barrier_ensure_cancel_last (&team->barrier, thr->ts.team_id,
+                                               bstate))
         {
-         team->work_shares_to_free = thr->ts.work_share;
-         free_work_share (team, thr->ts.last_work_share);
+         if (__builtin_expect (thr->ts.last_work_share != NULL, 1))
+           {
+             team->work_shares_to_free = thr->ts.work_share;
+             free_work_share (team, thr->ts.last_work_share);
+           }
         }
+      else
+       gomp_reset_cancellable_primary_threadgen (&team->barrier,
+                                                 thr->ts.team_id);
      }
    thr->ts.last_work_share = NULL;

-  return gomp_team_barrier_wait_cancel_end (&team->barrier, bstate);
+  return gomp_team_barrier_wait_cancel_end (&team->barrier, bstate,
+                                           thr->ts.team_id);
  }

  /* The current thread is done with its current work sharing construct.
--
2.43.0
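
To spell out the caller-side convention that the work.c hunks above rely
on (this is only an illustration of the sequence already in the diff, not
extra patch content): with this barrier, gomp_barrier_last_thread
returning true means the calling thread has been *selected* to finish the
barrier, so it has to synchronise before doing any last-thread-only
cleanup.  Roughly:

   bstate = gomp_barrier_wait_start (&team->barrier, thr->ts.team_id);
   if (gomp_barrier_last_thread (bstate))
     {
       /* We only *should be* last at this point; perform the
          synchronisation that makes us actually last before touching
          state the other threads may still be using.  */
       gomp_team_barrier_ensure_last (&team->barrier, thr->ts.team_id, bstate);
       /* ... last-thread-only cleanup ... */
     }
   gomp_team_barrier_wait_end (&team->barrier, bstate, thr->ts.team_id);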
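
Likewise, in case it helps review of the new testcase: its futex use is
just the usual flag-plus-futex handshake, sketched standalone below (the
function names here are illustrative only and not part of the patch):

   #define _GNU_SOURCE
   #include <unistd.h>
   #include <sys/syscall.h>
   #include <linux/futex.h>
   #include <limits.h>

   static int flag;

   /* Waker side: bump the flag, then wake everyone blocked on it.  */
   static void
   announce (void)
   {
     __atomic_fetch_add (&flag, 1, __ATOMIC_RELAXED);
     syscall (SYS_futex, &flag, FUTEX_WAKE | FUTEX_PRIVATE_FLAG, INT_MAX);
   }

   /* Waiter side: re-check the flag around FUTEX_WAIT; the kernel only
      blocks if the flag still equals VAL, so a wake that races with the
      load is not lost.  */
   static void
   await_flag (void)
   {
     int val = __atomic_load_n (&flag, __ATOMIC_ACQUIRE);
     while (val == 0)
       {
         syscall (SYS_futex, &flag, FUTEX_WAIT | FUTEX_PRIVATE_FLAG, val, NULL);
         val = __atomic_load_n (&flag, __ATOMIC_ACQUIRE);
       }
   }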


