Putting this up for visibility -- I have not yet run all the benchmarks
in my suite to check for regressions, but given how minimal the change
is I doubt there will be any.

This patch is "part 2" of the work I'm doing towards PR119588.
In that bug I describe how some internal workloads are hitting
performance problems w.r.t. the creation of many parallel regions one
after another.

I'm attempting to reduce overhead of parallel region creation when using
a large number of threads.  I've posted one patch to rework the barrier
implementation to something that performs much better in this particular
case:
https://gcc.gnu.org/pipermail/gcc-patches/2025-September/695005.html

This is a second patch, addressing a different area where I've seen a
performance impact.  It's the area I mentioned in the message below to
the GCC mailing list:
https://gcc.gnu.org/pipermail/gcc/2025-September/246618.html

This doesn't address the majority of the non-barrier-related
performance overhead, but it's a relatively simple piece of work moving
some re-initialisation of data in existing threads from the primary
thread to each secondary thread.  The performance benefit comes from
this work no longer being done by one thread while serialised, but
instead being done by all threads in parallel.  The amount of
re-initialisation scales with the number of threads -- which is why it
has such an impact on the problematic case I have been looking at.

The general idea is that the values that are re-set on each existing
secondary thread are stored in a team data structure by the primary
thread instead of being set directly on each thread structure.  When a
secondary thread in a non-nested team continues around its main work
loop it re-initialises its own thread-local data from this team cache,
as sketched below.
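
In outline (simplified from the diff below; the new cache fields live
in `struct gomp_team`):

    /* Primary thread, in gomp_team_start: stash the values once per team.  */
    team->task_icv_init = team->implicit_task[0].icv;
    team->task_parent_init = task;
    team->task_taskgroup = taskgroup;

    /* Each secondary thread, in gomp_thread_start, after waking up on
       threads_dock: re-initialise its own state from the team cache.  */
    team = thr->ts.team;
    thr->ts.work_share = &team->work_shares[0];
    thr->task = &team->implicit_task[thr->ts.team_id];
    gomp_init_task (thr->task, team->task_parent_init, &team->task_icv_init);
    thr->task->taskgroup = team->task_taskgroup;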

Threads that are not going to be used in the next region still
unnecessarily re-initialise their `team` data structure etc., but since
`thr->fn` is NULL they exit the do/while loop shortly afterwards, so
this makes little difference; the work is also spread across threads,
so it doesn't seem to block other useful work (see the loop excerpt
below).
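
For reference, the relevant part of the secondary-thread loop in
`gomp_thread_start` after this patch looks roughly as follows
(paraphrased, with details elided):

    do
      {
        ...
        gomp_simple_barrier_wait (&pool->threads_dock);
        /* New in this patch: re-initialise thread-local state
           (thr->ts.work_share, thr->task, ...) from the team cache.  */
        local_fn = thr->fn;
        local_data = thr->data;
        thr->fn = NULL;
      }
    while (local_fn);  /* Threads with no work for the next region see a
                          NULL fn here and leave the loop.  */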

N.b. I've also noticed that the `thr->ts.team` data doesn't need to be
reset in the particular case I'm looking into, but I have not made that
change in this patch.  It does seem to provide some measurable gains on
the micro-benchmark I'm looking at, but it requires much more code
(passing information down to `gomp_team_start` about whether this is a
re-used team or not) so I've left it for the moment.

------------------------------
I believe there must be more opportunity to reduce the overhead of
creating multiple parallel regions in sequence.  The micro-benchmark
I'm looking at shows LLVM can do this with very little overhead over a
single barrier (parallel region creation being about 16% slower than a
barrier for LLVM, as opposed to the 4.3x slowdown at 144 threads that
I'm currently seeing for GNU).

I'm hoping this is a straightforward step in the right direction.
As I've mentioned on IRC and in the bugzilla, I also hope to have only
one full barrier in between parallel regions (split into two
half-barriers providing the necessary synchronisation for exit and
entry) instead of the two barriers we currently have.  From there I
hope to look a bit closer at data structure re-initialisation when
re-using the same team, but I don't yet have a concrete idea of the
actual change to make.

Signed-off-by: Matthew Malcomson <[email protected]>
---
 libgomp/libgomp.h |  8 ++++++++
 libgomp/team.c    | 30 +++++++++++++++++++-----------
 2 files changed, 27 insertions(+), 11 deletions(-)

diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
index a43398300a5..0dac5871d29 100644
--- a/libgomp/libgomp.h
+++ b/libgomp/libgomp.h
@@ -857,6 +857,14 @@ struct gomp_team
   /* Number of tasks waiting for their completion event to be fulfilled.  */
   unsigned int task_detach_count;
 
+  /* Cache to help initialise each secondary thread's `task` when re-using a
+     team very many times.  This allows each secondary thread to perform the
+     re-initialisation instead of having the primary thread need to perform
+     that initialisation.  */
+  struct gomp_task_icv task_icv_init;
+  struct gomp_task *task_parent_init;
+  struct gomp_taskgroup *task_taskgroup;
+
   /* This array contains structures for implicit tasks.  */
   struct gomp_task implicit_task[];
 };
diff --git a/libgomp/team.c b/libgomp/team.c
index cb1d3235312..984e90f56d1 100644
--- a/libgomp/team.c
+++ b/libgomp/team.c
@@ -131,6 +131,16 @@ gomp_thread_start (void *xdata)
          gomp_finish_task (task);
 
          gomp_simple_barrier_wait (&pool->threads_dock);
+         team = thr->ts.team;
+         thr->ts.work_share = &team->work_shares[0];
+         thr->ts.last_work_share = NULL;
+#ifdef HAVE_SYNC_BUILTINS
+         thr->ts.single_count = 0;
+#endif
+         thr->ts.static_trip = 0;
+         thr->task = &team->implicit_task[thr->ts.team_id];
+         gomp_init_task (thr->task, team->task_parent_init, &team->task_icv_init);
+         thr->task->taskgroup = team->task_taskgroup;
 
          local_fn = thr->fn;
          local_data = thr->data;
@@ -383,6 +393,15 @@ gomp_team_start (void (*fn) (void *), void *data, unsigned nthreads,
   thr->task->taskgroup = taskgroup;
   team->implicit_task[0].icv.nthreads_var = nthreads_var;
   team->implicit_task[0].icv.bind_var = bind_var;
+  /* Copy the whole structure rather than a pointer so that there is an
+     immutable copy for the secondary threads to initialise from while they
+     are executing.  The secondary threads can't copy from
+     `team->implicit_task[0].icv` themselves because the primary thread could
+     have adjusted its per-task ICVs by the time a secondary thread gets
+     around to initialising things.  */
+  team->task_icv_init = team->implicit_task[0].icv;
+  team->task_parent_init = task;
+  team->task_taskgroup = taskgroup;
 
   if (nthreads == 1)
     return;
@@ -641,26 +660,15 @@ gomp_team_start (void (*fn) (void *), void *data, unsigned nthreads,
          else
            nthr = pool->threads[i];
          nthr->ts.team = team;
-         nthr->ts.work_share = &team->work_shares[0];
-         nthr->ts.last_work_share = NULL;
          nthr->ts.team_id = i;
          nthr->ts.level = team->prev_ts.level + 1;
          nthr->ts.active_level = thr->ts.active_level;
          nthr->ts.place_partition_off = place_partition_off;
          nthr->ts.place_partition_len = place_partition_len;
          nthr->ts.def_allocator = thr->ts.def_allocator;
-#ifdef HAVE_SYNC_BUILTINS
-         nthr->ts.single_count = 0;
-#endif
-         nthr->ts.static_trip = 0;
          nthr->num_teams = thr->num_teams;
          nthr->team_num = thr->team_num;
-         nthr->task = &team->implicit_task[i];
          nthr->place = place;
-         gomp_init_task (nthr->task, task, icv);
-         team->implicit_task[i].icv.nthreads_var = nthreads_var;
-         team->implicit_task[i].icv.bind_var = bind_var;
-         nthr->task->taskgroup = taskgroup;
          nthr->fn = fn;
          nthr->data = data;
          team->ordered_release[i] = &nthr->release;
-- 
2.43.0
