Ping
On 1/29/26 02:02, Prathamesh Kulkarni wrote:
-----Original Message-----
From: [email protected] <[email protected]>
Sent: 26 November 2025 18:42
To: [email protected]; Jakub Jelinek <[email protected]>; Tobias
Burnus <[email protected]>
Cc: Julian Brown <[email protected]>; Thomas Schwinge
<[email protected]>; Andrew Stubbs <[email protected]>; Tom de
Vries <[email protected]>; Sebastian Huber <sebastian.huber@embedded-
brains.de>; Matthew Malcomson <[email protected]>
Subject: [PATCH 5/5] libgomp: Move thread task re-initialisation into
threads
From: Matthew Malcomson <[email protected]>
This patch is a separate, beneficial step of the work I'm doing towards
PR119588.  In that bug I describe how some internal workloads are
hitting performance problems w.r.t. the creation of many parallel
regions one after another.
I'm attempting to reduce the overhead of parallel region creation when
using a large number of threads.  The two earlier patches in this
patchset rework the barrier implementation into something that performs
much better in this particular case.

This patch is built on top of those, but with an appropriate rebase it
could be independent.  It addresses a different area where I've seen a
performance impact -- the one I mentioned in the below mail to the GCC
mailing list:
https://gcc.gnu.org/pipermail/gcc/2025-September/246618.html
The fundamental point is that for LLVM the cost of teardown & setup of a
parallel region is very close to the cost of a single barrier, whereas
for us there is a significant chunk of extra work in the actual teardown
and setup.  When re-using a team with no changes in the affinity ICV
etc., it would seem that the ideal case is similar to LLVM's: parallel
region creation is about 16% slower than a barrier for LLVM, as opposed
to the 4.3x slower for 144 threads that I'm seeing at the moment for
GNU.
This patch gets nowhere near that ideal.  In this patch I observe that
the majority of the performance overhead in this case comes from
re-initialising the existing threads.  While all existing threads are
waiting on the primary thread, that primary thread is iterating over
each secondary thread's data structures and setting them up as needed
for the next parallel region.

There are a number of stores that could relatively easily be performed
in each secondary thread after it is released by the primary thread --
removing this work from the serialised region.  This patch proposes that
much simpler setup.  This re-initialisation scales with the number of
threads -- which is why it has such an impact on the problematic case I
have been looking at.
The only subtlety here is that, in order to ensure memory safety, we
need to avoid performing this initialisation when `thr->fn == NULL`.  If
a secondary non-nested thread gets "left to exit" when a new team is
started, the primary thread could race ahead, cache the new team, and
free the team that this secondary thread was using, all before the
secondary thread goes through the new initialisation code.  That could
lead to a use-after-free.  Hence we have to early-exit when this thread
is not going to perform any more work.
- N.b. when running TSAN before and after, I realised that this change
  happens to remove a race between `gomp_finish_task` in the secondary
  thread and `gomp_init_task` in the primary thread (when re-using a
  team, and hence `nthr->task` is in the same place as the thread's
  local `task` variable).
All threads that are not going to be used in the next region do
unnecessarily re-init their `team` data structure etc., but since their
`thr->fn` is NULL they exit the do/while loop soon afterwards, so that
makes little difference, and the work is spread across threads so it
shouldn't block other useful work.
I imagine there is a good chunk of performance to be gained by adding
more logic here:
1) Something recording whether the affinity setup has changed since the
   last non-nested team and, if not, removing the calculation of places
   and the assignment to the `nthr->ts.place_partition_*` variables.
2) I could move more members that are assigned multiple times to `team`
   for each secondary thread to take:
   - the `data` pointer;
   - `num_teams`;
   - `team_num`.
   - Honestly, I should have looked into this in the patch -- I noticed
     it while writing this cover letter and will look into whether it is
     feasible relatively soon.
3) If we can identify that we're re-using a team from the last parallel
   region (and the affinity ICV has not changed), it seems that we could
   avoid re-initialising some of its fields:
   - `ordered_release[i]` should already point to `nthr`;
   - `ts.team_id` should already be set.
Overall, if we can identify that we're using the same team as was cached
and we don't need to change anything, we should be able to get away with
drastically less work in the serial part of the call to GOMP_parallel.
I did run the micro-benchmark I'm looking at (mentioned in PR119588) on
some hacked-up versions of the above, and they seemed to provide
measurable gains.  I haven't tried to put them all in this patch due to
lack of time and the relatively higher complexity of the bookkeeping.

I'm hoping this is a straightforward step in the right direction.
-------
Testing done (note: much of the testing was done on top of my other
patches rather than on this patch by itself):
- Bootstrap & regtest on x86_64 and aarch64
- Testsuite ran with OMP_WAIT_POLICY=PASSIVE as well.
- Also done with `--enable-linux-futex=no` for posix/ target.
- Also done with `_LIBGOMP_CHECKING_` set for assertions.
- Cross compiler & regtest on arm (qemu).
- TSAN done combining all my other patches upstream.
Hi,
ping: https://gcc.gnu.org/pipermail/gcc-patches/2025-November/702030.html
Thanks,
Prathamesh
libgomp/ChangeLog:

	PR libgomp/119588
	* libgomp.h (struct gomp_team): Add fields for communication
	at team start time.
	* team.c (gomp_thread_start): Initialise thread-local data
	using members stored on team.
	(gomp_team_start): Primary thread stores data on team
	structure instead of copying that data to each secondary
	thread's gomp_thread structure.
Signed-off-by: Matthew Malcomson <[email protected]>
---
libgomp/libgomp.h | 8 ++++++++
libgomp/team.c | 37 ++++++++++++++++++++++++++-----------
2 files changed, 34 insertions(+), 11 deletions(-)
diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
index 4269a3f4e60..b188567ca28 100644
--- a/libgomp/libgomp.h
+++ b/libgomp/libgomp.h
@@ -866,6 +866,14 @@ struct gomp_team
   /* Number of tasks waiting for their completion event to be
      fulfilled.  */
   unsigned int task_detach_count;
+
+  /* Cache to help initialise each secondary thread's `task` when re-using a
+     team very many times.  This allows each secondary thread to perform the
+     re-initialisation instead of having the primary thread need to perform
+     that initialisation.  */
+  struct gomp_task_icv task_icv_init;
+  struct gomp_task *task_parent_init;
+  struct gomp_taskgroup *task_taskgroup;
 
   /* This array contains structures for implicit tasks.  */
   struct gomp_task implicit_task[];
 };
diff --git a/libgomp/team.c b/libgomp/team.c
index 30aefcd22b6..0553b8d83a3 100644
--- a/libgomp/team.c
+++ b/libgomp/team.c
@@ -158,6 +158,23 @@ gomp_thread_start (void *xdata)
 	  if (!gomp_barrier_can_hold (&team->barrier) || thr->fn == NULL)
 	    gomp_simple_barrier_wait (&pool->threads_dock);
+	  /* Early exit because if `thr->fn == NULL` then we can't guarantee
+	     the `thr->ts.team` hasn't been freed by the primary thread
+	     racing ahead.  Hence we don't want to write to it.  */
+	  if (thr->fn == NULL)
+	    break;
+	  team = thr->ts.team;
+	  thr->ts.work_share = &team->work_shares[0];
+	  thr->ts.last_work_share = NULL;
+#ifdef HAVE_SYNC_BUILTINS
+	  thr->ts.single_count = 0;
+#endif
+	  thr->ts.static_trip = 0;
+	  thr->task = &team->implicit_task[thr->ts.team_id];
+	  gomp_init_task (thr->task, team->task_parent_init,
+			  &team->task_icv_init);
+	  thr->task->taskgroup = team->task_taskgroup;
+
 	  local_fn = thr->fn;
 	  local_data = thr->data;
 	  thr->fn = NULL;
@@ -437,6 +454,15 @@ gomp_team_start (void (*fn) (void *), void *data, unsigned nthreads,
   thr->task->taskgroup = taskgroup;
   team->implicit_task[0].icv.nthreads_var = nthreads_var;
   team->implicit_task[0].icv.bind_var = bind_var;
+  /* Copy entirely rather than copy a pointer so that we have an immutable
+     copy for the secondary threads to take from for the time-frame that
+     said secondary threads are executing on.  Can't copy from
+     `team->implicit_task[0].icv` because the primary thread could have
+     adjusted its per-task ICV by the time the secondary thread gets
+     around to initialising things.  */
+  team->task_icv_init = team->implicit_task[0].icv;
+  team->task_parent_init = task;
+  team->task_taskgroup = taskgroup;
 
   if (nthreads == 1)
     return;
@@ -725,26 +751,15 @@ gomp_team_start (void (*fn) (void *), void *data, unsigned nthreads,
 	  else
 	    nthr = pool->threads[i];
 	  __atomic_store_n (&nthr->ts.team, team, MEMMODEL_RELEASE);
-	  nthr->ts.work_share = &team->work_shares[0];
-	  nthr->ts.last_work_share = NULL;
 	  nthr->ts.team_id = i;
 	  nthr->ts.level = team->prev_ts.level + 1;
 	  nthr->ts.active_level = thr->ts.active_level;
 	  nthr->ts.place_partition_off = place_partition_off;
 	  nthr->ts.place_partition_len = place_partition_len;
 	  nthr->ts.def_allocator = thr->ts.def_allocator;
-#ifdef HAVE_SYNC_BUILTINS
-	  nthr->ts.single_count = 0;
-#endif
-	  nthr->ts.static_trip = 0;
 	  nthr->num_teams = thr->num_teams;
 	  nthr->team_num = thr->team_num;
-	  nthr->task = &team->implicit_task[i];
 	  nthr->place = place;
-	  gomp_init_task (nthr->task, task, icv);
-	  team->implicit_task[i].icv.nthreads_var = nthreads_var;
-	  team->implicit_task[i].icv.bind_var = bind_var;
-	  nthr->task->taskgroup = taskgroup;
 	  nthr->fn = fn;
 	  nthr->data = data;
 	  team->ordered_release[i] = &nthr->release;
--
2.43.0