> -----Original Message-----
> From: [email protected] <[email protected]>
> Sent: 26 November 2025 18:42
> To: [email protected]; Jakub Jelinek <[email protected]>; Tobias
> Burnus <[email protected]>
> Cc: Julian Brown <[email protected]>; Thomas Schwinge
> <[email protected]>; Andrew Stubbs <[email protected]>; Tom de
> Vries <[email protected]>; Sebastian Huber <sebastian.huber@embedded-
> brains.de>; Matthew Malcomson <[email protected]>
> Subject: [PATCH 5/5] libgomp: Move thread task re-initialisation into
> threads
> 
> From: Matthew Malcomson <[email protected]>
> 
> This patch is a separate, beneficial step of the work I'm doing towards
> PR119588.  In that bug I describe how some internal workloads are
> hitting performance problems w.r.t. the creation of many parallel
> regions one after another.
> 
> I'm attempting to reduce the overhead of parallel region creation when
> using a large number of threads.  The two earlier patches in this
> patchset rework the barrier implementation into something that performs
> much better in this particular case.
> This patch is built on top of those, but with an appropriate rebase it
> could be independent.  It addresses a different area where I've seen a
> performance impact, namely the one I mentioned in this mail to the GCC
> mailing list:
> https://gcc.gnu.org/pipermail/gcc/2025-September/246618.html
> 
> The fundamental point is that, for LLVM, the cost of teardown & setup
> of a parallel region is very close to the cost of a single barrier.
> For us there is a significant chunk of extra work in the actual
> teardown and setup.
> 
> When re-using a team with no changes in the affinity ICV etc., it would
> seem that the ideal case is similar to LLVM's (parallel region creation
> being about 16% slower than a barrier for LLVM, as opposed to the 4.3x
> slowdown for 144 threads that I'm seeing at the moment for GNU).
> This patch gets nowhere near that ideal.  In this patch I observe that
> the majority of the performance overhead in this case comes from
> re-initialising the existing threads.  While all existing threads are
> waiting on the primary thread, that primary thread is iterating over
> each secondary thread's data structures and setting them up as needed
> for the next parallel region.
> 
> There are a number of stores that could relatively easily be performed
> in each secondary thread after it is released by the primary thread --
> removing this work from the serialised region.  This patch proposes
> that much simpler setup.  This re-initialisation scales with the number
> of threads, which is why it has such an impact on the problematic case
> I have been looking at.
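> 
> To illustrate the shape of the change, here is a rough sketch, heavily
> simplified and abbreviated from the diff below (so not the exact
> libgomp code):
> 
>   /* Before: the primary thread initialises every secondary thread's
>      state while those threads wait, serialising the work.  */
>   for (i = 1; i < nthreads; i++)
>     {
>       struct gomp_thread *nthr = pool->threads[i];
>       nthr->ts.work_share = &team->work_shares[0];
>       nthr->task = &team->implicit_task[i];
>       gomp_init_task (nthr->task, task, icv);
>       nthr->fn = fn;
>     }
> 
>   /* After: the primary thread publishes the shared inputs once on the
>      team, and each secondary thread re-initialises its own state in
>      gomp_thread_start once it has been released.  */
>   team->task_icv_init = team->implicit_task[0].icv;
>   team->task_parent_init = task;
>   team->task_taskgroup = taskgroup;
>   /* ...and in each secondary thread: */
>   thr->task = &team->implicit_task[thr->ts.team_id];
>   gomp_init_task (thr->task, team->task_parent_init, &team->task_icv_init);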
> 
> The only subtlety here is that, in order to ensure memory safety, we
> need to avoid performing this initialisation when `thr->fn == NULL`.
> If a secondary non-nested thread gets "left to exit" when a new team is
> started, the primary thread could race ahead, cache the new team, and
> free the team that this secondary thread was using, all before the
> secondary thread goes through the new initialisation code.  That could
> lead to a use-after-free.  Hence we have to early-exit when this thread
> is not going to perform any more work.
> - N.b. when running TSAN before and after, I realised that this change
>   happens to remove a race between `gomp_finish_task` in the secondary
>   thread and `gomp_init_task` in the primary thread (when re-using a
>   team, and hence `nthr->task` is in the same place as the thread's
>   local `task` variable).
> 
> All threads that are not going to be used in the next region would
> otherwise unnecessarily re-init their `team` data structure etc., but
> since `thr->fn` is NULL they exit the do/while loop before that, so it
> makes little difference, and in any case the work is spread across
> threads so it shouldn't block other useful work.
> 
> I imagine there is a good chunk of performance to be gained by adding
> more logic here:
> 1) Something recording whether the affinity setup has changed since the
>    last non-nested team and, if not, removing the calculation of places
>    and the assignment to the `nthr->ts.place_partition_*` variables.
> 2) I could move more members that are assigned multiple times to `team`
>    for each secondary thread to take:
>    - `data` pointer.
>    - `num_teams`.
>    - `team_num`.
>    - Honestly, I should have looked into this for this patch -- I
>      noticed it while writing this cover letter and will look into
>      whether it is feasible relatively soon.
> 3) If we can identify that we're re-using a team from the last parallel
>    region (and the affinity ICV has not changed) it seems that we could
>    avoid re-initialising some of its fields (see the sketch below):
>    - ordered_release[i] should already point to `nthr`?
>    - ts.team_id should already be set.
> Overall, if we can identify that we're using the same team as was
> cached and we don't need to change anything, we should be able to get
> away with drastically less work in the serial part of the call to
> GOMP_parallel.
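> 
> Purely as a hypothetical sketch of the kind of check item 3 could use
> (the `team_reused` and `affinity_unchanged` flags below are made up for
> illustration and do not exist in libgomp today):
> 
>   /* Hypothetical: skip per-thread stores that are already correct when
>      the team cached from the previous parallel region is re-used and
>      the affinity ICV has not changed.  */
>   if (!team_reused || !affinity_unchanged)
>     {
>       nthr->ts.team_id = i;
>       nthr->ts.place_partition_off = place_partition_off;
>       nthr->ts.place_partition_len = place_partition_len;
>       team->ordered_release[i] = &nthr->release;
>     }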
> 
> I did run the micro-benchmark I'm looking at (mentioned in PR119588)
> on some hacked-up versions of the above and they seemed to provide
> measurable gains.  I haven't tried to put them all in this patch due to
> lack of time and the relatively higher complexity of the bookkeeping.
> I'm hoping this is a straightforward step in the right direction.
> 
> -------
> Testing done (note -- much of the testing done on top of my other
> patches rather than by itself):
> - Bootstrap & regtest on x86_64 and aarch64
>   - Testsuite ran with OMP_WAIT_POLICY=PASSIVE as well.
>   - Also done with `--enable-linux-futex=no` for posix/ target.
>   - Also done with `_LIBGOMP_CHECKING_` set for assertions.
> - Cross compiler & regtest on arm (qemu).
> - TSAN done combining all my other patches upstream.
Hi,
ping: https://gcc.gnu.org/pipermail/gcc-patches/2025-November/702030.html

Thanks,
Prathamesh
> 
> libgomp/ChangeLog:
> 
>         PR libgomp/119588
>         * libgomp.h (struct gomp_team): Add fields for communication
>         at team start time.
>         * team.c (gomp_thread_start): Initialise thread-local data
>         using members stored on the team.
>         (gomp_team_start): Primary thread stores data on the team
>         structure instead of copying that data to each secondary
>         thread's gomp_thread structure.
> 
> Signed-off-by: Matthew Malcomson <[email protected]>
> ---
>  libgomp/libgomp.h |  8 ++++++++
>  libgomp/team.c    | 37 ++++++++++++++++++++++++++-----------
>  2 files changed, 34 insertions(+), 11 deletions(-)
> 
> diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
> index 4269a3f4e60..b188567ca28 100644
> --- a/libgomp/libgomp.h
> +++ b/libgomp/libgomp.h
> @@ -866,6 +866,14 @@ struct gomp_team
>    /* Number of tasks waiting for their completion event to be fulfilled.  */
>    unsigned int task_detach_count;
> 
> +  /* Cache to help initialise each secondary thread's `task` when re-using a
> +     team very many times.  This allows each secondary thread to perform the
> +     re-initialisation instead of having the primary thread need to perform
> +     that initialisation.  */
> +  struct gomp_task_icv task_icv_init;
> +  struct gomp_task *task_parent_init;
> +  struct gomp_taskgroup *task_taskgroup;
> +
>    /* This array contains structures for implicit tasks.  */
>    struct gomp_task implicit_task[];
>  };
> diff --git a/libgomp/team.c b/libgomp/team.c
> index 30aefcd22b6..0553b8d83a3 100644
> --- a/libgomp/team.c
> +++ b/libgomp/team.c
> @@ -158,6 +158,23 @@ gomp_thread_start (void *xdata)
>           if (!gomp_barrier_can_hold (&team->barrier) || thr->fn == NULL)
>             gomp_simple_barrier_wait (&pool->threads_dock);
> 
> +         /* Early exit because if `thr->fn == NULL` then we can't guarantee
> +            the `thr->ts.team` hasn't been freed by the primary thread racing
> +            ahead.  Hence we don't want to write to it.  */
> +         if (thr->fn == NULL)
> +           break;
> +         team = thr->ts.team;
> +         thr->ts.work_share = &team->work_shares[0];
> +         thr->ts.last_work_share = NULL;
> +#ifdef HAVE_SYNC_BUILTINS
> +         thr->ts.single_count = 0;
> +#endif
> +         thr->ts.static_trip = 0;
> +         thr->task = &team->implicit_task[thr->ts.team_id];
> +         gomp_init_task (thr->task, team->task_parent_init,
> +                         &team->task_icv_init);
> +         thr->task->taskgroup = team->task_taskgroup;
> +
>           local_fn = thr->fn;
>           local_data = thr->data;
>           thr->fn = NULL;
> @@ -437,6 +454,15 @@ gomp_team_start (void (*fn) (void *), void *data, unsigned nthreads,
>    thr->task->taskgroup = taskgroup;
>    team->implicit_task[0].icv.nthreads_var = nthreads_var;
>    team->implicit_task[0].icv.bind_var = bind_var;
> +  /* Copy entirely rather than copy pointer so that we have an immutable copy
> +     for the secondary threads to take from for the time-frame that said
> +     secondary threads are executing on.  Can't copy from
> +     `team->implicit_task[0].icv` because primary thread could have adjusted
> +     its per-task ICV by the time the secondary thread gets around to
> +     initialising things.   */
> +  team->task_icv_init = team->implicit_task[0].icv;
> +  team->task_parent_init = task;
> +  team->task_taskgroup = taskgroup;
> 
>    if (nthreads == 1)
>      return;
> @@ -725,26 +751,15 @@ gomp_team_start (void (*fn) (void *), void *data, unsigned nthreads,
>           else
>             nthr = pool->threads[i];
>           __atomic_store_n (&nthr->ts.team, team, MEMMODEL_RELEASE);
> -         nthr->ts.work_share = &team->work_shares[0];
> -         nthr->ts.last_work_share = NULL;
>           nthr->ts.team_id = i;
>           nthr->ts.level = team->prev_ts.level + 1;
>           nthr->ts.active_level = thr->ts.active_level;
>           nthr->ts.place_partition_off = place_partition_off;
>           nthr->ts.place_partition_len = place_partition_len;
>           nthr->ts.def_allocator = thr->ts.def_allocator;
> -#ifdef HAVE_SYNC_BUILTINS
> -         nthr->ts.single_count = 0;
> -#endif
> -         nthr->ts.static_trip = 0;
>           nthr->num_teams = thr->num_teams;
>           nthr->team_num = thr->team_num;
> -         nthr->task = &team->implicit_task[i];
>           nthr->place = place;
> -         gomp_init_task (nthr->task, task, icv);
> -         team->implicit_task[i].icv.nthreads_var = nthreads_var;
> -         team->implicit_task[i].icv.bind_var = bind_var;
> -         nthr->task->taskgroup = taskgroup;
>           nthr->fn = fn;
>           nthr->data = data;
>           team->ordered_release[i] = &nthr->release;
> --
> 2.43.0
