Putting this up for visibility -- I have not yet run all benchmarks in my suite to check for regressions, but given the minimal change I doubt there will be any.
This patch is a "part 2" of the work I'm doing towards PR119588.  In that bug I describe how some internal workloads are hitting performance problems with the creation of many parallel regions one after another.  I'm attempting to reduce the overhead of parallel region creation when using a large number of threads.

I've posted one patch reworking the barrier implementation into something that performs much better in this particular case:
https://gcc.gnu.org/pipermail/gcc-patches/2025-September/695005.html

This is a second patch addressing a different area where I've seen a performance impact.  It's the area that I mentioned in the message below to the GCC mailing list:
https://gcc.gnu.org/pipermail/gcc/2025-September/246618.html

This doesn't address the majority of the non-barrier-related performance overhead, but it's a relatively simple piece of work that moves some re-initialisation of data in existing threads from the primary thread to each secondary thread.  The performance benefit comes because that work is no longer done by one thread while serialised, but by all threads while not serialised.  This re-initialisation scales with the number of threads, which is why it has such an impact on the problematic case I have been looking at.

The general idea is that the values that are re-set on each existing secondary thread are stored in a team data structure by the primary thread instead of being re-set directly on each thread structure.  When a secondary thread in a non-nested team continues around its main work loop, it resets its own thread-local data from that team cache.  Threads that are not going to be used in the next region re-initialise their `team` data structure etc. unnecessarily, but since their `thr->fn` is NULL they exit the do/while loop shortly afterwards, so this makes little difference, and the work is spread across threads so it doesn't seem to block other useful work.

N.b. I've also noticed that the `thr->ts.team` data doesn't need to be reset in the particular case I'm looking into, but I have not made that change in this patch.  It does seem to provide some measurable gains on the micro-benchmark I'm looking at, but it requires much more code (passing information down to `gomp_team_start` about whether this is a re-used team or not), so I've left it for the moment.

------------------------------

I believe there must be more opportunity to reduce the overhead of creating multiple parallel regions in sequence.  The micro-benchmark I'm looking at shows LLVM can do this with very little overhead over a single barrier (parallel region creation being about 16% slower than a barrier for LLVM, as opposed to the 4.3x slower for 144 threads that I'm seeing at the moment for GNU).  I'm hoping this is a straightforward step in the right direction.

As I've mentioned on IRC and on the bugzilla, I also hope to move to only one full barrier between parallel regions (split into two half-barriers providing the necessary synchronisation for exit and entry) instead of the two barriers we currently have.  From there I hope to look a bit closer at data structure re-initialisation when re-using the same team, but as yet I don't have an idea for the actual change I hope to make.
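To make the shape of the change clearer, here is a minimal standalone sketch of the pattern (not libgomp code -- the struct and function names below are invented for illustration): the primary thread publishes the per-region values once into a shared team object, and each secondary thread re-initialises its own per-thread state from that object after the release barrier, so the initialisation work that scales with the number of threads is no longer done serially by the primary thread.

#include <stddef.h>

/* Invented names for illustration only -- these are not libgomp identifiers.  */

struct team_cache
{
  /* Written once per parallel region by the primary thread.  */
  int icv_snapshot;   /* stand-in for the copied task ICVs */
};

struct worker
{
  struct team_cache *team;
  void (*fn) (void *);   /* still set per thread by the primary thread */
  void *data;
  int icv;               /* stand-in for the per-thread task state */
};

/* Stand-in for the pool barrier that releases the secondary threads.  */
void
pool_barrier_wait (void)
{
}

void
worker_loop (struct worker *w)
{
  for (;;)
    {
      /* Wait for the primary thread to set up the next region.  */
      pool_barrier_wait ();

      /* Each secondary thread re-initialises its own state from the team
         cache -- work that the primary thread used to do for every thread
         while the secondaries waited.  */
      w->icv = w->team->icv_snapshot;

      void (*fn) (void *) = w->fn;
      if (fn == NULL)
        break;   /* this thread is not used in the next region */
      fn (w->data);
    }
}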
Signed-off-by: Matthew Malcomson <[email protected]>
---
 libgomp/libgomp.h |  8 ++++++++
 libgomp/team.c    | 30 +++++++++++++++++++-----------
 2 files changed, 27 insertions(+), 11 deletions(-)

diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
index a43398300a5..0dac5871d29 100644
--- a/libgomp/libgomp.h
+++ b/libgomp/libgomp.h
@@ -857,6 +857,14 @@ struct gomp_team
   /* Number of tasks waiting for their completion event to be fulfilled.  */
   unsigned int task_detach_count;
 
+  /* Cache to help initialise each secondary thread's `task` when re-using a
+     team very many times.  This allows each secondary thread to perform the
+     re-initialisation instead of having the primary thread need to perform
+     that initialisation.  */
+  struct gomp_task_icv task_icv_init;
+  struct gomp_task *task_parent_init;
+  struct gomp_taskgroup *task_taskgroup;
+
   /* This array contains structures for implicit tasks.  */
   struct gomp_task implicit_task[];
 };
diff --git a/libgomp/team.c b/libgomp/team.c
index cb1d3235312..984e90f56d1 100644
--- a/libgomp/team.c
+++ b/libgomp/team.c
@@ -131,6 +131,16 @@ gomp_thread_start (void *xdata)
           gomp_finish_task (task);
 
           gomp_simple_barrier_wait (&pool->threads_dock);
+          team = thr->ts.team;
+          thr->ts.work_share = &team->work_shares[0];
+          thr->ts.last_work_share = NULL;
+#ifdef HAVE_SYNC_BUILTINS
+          thr->ts.single_count = 0;
+#endif
+          thr->ts.static_trip = 0;
+          thr->task = &team->implicit_task[thr->ts.team_id];
+          gomp_init_task (thr->task, team->task_parent_init, &team->task_icv_init);
+          thr->task->taskgroup = team->task_taskgroup;
 
           local_fn = thr->fn;
           local_data = thr->data;
@@ -383,6 +393,15 @@ gomp_team_start (void (*fn) (void *), void *data, unsigned nthreads,
   thr->task->taskgroup = taskgroup;
   team->implicit_task[0].icv.nthreads_var = nthreads_var;
   team->implicit_task[0].icv.bind_var = bind_var;
+  /* Copy entirely rather than copy pointer so that we have an immutable copy
+     for the secondary threads to take from for the time-frame that said
+     secondary threads are executing on.  Can't copy from
+     `team->implicit_task[0].icv` because primary thread could have adjusted
+     its per-task ICV by the time the secondary thread gets around to
+     initialising things.  */
+  team->task_icv_init = team->implicit_task[0].icv;
+  team->task_parent_init = task;
+  team->task_taskgroup = taskgroup;
 
   if (nthreads == 1)
     return;
@@ -641,26 +660,15 @@ gomp_team_start (void (*fn) (void *), void *data, unsigned nthreads,
       else
         nthr = pool->threads[i];
       nthr->ts.team = team;
-      nthr->ts.work_share = &team->work_shares[0];
-      nthr->ts.last_work_share = NULL;
       nthr->ts.team_id = i;
       nthr->ts.level = team->prev_ts.level + 1;
       nthr->ts.active_level = thr->ts.active_level;
       nthr->ts.place_partition_off = place_partition_off;
       nthr->ts.place_partition_len = place_partition_len;
       nthr->ts.def_allocator = thr->ts.def_allocator;
-#ifdef HAVE_SYNC_BUILTINS
-      nthr->ts.single_count = 0;
-#endif
-      nthr->ts.static_trip = 0;
       nthr->num_teams = thr->num_teams;
       nthr->team_num = thr->team_num;
-      nthr->task = &team->implicit_task[i];
       nthr->place = place;
-      gomp_init_task (nthr->task, task, icv);
-      team->implicit_task[i].icv.nthreads_var = nthreads_var;
-      team->implicit_task[i].icv.bind_var = bind_var;
-      nthr->task->taskgroup = taskgroup;
       nthr->fn = fn;
       nthr->data = data;
       team->ordered_release[i] = &nthr->release;
-- 
2.43.0
