wake ((int *) &bar->generation, count == 0 ? INT_MAX : count);
+}
+
+void
+gomp_team_barrier_wait_end (gomp_barrier_t *bar, gomp_barrier_state_t state)
+{
+  unsigned int generation, gen;
+
+  if (__builtin_expect (state & BAR_WAS_LAST, 0))
+    {
+      /* Next time we'll be awaiting TOTAL threads again.  */
+      struct gomp_thread *thr = gomp_thread ();
+      struct gomp_team *team = thr->ts.team;
+
+      bar->awaited = bar->total;
+      team->work_share_cancelled = 0;
+      if (__builtin_expect (team->task_count, 0))
+        {
+          gomp_barrier_handle_tasks (state);
+          state &= ~BAR_WAS_LAST;
+        }
+      else
+        {
+          state &= ~BAR_CANCELLED;
+          state += BAR_INCR - BAR_WAS_LAST;
+          __atomic_store_n (&bar->generation, state, MEMMODEL_RELEASE);
+          futex_wake ((int *) &bar->generation, INT_MAX);
+          return;
+        }
+    }
+
+  generation = state;
+  state &= ~BAR_CANCELLED;
+  do
+    {
+      do_wait ((int *) &bar->generation, generation);
+      gen = __atomic_load_n (&bar->generation, MEMMODEL_ACQUIRE);
+      if (__builtin_expect (gen & BAR_TASK_PENDING, 0))
+        {
+          gomp_barrier_handle_tasks (state);
+          gen = __atomic_load_n (&bar->generation, MEMMODEL_ACQUIRE);
+        }
+      generation |= gen & BAR_WAITING_FOR_TASK;
+    }
+  while (gen != state + BAR_INCR);
+}
+
+void
+gomp_team_barrier_wait (gomp_barrier_t *bar)
+{
+  gomp_team_barrier_wait_end (bar, gomp_barrier_wait_start (bar));
+}
+
+void
+gomp_team_barrier_wait_final (gomp_barrier_t *bar)
+{
+  gomp_barrier_state_t state = gomp_barrier_wait_final_start (bar);
+  if (__builtin_expect (state & BAR_WAS_LAST, 0))
+    bar->awaited_final = bar->total;
+  gomp_team_barrier_wait_end (bar, state);
+}
+
+bool
+gomp_team_barrier_wait_cancel_end (gomp_barrier_t *bar,
+                                   gomp_barrier_state_t state)
+{
+  unsigned int generation, gen;
+
+  if (__builtin_expect (state & BAR_WAS_LAST, 0))
+    {
+      /* Next time we'll be awaiting TOTAL threads again.  */
+      /* BAR_CANCELLED should never be set in state here, because
+         cancellation means that at least one of the threads has been
+         cancelled, thus on a cancellable barrier we should never see
+         all threads to arrive.  */
+      struct gomp_thread *thr = gomp_thread ();
+      struct gomp_team *team = thr->ts.team;
+
+      bar->awaited = bar->total;
+      team->work_share_cancelled = 0;
+      if (__builtin_expect (team->task_count, 0))
+        {
+          gomp_barrier_handle_tasks (state);
+          state &= ~BAR_WAS_LAST;
+        }
+      else
+        {
+          state += BAR_INCR - BAR_WAS_LAST;
+          __atomic_store_n (&bar->generation, state, MEMMODEL_RELEASE);
+          futex_wake ((int *) &bar->generation, INT_MAX);
+          return false;
+        }
+    }
+
+  if (__builtin_expect (state & BAR_CANCELLED, 0))
+    return true;
+
+  generation = state;
+  do
+    {
+      do_wait ((int *) &bar->generation, generation);
+      gen = __atomic_load_n (&bar->generation, MEMMODEL_ACQUIRE);
+      if (__builtin_expect (gen & BAR_CANCELLED, 0))
+        return true;
+      if (__builtin_expect (gen & BAR_TASK_PENDING, 0))
+        {
+          gomp_barrier_handle_tasks (state);
+          gen = __atomic_load_n (&bar->generation, MEMMODEL_ACQUIRE);
+        }
+      generation |= gen & BAR_WAITING_FOR_TASK;
+    }
+  while (gen != state + BAR_INCR);
+
+  return false;
+}
+
+bool
+gomp_team_barrier_wait_cancel (gomp_barrier_t *bar)
+{
+  return gomp_team_barrier_wait_cancel_end (bar, gomp_barrier_wait_start (bar));
+}
+
+void
+gomp_team_barrier_cancel (struct gomp_team *team)
+{
+  gomp_mutex_lock (&team->task_lock);
+  if (team->barrier.generation & BAR_CANCELLED)
+    {
+      gomp_mutex_unlock (&team->task_lock);
+      return;
+    }
+  team->barrier.generation |= BAR_CANCELLED;
+  gomp_mutex_unlock (&team->task_lock);
+  futex_wake ((int *) &team->barrier.generation, INT_MAX);
+}
--
2.8.1
From 2a621905bb91475e792ee1be9f06ea6145df0bc2 Mon Sep 17 00:00:00 2001
From: Chung-Lin Tang
Date: Thu, 1 Sep 2022 07:04:42 -0700
Subject: [PATCH 2/2] openmp/nvptx: use bar.sync/arrive for barriers when
tasking is not used

The nvptx implementation of futex_wait/wake ops, while it enables OpenMP task
behavior in nvptx offloaded regions, can cause quite significant performance
regressions on some benchmarks.
However, when task-related functionality is not used at all by the team inside
an OpenMP target region, and a barrier is just a point where all threads
rejoin (with no waiting tasks to restart), the barrier can be implemented with
the simple bar.sync and bar.arrive PTX instructions, bypassing the
heavy-weight nvptx task machinery (see the sketch below).
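
As a rough sketch of the mechanism (not the patch itself): the helper names
below are hypothetical, and the scaling by 32 assumes the libgomp nvptx
convention that each OpenMP thread occupies a full warp; only the
bar.sync/bar.arrive mnemonics are taken from the PTX ISA.

/* Block until NTHREADS OpenMP threads have arrived at logical barrier 1.
   bar.sync takes a hardware-thread count, which must be a multiple of the
   warp size (32).  */
static inline void
team_barrier_sync (unsigned nthreads)
{
  asm volatile ("bar.sync 1, %0;" : : "r" (32 * nthreads) : "memory");
}

/* Signal arrival at logical barrier 1 without blocking.  */
static inline void
team_barrier_arrive (unsigned nthreads)
{
  asm volatile ("bar.arrive 1, %0;" : : "r" (32 * nthreads) : "memory");
}

A thread that only needs to signal the barrier can use bar.arrive, while the
remaining threads block in bar.sync; once any task has been created, the
futex-based slow path is still needed so that sleeping threads can wake up
and execute pending tasks.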
This patch adds a 'task_never_used' flag inside struct gomp_team, initialized
to true and set to false when tasks are added to the team. The nvptx-specific
g