Re: [gomp4] lowering OpenACC reductions

2015-08-26 Thread Cesar Philippidis
On 08/21/2015 02:00 PM, Cesar Philippidis wrote:

 This patch teaches omplower how to utilize the new OpenACC reduction
 framework described in Nathan's document, which was posted here
 https://gcc.gnu.org/ml/gcc-patches/2015-08/msg01248.html. Here is the
 infrastructure patch
 https://gcc.gnu.org/ml/gcc-patches/2015-08/msg01130.html, and here are
 the nvptx backend changes
 https://gcc.gnu.org/ml/gcc-patches/2015-08/msg01334.html. The updated
 reduction tests have been posted here
 https://gcc.gnu.org/ml/gcc-patches/2015-07/msg01561.html.

All of these patches have been committed to gomp-4_0-branch.

Cesar


[gomp4] lowering OpenACC reductions

2015-08-21 Thread Cesar Philippidis
This patch teaches omplower how to utilize the new OpenACC reduction
framework described in Nathan's document, which was posted here
https://gcc.gnu.org/ml/gcc-patches/2015-08/msg01248.html. Here is the
infrastructure patch
https://gcc.gnu.org/ml/gcc-patches/2015-08/msg01130.html, and here are
the nvptx backend changes
https://gcc.gnu.org/ml/gcc-patches/2015-08/msg01334.html. The updated
reduction tests have been posted here
https://gcc.gnu.org/ml/gcc-patches/2015-07/msg01561.html.

The existing reduction code in gomp-4_0-branch does a couple of
quirky things, like creating a special ganglocal copy for the private
reduction variables. Those ganglocal variables were mapped into shared
memory for nvidia GPUs and into a special malloc'ed buffer for
everything else. That worked, but it was too target-specific and it
didn't solve the vector reduction problem. Part of this patch eliminates
the need for that ganglocal data, at least in the lowering code.

Looking at this patch, you might see a reference to fake gang
reductions. The idea behind that, which Nathan describes in his design
document, is that only gangs can access global data mappings, not
workers or vectors. This restriction allows us to cascade multiple
reductions with multiple levels of parallelism using a common interface.
Here's a worker reduction example taken from Nathan's design:

  //#pragma acc parallel loop worker copy(a) reduction (+:a)
  {
    // Insert dummy gang reduction at start.
    // Note this uses the same RID & LID as the inner worker loop.
    a = IFN_SETUP (ompstruct->a, a, GANG, +, 0, 0)
    a = IFN_INIT (ompstruct->a, a, GANG, +, 0, 0)
    #loop worker reduction(+:a)
    a = IFN_SETUP (NULL, a, WORKER, +, 0, 0)
    IFN_FORK (WORKER)
    a = IFN_INIT (NULL, a, WORKER, +, 0, 0)
    for (...) { ... }
    IFN_LOCK (WORKER, 0)
    a = IFN_FINI (NULL, a, WORKER, +, 0, 0)
    IFN_UNLOCK (WORKER, 0)
    IFN_JOIN (WORKER)
    a = IFN_TEARDOWN (NULL, a, WORKER, +, 0, 0)
    // Dummy gang reduction at end
    a = IFN_FINI (ompstruct->a, a, GANG, +, 0, 0)
    a = IFN_TEARDOWN (ompstruct->a, a, GANG, +, 0, 0)
  }

Note that while this loop doesn't have a gang associated with it, it
does have a fake gang reduction to update the original value. If 'a' was
private, then the gang reduction wouldn't be necessary.
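
For contrast, here's what the same worker loop reduces to when 'a' is
not a global data mapping (a sketch in the same notation, not taken
from the patch): the dummy gang-level SETUP/INIT/FINI/TEARDOWN calls
simply disappear, and only the real worker reduction remains:

  //#pragma acc parallel loop worker reduction (+:a)  // 'a' is private
  {
    #loop worker reduction(+:a)
    a = IFN_SETUP (NULL, a, WORKER, +, 0, 0)
    IFN_FORK (WORKER)
    a = IFN_INIT (NULL, a, WORKER, +, 0, 0)
    for (...) { ... }
    IFN_LOCK (WORKER, 0)
    a = IFN_FINI (NULL, a, WORKER, +, 0, 0)
    IFN_UNLOCK (WORKER, 0)
    IFN_JOIN (WORKER)
    a = IFN_TEARDOWN (NULL, a, WORKER, +, 0, 0)
  }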

Now for the reduction changes. Starting with the gimplifier, you'll note
that I introduced a function to rewrite reference-typed variables as
non-references. This was initially done to solve the problem with
Fortran subroutines, but I'm also using it for reductions that are not
associated with loops (e.g. 'acc parallel reduction (+:foo) copy
(foo)'). The justification for this variable rewriting is as follows:

  * The gimplifier expands reference types to use indirection before it
reaches omplower. So if I were to wait for omplower to rewrite the
variable, I'd have to rewrite possibly three instructions instead of
just one. This solution is just a little more straightforward.

  * Non-loop reductions are kind of tricky. On one hand, we want the
global copy of the reduction variable to be mapped onto the
accelerator. On the other hand, we don't want the code inside the
parallel region to use the global copy. That's why I introduced
a new copy of the reduction variable in the gimplifier.

The way that reductions work in acc loops is that each loop creates
a private copy of the reduction variable. Then, when it comes time to
update the original global copy, the lowering code gets the reference
to the reduction variable from its parent omp_context. There's no
parent context for parallel constructs, so the private copy of the
reduction variable would be overwritten. Hence, the gimplifier pass
attaches a private variable to the omp clause itself.
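
To make the non-loop case concrete, here's a minimal source-level
example of the construct in question (my own illustration, not taken
from the testsuite). The reduction is attached directly to the parallel
construct, so there is no loop context to hold the private copy:

  #include <stdio.h>

  int
  main (void)
  {
    int foo = 0;

    /* 'foo' is mapped via copy, but the code inside the region must
       operate on a private copy; the mapped global copy is only
       updated when the reduction is finalized.  */
    #pragma acc parallel reduction (+:foo) copy (foo) num_gangs (4)
    foo++;

    /* Expect one increment per gang (4 if the implementation honors
       num_gangs).  */
    printf ("foo = %d\n", foo);
    return 0;
  }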

If anyone has a better solution for either of these two problems,
let me know.

The next major change is that lower_omp_for is responsible for inserting
the calls to GOACC_FORK and GOACC_JOIN. One thing that does concern me
about this change is that par-loops will need to become aware of this
and insert those calls as necessary. Technically, it should be OK for
now because par-loops doesn't support workers and vectors yet, but if
we go with this change, par-loops will need to be updated eventually.
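
In terms of the notation used above, my reading is that the fork/join
markers from the earlier example now correspond to the GOACC_FORK and
GOACC_JOIN calls emitted by lower_omp_for, roughly like this (a sketch,
not copied from the dumps):

  a = IFN_SETUP (NULL, a, WORKER, +, 0, 0)
  GOACC_FORK (WORKER)                      // emitted by lower_omp_for
  a = IFN_INIT (NULL, a, WORKER, +, 0, 0)
  for (...) { ... }
  IFN_LOCK (WORKER, 0)
  a = IFN_FINI (NULL, a, WORKER, +, 0, 0)
  IFN_UNLOCK (WORKER, 0)
  GOACC_JOIN (WORKER)                      // emitted by lower_omp_for
  a = IFN_TEARDOWN (NULL, a, WORKER, +, 0, 0)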

Is this ok for gomp-4_0-branch?

Cesar
2015-08-21  Cesar Philippidis  ce...@codesourcery.com

	gcc/
	* gimplify.c (struct privatize_reduction): New struct.
	(localize_reductions_r): New function.
	(localize_reductions): New function.
	(gimplify_omp_for): Use it.
	(gimplify_omp_workshare): Likewise.
	* omp-low.c (struct omp_context): Remove reduction_map and
	oacc_reduction_set. Add 'int reductions'.
	(oacc_gang_reduction_init): New gimple_seq to contain initialization
	code for fake gang reductions.
	(oacc_gang_reduction_fini): Ditto, but for finalization code.
	(extract_oacc_loop_mask): New function.
	(is_oacc_reduction_private): New function.