Re: [hsa 4/12] OpenMP lowering/expansion changes (gridification)

2015-11-12 Thread Jakub Jelinek
On Thu, Nov 05, 2015 at 10:57:33PM +0100, Martin Jambor wrote:
> the patch in this email contains the changes to make our OpenMP
> lowering and expansion machinery produce GPU kernels for a certain
> limited class of loops.  The plan is to make that class quite a bit
> bigger, but only the following is ready for submission now.
> 
> Basically, whenever the compiler configured for HSAIL generation
> encounters the following pattern:
> 
>   #pragma omp target
>   #pragma omp teams thread_limit(workgroup_size) // thread_limit is optional
>   #pragma omp distribute parallel for firstprivate(n) private(i) other_sharing_clauses()
>     for (i = 0; i < n; i++)
>       some_loop_body

Do you support only lb 0 or any constant?  Only step 1?  Can the
ub be constant, or just a variable?  If you need the number of iterations
computed before GOMP_target_ext, supposedly you also need to check that
n can't change in between the target and the distribute (e.g. if it is
an addressable or global var and there are statements in between).
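
For illustration (a made-up variant of the pattern above, not from the
patch): if n is a global and something like

  int n;    /* global, so code inside the target region can modify it */
  ...
  #pragma omp target
  #pragma omp teams
    {
      n = read_new_bound ();   /* changes n after the host computed the count */
  #pragma omp distribute parallel for
      for (i = 0; i < n; i++)
        some_loop_body
    }

sits in between, an iteration count computed in front of GOMP_target_ext
would already be stale.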

What about schedule or dist_schedule clauses?  Only schedule(auto) or a
missing schedule clause guarantees that you can distribute the work among
the threads any way the compiler wants.
dist_schedule is always static, but it could have a different chunk_size.
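
E.g. (illustrative only):

  #pragma omp distribute parallel for schedule(static, 4)
  for (i = 0; i < n; i++)
    some_loop_body

with an explicit schedule(static, 4), iterations are handed to threads in
fixed round-robin chunks of four, so the compiler is no longer free to map
them onto the grid in an arbitrary way.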

The current int num_threads, int thread_limit GOMP_target_ext arguments
could perhaps be changed to something like int num_args, long *args,
where args[0] would be the current num_threads and args[1] the current
thread_limit, and if an offloading target might benefit from knowing
the number of iterations of a distribute parallel for that is the only
important statement inside, you could perhaps pass it as args[2] and pass
3 instead of 2 as num_args.  That would be something reasonably generic
rather than HSA specific, and extensible.  But, looking at your
kernel_launch structure, you want something like multiple dimensions and
to compute each dimension separately rather than combining (collapsing)
all dimensions together, which is what OpenMP expansion does right now.
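
A rough sketch of that idea (the prototype below elides the mapping
arguments and all names are made up here, purely for illustration):

  /* Hypothetical, simplified interface: the meaning of each slot of ARGS
     is given by its position, so new values can be appended later.  */
  extern void GOMP_target_ext_sketch (int device, void (*fn) (void *),
                                      int num_args, long *args);

  static void
  launch (int device, void (*fn) (void *), long num_threads,
          long thread_limit, long niters)
  {
    long args[3];
    args[0] = num_threads;   /* current num_threads value */
    args[1] = thread_limit;  /* current thread_limit value */
    args[2] = niters;        /* distribute parallel for iteration count */
    GOMP_target_ext_sketch (device, fn, 3, args);
  }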

> While we have also been experimenting quite a bit with dynamic
> parallelism, we have only been able to achieve good performance
> via this process of gridification.  The user can be notified whether a
> particular target construct was gridified or not via our process of
> dumping notes, which, however, only appear in the detailed dump.  I am
> seriously considering emitting some kind of warning when an HSA-enabled
> compiler is about to produce non-gridified target code.

But then it would warn on pretty much all of the libgomp testsuite tests
that have target constructs in them...

> @@ -547,13 +548,13 @@ DEF_FUNCTION_TYPE_7 (BT_FN_VOID_INT_SIZE_PTR_PTR_PTR_UINT_PTR,

> --- a/gcc/fortran/types.def
> +++ b/gcc/fortran/types.def
> @@ -145,6 +145,7 @@ DEF_FUNCTION_TYPE_3 (BT_FN_VOID_VPTR_I2_INT, BT_VOID, BT_VOLATILE_PTR, BT_I2, BT_INT)
>  DEF_FUNCTION_TYPE_3 (BT_FN_VOID_VPTR_I4_INT, BT_VOID, BT_VOLATILE_PTR, BT_I4, BT_INT)
>  DEF_FUNCTION_TYPE_3 (BT_FN_VOID_VPTR_I8_INT, BT_VOID, BT_VOLATILE_PTR, BT_I8, BT_INT)
>  DEF_FUNCTION_TYPE_3 (BT_FN_VOID_VPTR_I16_INT, BT_VOID, BT_VOLATILE_PTR, BT_I16, BT_INT)
> +DEF_FUNCTION_TYPE_3 (BT_FN_VOID_PTR_INT_PTR, BT_VOID, BT_PTR, BT_INT, BT_PTR)
>  
>  DEF_FUNCTION_TYPE_4 (BT_FN_VOID_OMPFN_PTR_UINT_UINT,
>  		     BT_VOID, BT_PTR_FN_VOID_PTR, BT_PTR, BT_UINT, BT_UINT)
> @@ -215,9 +216,9 @@ DEF_FUNCTION_TYPE_7 (BT_FN_VOID_INT_SIZE_PTR_PTR_PTR_UINT_PTR,
>  DEF_FUNCTION_TYPE_8 (BT_FN_VOID_OMPFN_PTR_UINT_LONG_LONG_LONG_LONG_UINT,
>  		     BT_VOID, BT_PTR_FN_VOID_PTR, BT_PTR, BT_UINT,
>  		     BT_LONG, BT_LONG, BT_LONG, BT_LONG, BT_UINT)
> -DEF_FUNCTION_TYPE_8 (BT_FN_VOID_INT_OMPFN_SIZE_PTR_PTR_PTR_UINT_PTR,
> +DEF_FUNCTION_TYPE_9 (BT_FN_VOID_INT_OMPFN_SIZE_PTR_PTR_PTR_UINT_PTR_PTR,
> 		     BT_VOID, BT_INT, BT_PTR_FN_VOID_PTR, BT_SIZE, BT_PTR,
> -		     BT_PTR, BT_PTR, BT_UINT, BT_PTR)
> +		     BT_PTR, BT_PTR, BT_UINT, BT_PTR, BT_PTR)

You'd need to move it if you add arguments (but as I said on the other
patch, this won't really apply on top of the trunk anyway).

> --- a/gcc/gimple.h
> +++ b/gcc/gimple.h
> @@ -153,6 +153,7 @@ enum gf_mask {
>      GF_OMP_FOR_KIND_TASKLOOP = 2,
>      GF_OMP_FOR_KIND_CILKFOR = 3,
>      GF_OMP_FOR_KIND_OACC_LOOP = 4,
> +    GF_OMP_FOR_KIND_KERNEL_BODY = 5,
>      /* Flag for SIMD variants of OMP_FOR kinds.  */
>      GF_OMP_FOR_SIMD = 1 << 3,
>      GF_OMP_FOR_KIND_SIMD = GF_OMP_FOR_SIMD | 0,
> @@ -621,8 +622,24 @@ struct GTY((tag("GSS_OMP_FOR")))
>    /* [ WORD 11 ]
>       Pre-body evaluated before the loop body begins.  */
>    gimple_seq pre_body;
> +
> +  /* [ WORD 12 ]
> +     If set, this statement is part of a gridified kernel, its clauses need to
> +     be scanned and lowered but the statement should be discarded after
> +     lowering.  */
> +  bool kernel_phony;

A bool flag is better put as a GF_OMP_* flag; there are spare bits left for that.

Re: [hsa 4/12] OpenMP lowering/expansion changes (gridification)

2015-11-09 Thread Martin Jambor
Hi,

On Thu, Nov 05, 2015 at 10:57:33PM +0100, Martin Jambor wrote:
> 
 ... 
> 
> For the convenience of anybody reviewing the code, I'm attaching a very
> simple testcase with a selection of dumps that illustrate the whole
> process.
> 

My apologies, I forgot to attach the file, so let me quickly correct
that now.  The tar file contains the source and a selection of dumps
generated by a compilation with the "-fopenmp -O -S -fdump-tree-all
-fdump-tree-omplower-details" flags.

Thanks,

Martin


plusone.tgz
Description: application/compressed-tar


[hsa 4/12] OpenMP lowering/expansion changes (gridification)

2015-11-05 Thread Martin Jambor
Hi,

the patch in this email contains the changes to make our OpenMP
lowering and expansion machinery produce GPU kernels for a certain
limited class of loops.  The plan is to make that class quite a bit
bigger, but only the following is ready for submission now.

Basically, whenever the compiler configured for HSAIL generation
encounters the following pattern:

  #pragma omp target
  #pragma omp teams thread_limit(workgroup_size) // thread_limit is optional
  #pragma omp distribute parallel for firstprivate(n) private(i) other_sharing_clauses()
    for (i = 0; i < n; i++)
      some_loop_body

it creates a copy of the entire target body and expands it slightly
differently for concurrent execution on a GPU.  Note that both the teams
and distribute constructs are mandatory.  Moreover, currently the
distribute has to be in a combined statement with the inner for
construct.  And there are quite a few other restrictions which I hope
to alleviate over the next year, most notably by implementing reductions.
A few days ago I hoped to finish writing support for collapse(2) and
collapse(3) clauses in time for stage1, but now I am a bit sceptical.
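
For example (illustration only), a non-combined form such as

  #pragma omp target
  #pragma omp teams
  #pragma omp distribute
  for (j = 0; j < m; j++)
    {
  #pragma omp parallel for
      for (i = 0; i < n; i++)
        some_loop_body
    }

is currently not recognized and simply takes the normal, non-gridified
code path.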

The first phase of the "gridification" process is run before the OMP
"scanning" phase.  We look for the pattern above, and if we encounter
it, we copy its entire body into a new gimple statement
GIMPLE_OMP_GPUKERNEL.  Within it, we mark the teams, distribute and
parallel constructs with a new flag "kernel_phony".  This flag makes
the OMP lowering phase process their sharing clauses as usual, but the
statements representing the constructs are removed at lowering (and
thus are never expanded).  The resulting wasteful repackaging of data
is nicely cleaned up by our optimizers even at -O1.

At expansion time, we identify gomp_target statements with a kernel
and expand the kernel into a special function, with the loop
represented by the GPU grid rather than by control flow.  Afterwards,
the normal body of the target is expanded as usual.  Finally, we take
the grid dimensions stored in the new fields of the target statement
by the first phase, store them in a structure and pass them to libgomp
in a new parameter of GOMP_target_41.
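
Conceptually (a hand-written illustration of the idea only, not actual
compiler output), the kernel function generated for the pattern above
behaves like

  i = flat_workitem_id ();  /* hypothetical name for the HSA work-item id */
  if (i < n)                /* guard, in case the grid is rounded up */
    some_loop_body

i.e. the grid supplies one work item per loop iteration and the loop
control flow disappears; the grid size is derived from n and the
workgroup size from the thread_limit clause.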

Originally, when I started with the above pattern matching, I did not
allow any other gimple statements in between the respective omp
constructs.  That, however, proved to be too restrictive, for two
reasons.  First, statements in the pre-bodies of both the distribute
and for loops needed to be accounted for when calculating the kernel
grid size (which is done before the target statement itself), and
second, Fortran parameter dereferences happily result in interleaving
statements even when there were none in the user source code.

Therefore, I allow register-type stores to local non-addressable
variables in pre-bodies and also in between the OMP constructs.  All
of them are copied in front of the target statement and are either
used for the grid size calculation or removed as useless by later
optimizations.
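
As a concrete (made-up) C analogue of the Fortran situation:

  #pragma omp target
  #pragma omp teams
    {
      n = *n_ref;    /* front-end generated dereference between the constructs */
  #pragma omp distribute parallel for firstprivate(n) private(i)
      for (i = 0; i < n; i++)
        some_loop_body
    }

still matches; the store to the local, non-addressable n is copied in
front of the target statement so that the grid size can be computed
from it there.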

For the convenience of anybody reviewing the code, I'm attaching a very
simple testcase with a selection of dumps that illustrate the whole
process.

While we have also been experimenting quite a bit with dynamic
parallelism, we have only been able to achieve good performance
via this process of gridification.  The user can be notified whether a
particular target construct was gridified or not via our process of
dumping notes, which, however, only appear in the detailed dump.  I am
seriously considering emitting some kind of warning when an HSA-enabled
compiler is about to produce non-gridified target code.

I hope I have managed to write the gridification in a way that
interferes very little with the rest of the OMP pipeline and yet
re-implements only the bare minimum of functionality that is already
there.  I'll be grateful for any feedback regarding the approach.

Thanks,

Martin


2015-11-05  Martin Jambor  

* builtin-types.def (BT_FN_VOID_PTR_INT_PTR): New.
(BT_FN_VOID_INT_OMPFN_SIZE_PTR_PTR_PTR_UINT_PTR): Removed.
(BT_FN_VOID_INT_OMPFN_SIZE_PTR_PTR_PTR_UINT_PTR_PTR): New.
* fortran/types.def (BT_FN_VOID_PTR_INT_PTR): New.
(BT_FN_VOID_INT_OMPFN_SIZE_PTR_PTR_PTR_UINT_PTR): Removed.
(BT_FN_VOID_INT_OMPFN_SIZE_PTR_PTR_PTR_UINT_PTR_PTR): New.
* gimple-low.c (lower_stmt): Handle GIMPLE_OMP_GPUKERNEL.
* gimple-pretty-print.c (dump_gimple_omp_for): Likewise.
(dump_gimple_omp_block): Handle GF_OMP_FOR_KIND_KERNEL_BODY.
(pp_gimple_stmt_1): Handle GIMPLE_OMP_GPUKERNEL.
* gimple-walk.c (walk_gimple_stmt): Likewise.
* gimple.c (gimple_build_omp_gpukernel): New function.
(gimple_omp_target_init_dimensions): Likewise.
(gimple_copy): Handle GIMPLE_OMP_GPUKERNEL.
* gimple.def (GIMPLE_OMP_TEAMS): Moved into its own layout.
(GIMPLE_OMP_GPUKERNEL): New.
* gimple.h (gf_mask): New element GF_OMP_FOR_KIND_KERNEL_BODY.
(gomp_for): New