Re: omp workshare (PR35423) & beginner questions

Jakub Jelinek Mon, 14 Apr 2008 04:48:05 -0700

Hi!

On Wed, Apr 09, 2008 at 11:29:24PM -0500, Vasilis Liaskovitis wrote:
> I am a beginner interested in learning gcc internals and contributing
> to the community.


Thanks for showing interest in this area!

> I have started implementing PR35423 - omp workshare in the fortran
> front-end. I have some questions - any guidance and suggestions are
> welcome:
> 
> - For scalar assignments, wrapping them in OMP_SINGLE clause.

Yes, though if there are a couple of adjacent scalar assignments which don't
involve function calls and won't take too long to execute, you want
to put them all into one OMP_SINGLE.  If the assignments may take long
because of function calls and there are several such ones adjacent,
you can use OMP_WORKSHARE.

Furthermore, for all statements, not just the scalar ones, you want to
do dependency analysis between all the statements within !$omp workshare,
and make OMP_SINGLE, OMP_FOR or OMP_SECTIONS and add OMP_CLAUSE_NOWAIT
to them where no barrier is needed.
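
To make the above concrete, here is a rough hand-written C sketch of what
the lowered code could look like at the source level.  It is only an
illustration, not what gfortran emits: the function name workshare_sketch
and its arguments are made up, and the pragmas are simply ignored when
building without OpenMP.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical hand-lowered workshare body: two cheap scalar
   assignments followed by an array assignment that does not depend
   on them.  */
void
workshare_sketch (double *a, const double *b, long n,
                  double *s1, double *s2)
{
  assert (a != NULL && b != NULL && s1 != NULL && s2 != NULL && n >= 0);

#pragma omp parallel
  {
    long i;

    /* Both scalar assignments are cheap, so they share one single
       region.  The loop below reads neither *s1 nor *s2, so the
       barrier after the single can be dropped with nowait.  */
#pragma omp single nowait
    {
      *s1 = 1.0;
      *s2 = 2.0;
    }

    /* The array assignment is workshared across the team.  */
#pragma omp for
    for (i = 0; i < n; i++)
      a[i] = 2.0 * b[i];
  }
}
```

If the loop did read one of the scalars, the nowait would have to go,
which is exactly what the dependency analysis has to decide.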

> - Array/subarray assignments: For assignments handled by the
> scalarizer,  I now create an OMP_FOR loop instead of a LOOP_EXPR for
> the outermost scalarized loop. This achieves worksharing at the
> outermost loop level.

Yes, though on gomp-3_0-branch you actually could use collapsed OMP_FOR
loop too.  Just bear in mind that for best performance at least with
static OMP_FOR scheduling ideally the same memory (part of array in this
case) is accessed by the same thread, as then it is in that CPU's caches.
Of course that's not always possible, but if it can be done, gfortran
should try that.
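
A small C sketch of the locality point, assuming two consecutive array
assignments over the same index range (the function two_assignments is
made up for illustration).  With schedule(static), no chunk size, and the
same iteration count in both loops, iteration i should land on the same
thread in both loops, so a[i] is likely still in that CPU's cache when
the second loop reads it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: a = c + 1 followed by b = a * 2.  */
void
two_assignments (double *a, double *b, const double *c, long n)
{
  assert (a != NULL && b != NULL && c != NULL && n >= 0);

#pragma omp parallel
  {
    long i;

    /* Same bounds and same static schedule in both loops, so each
       thread gets the same block of indices both times.  */
#pragma omp for schedule(static)
    for (i = 0; i < n; i++)
      a[i] = c[i] + 1.0;

#pragma omp for schedule(static)
    for (i = 0; i < n; i++)
      b[i] = a[i] * 2.0;
  }
}
```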

> Some array assignments are handled by functions (e.g.
> gfc_build_memcpy_call generates calls to memcpy). For these, I believe
> we need to divide the arrays into chunks and have each thread call the
> builtin function on its own chunk. E.g. If we have the following call
> in a parallel workshare construct:
> 
> memcpy(dst, src, len)
> 
> I generate this pseudocode:
> 
> {
>   numthreads = omp_get_num_threads();
>   chunksize = len / numthreads;
>   chunksize = chunksize + ( len != chunksize*numthreads);
> }
> 
> #omp for
>    for (i = 0; i < numthreads; i++) {
>           mysrc = src + i*chunksize;
>           mydst = dst + i*chunksize;
>           mylen = min(chunksize, len - (i*chunksize));
>           memcpy(mydst, mysrc, mylen);
>   }
> 
> If you have a suggestion to implement this in a simpler way, let me know.

Yeah, this is possible.  Note though what I said above about cache locality.
And, if the memcpy size is known to be small, doing it in OMP_SINGLE might
have advantages too.
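
For reference, a runnable C sketch of the chunked copy, under some
assumptions of mine: the name chunked_memcpy and the SMALL_COPY_BYTES
cutoff are hypothetical, and the cutoff value is not tuned.  It also
guards against a chunk starting past the end of the buffer, which can
happen with the rounded-up chunk size (e.g. len = 5 with 4 threads gives
chunksize 2, and the last chunk would start at offset 6).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical cutoff below which one thread does the whole copy.  */
#define SMALL_COPY_BYTES 4096

void
chunked_memcpy (char *dst, const char *src, size_t len)
{
  assert (dst != NULL && src != NULL);

#pragma omp parallel
  {
#ifdef _OPENMP
    size_t nthreads = (size_t) omp_get_num_threads ();
#else
    size_t nthreads = 1;
#endif
    long i;

    /* len is the same in every thread, so all threads take the same
       branch and meet the same worksharing construct.  */
    if (len < SMALL_COPY_BYTES)
      {
#pragma omp single
        memcpy (dst, src, len);
      }
    else
      {
        /* Rounded-up chunk size, computed redundantly by each thread.  */
        size_t chunk = (len + nthreads - 1) / nthreads;

#pragma omp for schedule(static)
        for (i = 0; i < (long) nthreads; i++)
          {
            size_t off = (size_t) i * chunk;
            if (off < len)
              {
                size_t mylen = len - off < chunk ? len - off : chunk;
                memcpy (dst + off, src + off, mylen);
              }
          }
      }
  }
}
```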

> The above code executes parallel in every thread. Alternatively, the
> first block above can be wrapped in omp_single, but the numthreads &
> chunksize variables should then be declared shared instead of private.
> All the variables above are private by default, since they are
> declared in a parallel construct.

omp_get_num_threads is very cheap, and even with a division and
multiplication it will most probably still be cheaper than OMP_SINGLE,
especially because it could not be NOWAIT.

> How can I set the scoping for a specific variable in a given
> omp for construct? Is the following correct to make a variable shared:
> 
> tmp = build_omp_clause(OMP_CLAUSE_SHARED);
> OMP_CLAUSE_DECL(tmp) = variable;
> omp_clauses = gfc_tran_add_clause(tmp, );

That, or just by letting the gimplifier set that up - if you don't
add OMP_CLAUSE_DEFAULT, by default loop iterators will be private,
the rest shared.
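
At the source level the same default looks like this in C (a sketch only;
the function scale is made up).  There is no default() clause and no
explicit shared/private clauses: the loop iterator is made private, and
the variables declared outside the construct are shared.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example relying on the implicit data-sharing rules.  */
void
scale (double *a, long n, double factor)
{
  long i;

  assert (a != NULL && n >= 0);

  /* i is the iterator of the workshared loop, hence private;
     a, n and factor are shared by default.  */
#pragma omp parallel for
  for (i = 0; i < n; i++)
    a[i] *= factor;
}
```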

> -  I still need to do worksharing for array reduction operators (e.g.
> SUM, ALL, MAXLOC etc). For these, I think a combination of
> OMP_FOR/OMP_SINGLE or OMP_REDUCTION is needed. I will also try to work
> on WHERE and FORALL statements.

I guess OMP_CLAUSE_REDUCTION for sum, max etc. will be best.  But testing
several variants on a bunch of testcases and benchmarking what is fastest
under what conditions is certainly the way to go in many cases.
Either you code it up in gfortran and try it, or you transform your original
!$omp workshare benchmarks into !$omp single, !$omp sections, !$omp for etc.
by hand and test that, which is certainly possible too.
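
As an example of such a hand transformation, a SUM over an array could be
written with a reduction clause roughly like this in C (a sketch; the
function array_sum is made up and this is not gfortran output).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical hand-written equivalent of s = SUM(a).  */
double
array_sum (const double *a, long n)
{
  double sum = 0.0;
  long i;

  assert (a != NULL && n >= 0);

  /* Each thread accumulates into its own private copy of sum; the
     copies are combined with + at the end of the loop.  */
#pragma omp parallel for reduction(+:sum)
  for (i = 0; i < n; i++)
    sum += a[i];

  return sum;
}
```

Note that with floating point the order of the additions depends on the
number of threads, so results can differ slightly from the serial sum.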

BTW, whenever you create OMP_FOR to handle part or whole !$omp workshare,
you should also choose the best scheduling kind.  You could just use
schedule(auto) and let the middle-end choose the best scheduling when
that support is actually added, but often the gfortran frontend might
know even better.

> I am also interested in gomp3 implementation and performance issues.
> If there are not-worked-on issues suitable for newbies, please share
> or update http://gcc.gnu.org/wiki/openmp. Can someone elaborate on the
> "Fine tune the auto scheduling feature for parallel loops" issue?

ATM the largest unfinished part of OpenMP 3.0 support is the tasking
support in libgomp using {[sg]et,make,swap}context family of functions,
but it is quite high on my todo list and I'd like to work on it soon.

As OpenMP 3.0 allows unsigned iterators for #pragma omp for, that is
something that should be fixed too even for corner cases, and long long
and unsigned long long iterators too.

The schedule(auto) needs some analysis of the loop, primarily whether
each iteration will need roughly the same time or a varying amount.  If the
former, schedule(static) might be the best scheduling choice, otherwise e.g.
some kind of schedule(dynamic, N) for some carefully chosen N.
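
The two cases look roughly like this in C.  This is a sketch only: the
function names are made up and the chunk size 16 is an arbitrary
placeholder for N, not a tuned value.

```c
#include <assert.h>
#include <stddef.h>

/* Every iteration costs about the same: static scheduling fits.  */
void
uniform_work (double *a, long n)
{
  long i;

  assert (a != NULL && n >= 0);

#pragma omp parallel for schedule(static)
  for (i = 0; i < n; i++)
    a[i] = a[i] * 0.5 + 1.0;
}

/* Iteration i does i + 1 inner steps, so the cost grows with i and
   dynamic scheduling balances the load better.  */
void
varying_work (double *row_sum, long n)
{
  long i;

  assert (row_sum != NULL && n >= 0);

#pragma omp parallel for schedule(dynamic, 16)
  for (i = 0; i < n; i++)
    {
      long j;            /* Declared inside the loop, hence private.  */
      double s = 0.0;

      for (j = 0; j <= i; j++)
        s += (double) j;
      row_sum[i] = s;
    }
}
```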

        Jakub
