Hi,

Thanks for the help. Some more questions:

1) I am trying to workshare reduction operators, currently working on
SUM.

      INTEGER N
      REAL AA(N), MYSUM
!$OMP PARALLEL
!$OMP WORKSHARE
      MYSUM = SUM(AA)
!$OMP END WORKSHARE
!$OMP END PARALLEL

To compute SUM, the scalarizer creates a temporary variable (let's call
it val2) for accumulating the sum.

In order to workshare the sum, I am attempting to create an OMP_FOR loop
with an omp reduction clause for the temporary val2. In pseudocode this
would be

!$OMP DO REDUCTION(+:val2)
      DO I=1,N
        val2 = val2 + AA(I)
      END DO

The problem is that I get an error from the gimplifier: "reduction
variable val.2 is private in outer context". I think this is because
val2 is treated as private in the enclosing parallel region.
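
For reference, here is a standalone C analogue of the shape I believe the
gimplifier wants (just a sketch on my part; N, aa and the variable names are
placeholders, not the actual trees gfortran builds), with the reduction
variable explicitly shared on the enclosing parallel:

#include <stdio.h>

#define N 1000

int
main (void)
{
  static float aa[N];
  float val2 = 0.0f;
  int i;

  for (i = 0; i < N; i++)
    aa[i] = 1.0f;

  /* The reduction variable has to be data-shared in the enclosing
     parallel region; a private copy there is exactly what triggers
     "reduction variable is private in outer context".  */
#pragma omp parallel shared (val2)
  {
#pragma omp for reduction (+:val2)
    for (i = 0; i < N; i++)
      val2 = val2 + aa[i];
  }

  printf ("sum = %f\n", val2);
  return 0;
}

Compiled with -fopenmp this prints 1000.000000, and it is essentially what I
would like the workshare lowering to produce for the SUM intrinsic.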

I have tried creating an extra OMP_CLAUSE_SHARED clause for val2:

sharedreduction = build_omp_clause(OMP_CLAUSE_SHARED);
OMP_CLAUSE_DECL(sharedreduction) = reduction_variable;

where reduction_variable is the tree node for val2. I am attaching this
clause to the clauses of the OMP_PARALLEL construct.
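
Concretely, the chaining looks roughly like this (a sketch against the tree
accessors; omp_parallel_stmt is just my placeholder for whatever node I hold
for the enclosing OMP_PARALLEL):

/* Sketch: prepend an explicit shared(val2) clause to the clause list of
   the enclosing OMP_PARALLEL.  reduction_variable and omp_parallel_stmt
   are placeholders for my local trees.  */
tree sharedreduction = build_omp_clause (OMP_CLAUSE_SHARED);
OMP_CLAUSE_DECL (sharedreduction) = reduction_variable;
OMP_CLAUSE_CHAIN (sharedreduction) = OMP_PARALLEL_CLAUSES (omp_parallel_stmt);
OMP_PARALLEL_CLAUSES (omp_parallel_stmt) = sharedreduction;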

Doing this breaks the following assertion in gimplify.c:omp_add_variable

      /* The only combination of data sharing classes we should see is
         FIRSTPRIVATE and LASTPRIVATE.  */
      nflags = n->value | flags;
      gcc_assert ((nflags & GOVD_DATA_SHARE_CLASS)
                  == (GOVD_FIRSTPRIVATE | GOVD_LASTPRIVATE));

I think this happens because val2 is first added with GOVD_SHARED |
GOVD_EXPLICIT (due to my explicit shared clause), and later re-added
(from the default handling of the parallel construct?) with GOVD_LOCAL |
GOVD_SEEN.

Ignoring this, another assertion breaks in expr.c:

  /* Variables inherited from containing functions should have
     been lowered by this point.  */
  context = decl_function_context (exp);
  gcc_assert (!context
              || context == current_function_decl
              || TREE_STATIC (exp)
              /* ??? C++ creates functions that are not TREE_STATIC.  */
              || TREE_CODE (exp) == FUNCTION_DECL);

I guess val2 is not being lowered properly? Ignoring this assertion too
triggers an RTL error (a wrong machine mode, DImode assigned to an SFmode
destination), so something is definitely wrong.
Do I need to attach val2's tree node declaration somewhere else?

2) Again for the reduction operators: I would then do the scalar
assignment MYSUM = val2 in one thread using an OMP_SINGLE. Is there a
better way? I don't think I can use the program-defined MYSUM itself as
the reduction variable inside the sum loop, because the rhs needs to be
fully evaluated before the lhs is assigned to.
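
In C terms, extending the sketch from question 1 (same placeholder names,
with mysum standing in for the user's MYSUM and also shared on the
parallel), the body would look like:

  /* Inside the same parallel region as in the sketch above.  */
#pragma omp for reduction (+:val2)
  for (i = 0; i < N; i++)
    val2 = val2 + aa[i];

  /* The implicit barrier at the end of the worksharing loop guarantees
     val2 already holds the combined value, so a single thread can then
     do the user-visible scalar assignment.  */
#pragma omp single
  mysum = val2;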

3) gfc_check_dependency seems to be an appropriate helper function for
the dependence analysis between the statements of the workshare block.
If you have other suggestions, let me know.
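
My rough plan, with the caveat that I have only skimmed dependency.c (the
gfc_check_dependency signature and the gfc_code field names below are my
assumptions and may be wrong), is something along these lines:

/* Sketch: decide whether the worksharing construct generated for stmt1
   needs a barrier before stmt2, i.e. whether OMP_CLAUSE_NOWAIT must be
   omitted.  stmt1 and stmt2 are two adjacent assignments inside the
   !$omp workshare block; expr is the lhs and expr2 the rhs.  */
static bool
workshare_needs_barrier (gfc_code *stmt1, gfc_code *stmt2)
{
  /* Barrier needed if stmt2 reads (RAW) or rewrites (WAW) what stmt1
     writes; WAR dependences on stmt1's rhs would need a similar check.  */
  return gfc_check_dependency (stmt1->expr, stmt2->expr2, false)
         || gfc_check_dependency (stmt1->expr, stmt2->expr, false);
}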

thanks,

- Vasilis

On Mon, Apr 14, 2008 at 6:47 AM, Jakub Jelinek <[EMAIL PROTECTED]> wrote:
> Hi!
>
>
>  On Wed, Apr 09, 2008 at 11:29:24PM -0500, Vasilis Liaskovitis wrote:
>  > I am a beginner interested in learning gcc internals and contributing
>  > to the community.
>
>  Thanks for showing interest in this area!
>
>
>  > I have started implementing PR35423 - omp workshare in the fortran
>  > front-end. I have some questions - any guidance and suggestions are
>  > welcome:
>  >
>  > - For scalar assignments, wrapping them in an OMP_SINGLE construct.
>
>  Yes, though if there are a couple of adjacent scalar assignments which don't
>  involve function calls and won't take too long to execute, you want
>  to put them all into one OMP_SINGLE.  If the assignments may take long
>  because of function calls and there are several such ones adjacent,
>  you can use OMP_SECTIONS.
>
>  Furthermore, for all statements, not just the scalar ones, you want to
>  do dependency analysis between all the statements within !$omp workshare,
>  and make OMP_SINGLE, OMP_FOR or OMP_SECTIONS and add OMP_CLAUSE_NOWAIT
>  to them where no barrier is needed.
>
>
>  > - Array/subarray assignments: For assignments handled by the
>  > scalarizer,  I now create an OMP_FOR loop instead of a LOOP_EXPR for
>  > the outermost scalarized loop. This achieves worksharing at the
>  > outermost loop level.
>
>  Yes, though on gomp-3_0-branch you could actually use a collapsed OMP_FOR
>  loop too.  Just bear in mind that for best performance, at least with
>  static OMP_FOR scheduling, the same memory (the same part of the array in
>  this case) should ideally be accessed by the same thread, as then it is in
>  that CPU's caches.  Of course that's not always possible, but if it can be
>  done, gfortran should try that.
>
>
>  > Some array assignments are handled by functions (e.g.
>  > gfc_build_memcpy_call generates calls to memcpy). For these, I believe
>  > we need to divide the arrays into chunks and have each thread call the
>  > builtin function on its own chunk. E.g. If we have the following call
>  > in a parallel workshare construct:
>  >
>  > memcpy(dst, src, len)
>  >
>  > I generate this pseudocode:
>  >
>  > {
>  >   numthreads = omp_get_num_threads();
>  >   chunksize = len / numthreads;
>  >   chunksize = chunksize + ( len != chunksize*numthreads)
>  > }
>  >
>  > #omp for
>  >    for (i = 0; i < numthreads; i++) {
>  >           mysrc = src + i*chunksize;
>  >           mydst = dst + i*chunksize;
>  >           mylen = min(chunksize, len - (i*chunksize));
>  >           memcpy(mydst, mysrc, mylen);
>  >   }
>  >
>  > If you have a suggestion to implement this in a simpler way, let me know.
>
>  Yeah, this is possible.  Note though what I said above about cache locality.
>  And, if the memcpy size is known to be small, doing it in OMP_SINGLE might
>  have advantages too.
>
>
>  > The above code executes in parallel in every thread. Alternatively, the
>  > first block above can be wrapped in omp_single, but the numthreads &
>  > chunksize variables should then be declared shared instead of private.
>  > All the variables above are private by default, since they are declared
>  > in a parallel construct.
>
>  omp_get_num_threads is very cheap, and even with a division and
>  multiplication it will most probably still be cheaper than OMP_SINGLE,
>  especially because the latter could not be NOWAIT.
>
>
>  > How can I set the scoping for a specific variable in a given
>  > omp for construct? Is the following correct to make a variable shared:
>  >
>  > tmp = build_omp_clause(OMP_CLAUSE_SHARED);
>  > OMP_CLAUSE_DECL(tmp) = variable;
>  > omp_clauses = gfc_tran_add_clause(tmp, );
>
>  That, or just by letting the gimplifier set that up - if you don't
>  add OMP_CLAUSE_DEFAULT, by default loop iterators will be private,
>  the rest shared.
>
>
>  > -  I still need to do worksharing for array reduction operators (e.g.
>  > SUM,ALL, MAXLOC etc). For these, I think a combination of 
> OMP_FOR/OMP_SINGLE or
>  > OMP_REDUCTION is needed. I will also try to work on WHERE and
>  > FORALL statements.
>
>  I guess OMP_CLAUSE_REDUCTION for sum, max etc. will be best.  But testing
>  several variants on a bunch of testcases and benchmarking what is fastest
>  under what conditions is certainly the way to go in many cases.
>  Either you code it up in gfortran and try it, or you transform your original
>  !$omp workshare benchmarks into !$omp single, !$omp sections, !$omp for etc.
>  by hand and test those; both are certainly possible.
>
>  BTW, whenever you create OMP_FOR to handle part or whole !$omp workshare,
>  you should also choose the best scheduling kind.  You could just use
>  schedule(auto) and let the middle-end choose the best scheduling when
>  that support is actually added, but often the gfortran frontend might
>  know even better.
>
>
>  > I am also interested in gomp3 implementation and performance issues.
>  > If there are not-worked-on issues suitable for newbies, please share
>  > or update http://gcc.gnu.org/wiki/openmp. Can someone elaborate on the
>  > "Fine tune the auto scheduling feature for parallel loops" issue?
>
>  ATM the largest unfinished part of OpenMP 3.0 support is the tasking
>  support in libgomp using {[sg]et,make,swap}context family of functions,
>  but it is quite high on my todo list and I'd like to work on it soon.
>
>  As OpenMP 3.0 allows unsigned iterators for #pragma omp for, that is
>  something that should be fixed too even for corner cases, and long long
>  and unsigned long long iterators too.
>
>  The schedule(auto) needs some analysis of the loop, primarily whether
>  each iteration will need roughly the same time or a varying amount.  If the
>  former, schedule(static) might be the best scheduling choice, otherwise
>  e.g. some kind of schedule(dynamic, N) for some carefully chosen N.
>
>         Jakub
>
