Hi,

Thanks for the help. Some more questions:
1) I am trying to workshare reduction operators, currently working on SUM:

      INTEGER N
      REAL AA(N), MYSUM
!$OMP PARALLEL
!$OMP WORKSHARE
      MYSUM = SUM(AA)
!$OMP END WORKSHARE
!$OMP END PARALLEL

To compute SUM, the scalarizer creates a temporary variable (let's call it
val2) for accumulating the sum. In order to workshare the sum, I am
attempting to create an OMP_FOR loop with an OMP reduction clause for the
temporary val2. In pseudocode this would be:

      OMP DO REDUCTION(+:val2)
      DO I=1,N
        val2 = val2 + AA(I)
      END DO

The problem is that I get an error from the gimplifier:

  "reduction variable val.2 is private in outer context"

I think this is because the parallel region assumes val2 is a private
variable. I have tried creating an extra OMP shared clause for val2:

  sharedreduction = build_omp_clause (OMP_CLAUSE_SHARED);
  OMP_CLAUSE_DECL (sharedreduction) = reduction_variable;

where reduction_variable is the tree node for val2. I am attaching this
clause to the clauses of the OMP_PARALLEL construct. Doing this breaks the
following assertion in gimplify.c:omp_add_variable:

  /* The only combination of data sharing classes we should see is
     FIRSTPRIVATE and LASTPRIVATE.  */
  nflags = n->value | flags;
  gcc_assert ((nflags & GOVD_DATA_SHARE_CLASS)
              == (GOVD_FIRSTPRIVATE | GOVD_LASTPRIVATE));

I think this happens because val2 is first added with
GOVD_SHARED | GOVD_EXPLICIT flags because of my shared clause, and later
re-added (from the default parallel construct handling?) with
GOVD_LOCAL | GOVD_SEEN attributes. Ignoring this, another assertion breaks
in expr.c:

  /* Variables inherited from containing functions should have
     been lowered by this point.  */
  context = decl_function_context (exp);
  gcc_assert (!context
              || context == current_function_decl
              || TREE_STATIC (exp)
              /* ??? C++ creates functions that are not TREE_STATIC.  */
              || TREE_CODE (exp) == FUNCTION_DECL);

I guess val2 is not lowered properly? Ignoring this assertion triggers an
RTL error (mismatched machine modes, DI assigned to SF), so something is
definitely wrong. Do I need to attach val2's tree declaration somewhere
else?

2) Again for the reduction operators, I would subsequently do the scalar
assignment MYSUM = val2 by one thread using an OMP single. Is there a
better way? I don't think I can use the program-defined MYSUM as the
reduction variable inside the sum loop, because the rhs needs to be
evaluated before the lhs is assigned to.

3) gfc_check_dependency seems to be an appropriate helper function for the
dependence analysis between the statements of the workshare block. If you
have other suggestions, let me know.

thanks,

- Vasilis

On Mon, Apr 14, 2008 at 6:47 AM, Jakub Jelinek <[EMAIL PROTECTED]> wrote:
> Hi!
>
> On Wed, Apr 09, 2008 at 11:29:24PM -0500, Vasilis Liaskovitis wrote:
> > I am a beginner interested in learning gcc internals and contributing
> > to the community.
>
> Thanks for showing interest in this area!
>
> > I have started implementing PR35423 - omp workshare in the fortran
> > front-end. I have some questions - any guidance and suggestions are
> > welcome:
> >
> > - For scalar assignments, wrapping them in OMP_SINGLE clause.
>
> Yes, though if there are a couple of adjacent scalar assignments which
> don't involve function calls and won't take too long to execute, you want
> to put them all into one OMP_SINGLE. If the assignments may take long
> because of function calls and there are several such ones adjacent, you
> can use OMP_WORKSHARE.
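For reference, a hand-written C analogue of the transformation that
questions 1) and 2) describe might look like the sketch below. It is only
an illustration under the assumption that the scalarizer's temporary (val2
here) ends up shared in the enclosing parallel region; none of the names
correspond to actual generated trees.

  /* Minimal hand-written analogue: the accumulator is shared in the
     enclosing parallel region, reduced across the workshared loop, and
     the final scalar store into the user variable is done by one thread.  */
  #include <stdio.h>

  #define N 1000

  int
  main (void)
  {
    float aa[N], mysum;
    float val2 = 0.0f;          /* stands in for the scalarizer temporary */
    int i;

    for (i = 0; i < N; i++)
      aa[i] = 1.0f;

  #pragma omp parallel shared (mysum, val2)
    {
      /* The reduction clause requires val2 to be shared in the enclosing
         parallel region, which is what the gimplifier's "private in outer
         context" error is complaining about.  */
  #pragma omp for reduction (+:val2)
      for (i = 0; i < N; i++)
        val2 = val2 + aa[i];

      /* Question 2: one thread copies the combined value into MYSUM.  */
  #pragma omp single
      mysum = val2;
    }

    printf ("sum = %f\n", mysum);
    return 0;
  }

With val2 shared at the parallel level, the reduction clause on the loop
gives each thread its own private copy and combines them at the loop's
implicit barrier, so the single thread that does MYSUM = val2 afterwards
sees the finished sum.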
> Furthermore, for all statements, not just the scalar ones, you want to
> do dependency analysis between all the statements within !$omp workshare,
> and make OMP_SINGLE, OMP_FOR or OMP_SECTIONS and add OMP_CLAUSE_NOWAIT
> to them where no barrier is needed.
>
> > - Array/subarray assignments: For assignments handled by the
> > scalarizer, I now create an OMP_FOR loop instead of a LOOP_EXPR for
> > the outermost scalarized loop. This achieves worksharing at the
> > outermost loop level.
>
> Yes, though on gomp-3_0-branch you could actually use a collapsed OMP_FOR
> loop too. Just bear in mind that for best performance, at least with
> static OMP_FOR scheduling, ideally the same memory (part of the array in
> this case) is accessed by the same thread, as then it is in that CPU's
> caches. Of course that's not always possible, but if it can be done,
> gfortran should try that.
>
> > Some array assignments are handled by functions (e.g.
> > gfc_build_memcpy_call generates calls to memcpy). For these, I believe
> > we need to divide the arrays into chunks and have each thread call the
> > builtin function on its own chunk. E.g., if we have the following call
> > in a parallel workshare construct:
> >
> >   memcpy(dst, src, len)
> >
> > I generate this pseudocode:
> >
> >   {
> >     numthreads = omp_get_num_threads();
> >     chunksize = len / numthreads;
> >     chunksize = chunksize + (len != chunksize*numthreads);
> >   }
> >
> >   #pragma omp for
> >   for (i = 0; i < numthreads; i++) {
> >     mysrc = src + i*chunksize;
> >     mydst = dst + i*chunksize;
> >     mylen = min(chunksize, len - (i*chunksize));
> >     memcpy(mydst, mysrc, mylen);
> >   }
> >
> > If you have a suggestion to implement this in a simpler way, let me know.
>
> Yeah, this is possible. Note though what I said above about cache
> locality. And, if the memcpy size is known to be small, doing it in
> OMP_SINGLE might have advantages too.
>
> > The above code executes in parallel in every thread. Alternatively, the
> > first block above can be wrapped in omp_single, but the numthreads &
> > chunksize variables should then be declared shared instead of private.
> > All the variables above are private by default, since they are declared
> > in a parallel construct.
>
> omp_get_num_threads is very cheap, and even with a division and a
> multiplication it will most probably still be cheaper than OMP_SINGLE,
> especially because it could not be NOWAIT.
>
> > How can I set the scoping for a specific variable in a given
> > omp for construct? Is the following correct to make a variable shared:
> >
> >   tmp = build_omp_clause(OMP_CLAUSE_SHARED);
> >   OMP_CLAUSE_DECL(tmp) = variable;
> >   omp_clauses = gfc_trans_add_clause(tmp, );
>
> That, or just by letting the gimplifier set that up - if you don't
> add OMP_CLAUSE_DEFAULT, by default loop iterators will be private,
> the rest shared.
>
> > - I still need to do worksharing for array reduction operators (e.g.
> > SUM, ALL, MAXLOC etc.). For these, I think a combination of
> > OMP_FOR/OMP_SINGLE or OMP_REDUCTION is needed. I will also try to work
> > on WHERE and FORALL statements.
>
> I guess OMP_CLAUSE_REDUCTION for sum, max etc. will be best. But testing
> several variants on a bunch of testcases and benchmarking what is fastest
> under what conditions is certainly the way to go in many cases. Either
> you code it up in gfortran and try, or transform your original
> !$omp workshare benchmarks into !$omp single, !$omp sections, !$omp for
> etc. by hand and test that - that is certainly possible too.
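As a compilable counterpart to the chunked-memcpy pseudocode quoted above,
a sketch along these lines is possible (the helper name workshared_memcpy
and the rounding scheme are illustrative, not what gfc_build_memcpy_call
would actually emit):

  /* Each thread copies its own contiguous slice of the buffer.  */
  #include <string.h>
  #include <omp.h>

  static void
  workshared_memcpy (char *dst, const char *src, size_t len)
  {
  #pragma omp parallel
    {
      int nthreads = omp_get_num_threads ();
      size_t chunksize = len / nthreads;
      int i;

      /* Round up so the chunks cover any remainder.  */
      if (chunksize * (size_t) nthreads != len)
        chunksize++;

  #pragma omp for
      for (i = 0; i < nthreads; i++)
        {
          size_t off = (size_t) i * chunksize;
          if (off < len)
            {
              size_t mylen = len - off < chunksize ? len - off : chunksize;
              memcpy (dst + off, src + off, mylen);
            }
        }
    }
  }

As noted above, letting every thread recompute nthreads and chunksize
privately is likely cheaper than funnelling that setup through an
OMP_SINGLE, which could not be NOWAIT.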
> BTW, whenever you create an OMP_FOR to handle part of or the whole
> !$omp workshare, you should also choose the best scheduling kind. You
> could just use schedule(auto) and let the middle-end choose the best
> scheduling when that support is actually added, but often the gfortran
> front end might know even better.
>
> > I am also interested in gomp3 implementation and performance issues.
> > If there are not-worked-on issues suitable for newbies, please share
> > or update http://gcc.gnu.org/wiki/openmp. Can someone elaborate on the
> > "Fine tune the auto scheduling feature for parallel loops" issue?
>
> ATM the largest unfinished part of OpenMP 3.0 support is the tasking
> support in libgomp using the {[sg]et,make,swap}context family of
> functions, but it is quite high on my todo list and I'd like to work on
> it soon.
>
> As OpenMP 3.0 allows unsigned iterators for #pragma omp for, that is
> something that should be fixed too, even for corner cases, and long long
> and unsigned long long iterators too.
>
> The schedule(auto) needs some analysis of the loop, primarily whether
> each iteration will need roughly the same time or varying amounts. If
> the former, schedule(static) might be the best scheduling choice;
> otherwise e.g. some kind of schedule(dynamic, N) for some carefully
> chosen N.
>
> 	Jakub
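To illustrate the scheduling trade-off described above, here is a small
sketch (the function names and the chunk size of 64 are arbitrary choices):
uniform-cost iterations suit schedule(static), while iterations of varying
cost are better served by some schedule(dynamic, N).

  #include <math.h>

  /* Uniform iterations: schedule(static) keeps each thread on one
     contiguous, cache-friendly block of the array.  */
  void
  uniform_work (float *a, int n)
  {
    int i;
  #pragma omp parallel for schedule (static)
    for (i = 0; i < n; i++)
      a[i] = a[i] * 2.0f;
  }

  /* Iteration cost grows with i, so static chunks would be unbalanced;
     a dynamic schedule with a modest chunk size spreads the load.  */
  void
  irregular_work (float *a, int n)
  {
    int i, j;
  #pragma omp parallel for schedule (dynamic, 64) private (j)
    for (i = 0; i < n; i++)
      for (j = 0; j < i; j++)
        a[i] += sinf ((float) j);
  }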