On Fri, Mar 24, 2023 at 04:30:38PM +0100, Frederik Harwath wrote: > this patch series implements the OpenMP 5.1 "unroll" and "tile" > constructs. It includes changes to the C,C++, and Fortran front end > for parsing the new constructs and a new middle-end > "omp_transform_loops" pass which implements the transformations in a > source language agnostic way.
I'm afraid we can't do it this way, at least not completely. The OpenMP requirements and what is being discussed for further loop transformations pretty much requires parts of it to be done as soon as possible. My understanding is that that is where other implementations implement that too and would also prefer GCC not to be the only implementation that takes significantly different decision in that case from other implementations like e.g. in the offloading case (where all other implementations preprocess/parse etc. source multiple times compared to GCC splitting stuff only at IPA time; this affects what can be done with metadirectives, declare variant etc.). Now, e.g. data sharing is done almost exclusively during gimplification, the proposed pass is later than that; it needs to be done before the data sharing. Ditto doacross handling. The normal loop constructs (OMP_FOR, OMP_SIMD, OMP_DISTRIBUTE, OMP_LOOP) already need to know given their collapse/ordered how many loops they are actually associated with and the loop transformation constructs can change that. So, I think we need to do the loop transformations in the FEs, that doesn't mean we need to write everything 3 times, once for each frontend. Already now, e.g. various stuff is shared between C and C++ FEs in c-family, though how much can be shared between c-family and Fortran is to be discovered. Or at least partially, to the extent that we compute how many canonical loops the loop transformations result in, what artificial iterators they will use etc., so that during gimplification we can take all that into account and then can do the actual transformations later. For C, I think the lowering of loop transformation constructs or at least determining what it means can be done right after we actually parse it and before we finalize the OMP_FOR eetc. that wraps it if any. As discussed last week at F2F, I think we want to remember in OMP_FOR_ORIG_DECLS the user iterators on the loop transformation constructs and take it into account for data sharing purposes. For C++ in templates we obviously need to defer that until instantiations, the constants in the clauses etc. could be template parameters etc. For Fortran during resolving. > The "unroll" and "tile" directives are > internally implemented as clauses. This fits the representation of So perhaps just use OMP_UNROLL/OMP_TILE as GENERIC constructs like OMP_FOR etc. but with some argument where from the early loop transformation analysis you can remember the important stuff, whether does the loop transformation result in a canonical loop nest or not and in the former case with how many nested loops. And then handle the actual transformation IMHO best at gimplification time, find them in the OMP_FOR etc. body if they are nested in there, let the transformation happen on GENERIC before the containing OMP_FOR etc. if any is actually finalized and from the transformation remember the original user decls and what should happen with them for data sharing (e.g. lastprivate/lastprivate conditional). >From the slides I saw last week, a lot of other transformations are in the planning, like loop reversal etc. And, I think even in OpenMP 5.1 nothing prevents e.g. #pragma omp for collapse(3) // etc. #pragma omp tile sizes (4, 2, 2) #pragma omp tile sizes (4, 8, 16) for (int i = 0; i < 64; ++i) for (int j = 0; j < 64; ++j) for (int k = 0; k < 64; ++k) body; where the inner tile takes the i and j loops and makes for (int i1 = 0; i1 < 64; i1 += 4) for (int j1 = 0; j1 < 64; j1 += 8) for (int k1 = 0; k1 < 64; k1 += 16) for (int i2 = 0; i2 < 4; i2++) { int i = i1 + i2; for (int j2 = 0; j2 < 8; j2++) { int j = j1 + j2; for (int k2 = 0; k2 < 16; k2++) { int k = k1 + k2; body; } } } out of it with 3 outer loops which have canonical loop form (the rest doesn't). And then the second tile takes the outermost 3 of those generated loops and tiles them again, making it into again 3 canonical loop form loops plus stuff inside of it. Or one can replace the #pragma omp for collapse(3) // etc. with #pragma omp for #pragma omp unroll partial(2) which furthermore unrolls the outermost generated loop from the outer tile turning it into 1 canonical loop form loop plus stuff in it. Or of course as you have in your testcases, some loop transformation constructs could be used on more nested loops, not necessarily before the outermost one. But still, in all cases you need to know quite early how many canonical loop form nested loops you get from each loop transformation, so that it can be e.g. checked against the collapse/ordered clauses. Feel free to disagree if you think your approach is able to handle all of this, just put details in why do you think so. Jakub