[Bug tree-optimization/46032] openmp inhibits loop vectorization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=46032

--- Comment #23 from vries at gcc dot gnu.org ---
Author: vries
Date: Mon Nov 30 16:34:26 2015
New Revision: 231076

URL: https://gcc.gnu.org/viewcvs?rev=231076&root=gcc&view=rev
Log:
Handle BUILT_IN_GOMP_PARALLEL in ipa-pta

2015-11-30  Tom de Vries

        PR tree-optimization/46032
        * tree-ssa-structalias.c (find_func_aliases_for_call_arg): New
        function, factored out of ...
        (find_func_aliases_for_call): ... here.
        (find_func_aliases_for_builtin_call, find_func_clobbers): Handle
        BUILT_IN_GOMP_PARALLEL.
        (ipa_pta_execute): Same.  Handle node->parallelized_function as a
        local function.

        * gcc.dg/pr46032.c: New test.

        * testsuite/libgomp.c/pr46032.c: New test.

Added:
    trunk/gcc/testsuite/gcc.dg/pr46032.c
    trunk/libgomp/testsuite/libgomp.c/pr46032.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-ssa-structalias.c
    trunk/libgomp/ChangeLog
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=46032

vries at gcc dot gnu.org changed:

           What    |Removed                       |Added
           ----------------------------------------------------------------
         Status    |NEW                           |RESOLVED
     Resolution    |---                           |FIXED
       Assignee    |unassigned at gcc dot gnu.org |vries at gcc dot gnu.org

--- Comment #24 from vries at gcc dot gnu.org ---
Patch with testcase committed, marking resolved-fixed.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=46032

vries at gcc dot gnu.org changed:

           What    |Removed |Added
           ----------------------------------------------------------------
       Keywords    |        |missed-optimization, patch
       Severity    |major   |enhancement

--- Comment #22 from vries at gcc dot gnu.org ---
https://gcc.gnu.org/ml/gcc-patches/2015-11/msg03448.html
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=46032

--- Comment #20 from vries at gcc dot gnu.org ---
This patch seems to have the desired effect on the original testcase:
...
diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index 830db75..996756b 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -9361,6 +9361,7 @@ expand_omp_for_static_nochunk (struct omp_region *region,
       if (collapse_bb == NULL)
        loop->latch = cont_bb;
       add_loop (loop, body_bb->loop_father);
+      loop->safelen = INT_MAX;
     }
 }
...

AFAIU, adding the omp for to the loop is an assertion that the loop is
independent.  It seems reasonable to assume that if the original loop was
independent, a loop operating on a slice of the original iteration space
will be independent as well.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=46032

--- Comment #21 from Jakub Jelinek ---
(In reply to vries from comment #20)
> This patch seems to have the desired effect on the original testcase:
> ...
> diff --git a/gcc/omp-low.c b/gcc/omp-low.c
> index 830db75..996756b 100644
> --- a/gcc/omp-low.c
> +++ b/gcc/omp-low.c
> @@ -9361,6 +9361,7 @@ expand_omp_for_static_nochunk (struct omp_region
> *region,
>        if (collapse_bb == NULL)
>         loop->latch = cont_bb;
>        add_loop (loop, body_bb->loop_father);
> +      loop->safelen = INT_MAX;
>      }
>  }
> ...
>
> AFAIU, adding the omp for to the loop is an assertion that the loop is
> independent.  It seems reasonable to assume that if the original loop was
> independent, the loop operating on a slice of the original iteration space
> will be independent as well.

That is very much wrong.  Static scheduling, both nochunk and chunk, doesn't
imply in any way that the iterations are independent.  The OpenMP standard
says how the work is split among the threads: with nochunk, threads get
consecutive sets of iterations as one chunk, of approximately the same size.
Even though it is not exactly specified how the iteration space is divided
(for nochunk), if you make the loop iterations independent you would break
many observable properties (say through threadprivate vars,
omp_get_thread_num etc.).

Note that loop->safelen == INT_MAX is actually weaker than independent
iterations: when loop->safelen == INT_MAX there can still be dependencies,
but only of certain kinds.  It says that running the loop normally is
equivalent to running simultaneously (or emulated) the first statements of
all the iterations, then the second statements, and so on (so the compiler
may vectorize with any vectorization factor it wants).
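Jakub's description of what safelen promises can be illustrated with a small emulation (a sketch, not part of the bug; the loop, names, and the emulated vectorization factor are all invented).  The loop below has dependence distance DIST, so running it lockstep with any vectorization factor up to DIST matches sequential execution — i.e. it would be valid for safelen == DIST, though not for safelen == INT_MAX:

```c
#include <assert.h>
#include <string.h>

#define N    16
#define VF   4   /* emulated vectorization factor */
#define DIST 4   /* dependence distance: a[i] reads a[i - 4] */

/* Sequential execution of the loop. */
static void run_sequential (int *a)
{
  for (int i = DIST; i < N; i++)
    a[i] = a[i - DIST] + 1;
}

/* Emulated vectorized execution: run VF iterations in lockstep, doing all
   the loads before any of the stores, as a vector load/store would. */
static void run_lockstep (int *a)
{
  for (int i = DIST; i < N; i += VF)
    {
      int lanes = (N - i < VF) ? N - i : VF;
      int tmp[VF];
      for (int l = 0; l < lanes; l++)   /* "first statements" of all lanes */
        tmp[l] = a[i + l - DIST] + 1;
      for (int l = 0; l < lanes; l++)   /* then the stores */
        a[i + l] = tmp[l];
    }
}
```

With VF larger than DIST the lockstep loads would observe stale values and the two executions would diverge, which is exactly why a finite safelen bound matters.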
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=46032

--- Comment #17 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to vries from comment #16)
> ...
> Applying the 4.6 patch on trunk (and dropping the loop in the hunk for
> intra_create_variable_infos that does not apply cleanly anymore) and
> applying fnspec ".rw" on GOMP_parallel gives us in ipa-pta:
> ...
> _17 = { }
> _21 = { }
> ...
> where _17 and _21 are the omp_data_i relative loads in the split-off
> function:
> ...
>   # VUSE <.MEM_4>
>   # PT = nonlocal escaped
>   _17 = MEM[(struct .omp_data_s.0D.1713 *).omp_data_i_16(D) clique 1
> base 1].pDataD.1719;
>   # VUSE <.MEM_4>
>   _18 = *_17[idx_1];
>   # VUSE <.MEM_4>
>   # PT = nonlocal escaped
>   _21 = MEM[(struct .omp_data_s.0D.1713 *).omp_data_i_16(D) clique 1
> base 1].resultsD.1721;
>   # .MEM_22 = VDEF <.MEM_4>
>   *_21[idx_1] = _20;
> ...
>
> It is reasonable to assume that we are no longer able to relate these
> loads in the split-off function back to pData and results in the donor
> function, because there's no longer a direct function call to
> main._omp_fn in the donor function.

In fact it even looks like wrong IPA PTA results to me (_17 and _21 point to
nothing).

Index: gcc/tree-ssa-structalias.c
===================================================================
--- gcc/tree-ssa-structalias.c  (revision 223737)
+++ gcc/tree-ssa-structalias.c  (working copy)
@@ -7372,7 +7372,8 @@ ipa_pta_execute (void)
          constraints for parameters.  */
       if (node->used_from_other_partition
           || node->externally_visible
-          || node->force_output)
+          || node->force_output
+          || node->address_taken)
         {
           intra_create_variable_infos (func);

fixes that.  Of course that makes a solution handling the OMP builtins
specially not going to work, as if the function has its address taken we
don't know whether it is called from anywhere else.  The fix is required for
correctness though.  With it the solution becomes

_17 = { ESCAPED NONLOCAL }
_21 = { ESCAPED NONLOCAL }

for

  _18 = *_17[idx_3];
  *_21[idx_3] = _20;

and handling the OMP builtin specially will only get the solution amended to
{ ESCAPED NONLOCAL results } and { ESCAPED NONLOCAL pData }, and thus still
conflict.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=46032

--- Comment #19 from rguenther at suse dot de <rguenther at suse dot de> ---
On Wed, 27 May 2015, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=46032
>
> --- Comment #18 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
> The *.omp_fn.* functions indeed have to be addressable, because that is
> how they are passed to the libgomp entrypoints, but they are never called
> by anything other than the libgomp runtime.  For GOMP_parallel*, they are
> only called before the GOMP_parallel* function exits; for GOMP_task*,
> they could be called at some later point.

Ok, so this just means that IPA PTA would need to handle those specially
(and thus the OMP functions should be marked specially in the cgraph node).
Not that I think IPA PTA is anywhere near production ready (or that I have
time to fix it up properly...).  Just testing the addressable fix now.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=46032

--- Comment #18 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
The *.omp_fn.* functions indeed have to be addressable, because that is how
they are passed to the libgomp entrypoints, but they are never called by
anything other than the libgomp runtime.  For GOMP_parallel*, they are only
called before the GOMP_parallel* function exits; for GOMP_task*, they could
be called at some later point.
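The calling pattern Jakub describes — the outlined function's address is taken only to hand it to the libgomp entrypoint, which invokes it before returning — can be mimicked with a toy, single-threaded stand-in (all names hypothetical; this is not the real libgomp API):

```c
#include <assert.h>

/* Mimics the .omp_data_o block the compiler builds for a parallel region
   (field names invented). */
struct omp_data { int *results; int n; };

/* Mimics an outlined main._omp_fn.0: the body of the parallel region. */
static void omp_fn (void *data)
{
  struct omp_data *d = data;
  for (int i = 0; i < d->n; i++)
    d->results[i] = 2 * i;
}

/* Toy stand-in for GOMP_parallel: FN is called (here once; in libgomp,
   once per thread) and every call finishes before this entrypoint
   returns, so FN never runs after the parallel region ends. */
static void toy_gomp_parallel (void (*fn) (void *), void *data)
{
  fn (data);
}
```

Because the callback only ever runs inside the entrypoint's dynamic extent, an analysis that knows this does not have to treat the data block as escaping to arbitrary later code.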
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=46032

--- Comment #12 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to vries from comment #11)
> The ipa-pta solution no longer works.  In 4.6, we had:
> ...
>   # USE = anything
>   # CLB = anything
>   GOMP_parallel_startD.1048 (main._omp_fn.0D.1472, &.omp_data_o.1D.1484, 0);
>   # USE = anything
>   # CLB = anything
>   main._omp_fn.0D.1472 (&.omp_data_o.1D.1484);
>   # USE = anything
>   # CLB = anything
>   GOMP_parallel_endD.1049 ();
> ...
> On trunk, we have now:
> ...
>   # USE = anything
>   # CLB = anything
>   GOMP_parallelD.1345 (main._omp_fn.0D.1844, &.omp_data_o.1D.1856, 0, 0);
> ...
> So there's no longer a path in the call graph from main to main._omp_fn.
> Perhaps a dummy body for GOMP_parallel could fix that.

Hm?  The IPA PTA solution was to tell IPA PTA that the call to GOMP_parallel
doesn't make .omp_data_o escape.  The attached patch doesn't work because it
only patches GOMP_parallel_start, not GOMP_parallel.

Of course it would be even better to teach IPA PTA that GOMP_parallel is
really invoking main._omp_fn.0 with a &.omp_data_o.1 argument.  How many
different variants of IL do we get doing this kind of indirect function
invocation?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=46032

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed |Added
           ----------------------------------------------------------------
             CC    |        |hubicka at gcc dot gnu.org

--- Comment #13 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #12)
> ...
> Of course it would be even better to teach IPA PTA that GOMP_parallel is
> really invoking main._omp_fn.0 with a &.omp_data_o.1 argument.  How many
> different variants of IL do we get doing this kind of indirect function
> invocation?

Other IPA propagators like IPA-CP probably also would like to know this.

I see various builtins taking an OMPFN argument in omp-builtins.def.  If we
assume the GOMP runtime itself is transparent, then do we know how the
builtins end up calling the actual implementation function?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=46032

--- Comment #16 from vries at gcc dot gnu.org ---
(In reply to Richard Biener from comment #12)
> (In reply to vries from comment #11)
> > The ipa-pta solution no longer works.  In 4.6, we had:
> > ...
> >   GOMP_parallel_startD.1048 (main._omp_fn.0D.1472, &.omp_data_o.1D.1484, 0);
> >   main._omp_fn.0D.1472 (&.omp_data_o.1D.1484);
> >   GOMP_parallel_endD.1049 ();
> > ...
> > On trunk, we have now:
> > ...
> >   GOMP_parallelD.1345 (main._omp_fn.0D.1844, &.omp_data_o.1D.1856, 0, 0);
> > ...
> > So there's no longer a path in the call graph from main to main._omp_fn.
> > Perhaps a dummy body for GOMP_parallel could fix that.
>
> Hm?  The IPA PTA solution was to tell IPA PTA that the call to GOMP_parallel
> [ GOMP_parallel_start ] doesn't make .omp_data_o escape.

Right, for 4.6, adding fnspec ".rw" to GOMP_parallel_start has this effect in
ipa-pta:
...
D.1505_14 = { ESCAPED NONLOCAL pData }
D.1509_18 = { ESCAPED NONLOCAL results }
-->
D.1505_14 = { pData }
D.1509_18 = { results }
...
where _14 and _18 are the omp_data_i relative loads in the split-off
function:
...
  # VUSE <.MEMD.1514_20>
  # PT = nonlocal
  D.1505_14 = .omp_data_iD.1474_13(D)->pDataD.1477;
  # VUSE <.MEMD.1514_20>
  D.1506_15 = *D.1505_14[idxD.1495_1];
...
  # VUSE <.MEMD.1514_20>
  # PT = nonlocal
  D.1509_18 = .omp_data_iD.1474_13(D)->resultsD.1479;
  # .MEMD.1514_22 = VDEF <.MEMD.1514_20>
  *D.1509_18[idxD.1495_1] = D.1508_17;
...

> The attached patch doesn't work because it only patches GOMP_parallel_start,
> not GOMP_parallel.

[ GOMP_parallel_start is no longer around on trunk. ]

Applying the 4.6 patch on trunk (and dropping the loop in the hunk for
intra_create_variable_infos that does not apply cleanly anymore) and
applying fnspec ".rw" on GOMP_parallel gives us in ipa-pta:
...
_17 = { }
_21 = { }
...
where _17 and _21 are the omp_data_i relative loads in the split-off
function:
...
  # VUSE <.MEM_4>
  # PT = nonlocal escaped
  _17 = MEM[(struct .omp_data_s.0D.1713 *).omp_data_i_16(D) clique 1
base 1].pDataD.1719;
  # VUSE <.MEM_4>
  _18 = *_17[idx_1];
  # VUSE <.MEM_4>
  # PT = nonlocal escaped
  _21 = MEM[(struct .omp_data_s.0D.1713 *).omp_data_i_16(D) clique 1
base 1].resultsD.1721;
  # .MEM_22 = VDEF <.MEM_4>
  *_21[idx_1] = _20;
...

It is reasonable to assume that we are no longer able to relate these loads
in the split-off function back to pData and results in the donor function,
because there's no longer a direct function call to main._omp_fn in the
donor function.  On 4.6, that direct call to main._omp_fn still existed; on
trunk, not anymore.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=46032

--- Comment #14 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #13)
> ...
> I see various builtins taking an OMPFN argument in omp-builtins.def.  If we
> assume the GOMP runtime itself is transparent, then do we know how the
> builtins end up calling the actual implementation function?

GOMP_parallel* call the ompfn function (first argument) with the second
argument (a pointer to some structure, filled in before GOMP_parallel* and
dead (using a clobber) after the call) as the only argument.  The callback
function can be called just once, or more times (once in each thread).

Then there is GOMP_task*, which takes one or two callback functions.  If
just one (the other one is NULL), then the first callback (1st argument) is
called either with the second argument as its only argument, or with a
pointer to a memory block that was filled with memcpy from the second
argument.  If the third argument (the second callback) is non-NULL, then
that callback is called instead of the memcpy, and the pointers can be to
two different structures.

GOMP_target is another case, but there is often a cross-device boundary in
between the two, so it is much harder to model that for IPA-PTA etc.
purposes.

So, schematically, GOMP_parallel* (fn1, data1, ...) performs:

  if (somecond)
    for (...)
      pthread_create (..., fn1, data1);
  fn1 (data1);
  if (somecond)
    for (...)
      pthread_join (...);

and GOMP_task (fn1, data1, fn2, ...) performs:

  if (fn2 == 0 && somecond1)
    fn1 (data1);
  else
    {
      char *buf = malloc (...); // or alloca/vla
      if (fn2 == 0)
    }
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=46032

--- Comment #15 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
and GOMP_task (fn1, data1, fn2, ...) performs:

  if (somecond)
    {
      if (fn2 == 0)
        fn1 (data1);
      else
        {
          void *buf = alloca (...); // Takes care also about alignment
          fn2 (buf, data1);
          fn1 (buf);
        }
    }
  else
    {
      void *buf = malloc (...); // Takes care also about alignment
      if (fn2 == 0)
        memcpy (buf, data1, ...);
      else
        fn2 (buf, data1);
      // Arrange for fn1 (buf); to be called at some point later
      // (like C++ futures)
    }

The purpose of fn2 is to run copy constructors of the vars, for vars that
will be residing within buf.
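Jakub's schematic can be turned into a runnable toy (hypothetical names; the real libgomp queues deferred tasks on worker threads and handles alignment, which this sketch does not):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy model of the GOMP_task dispatch sketched above.  fn1 is the task
   body, fn2 the optional "copy constructor" into task-owned storage. */
static void toy_gomp_task (void (*fn1) (void *), void *data1,
                           void (*fn2) (void *, void *), size_t size,
                           int run_now)
{
  if (run_now)
    {
      if (fn2 == 0)
        fn1 (data1);                /* run directly on the caller's block */
      else
        {
          void *buf = malloc (size);
          fn2 (buf, data1);         /* copy-construct into task storage */
          fn1 (buf);
          free (buf);
        }
    }
  else
    {
      void *buf = malloc (size);
      if (fn2 == 0)
        memcpy (buf, data1, size);
      else
        fn2 (buf, data1);
      fn1 (buf);                    /* deferred in reality; immediate here */
      free (buf);
    }
}

/* Example callbacks (invented for this sketch). */
static void task_body (void *p) { *(int *) p += 1; }
static void task_copy (void *dst, void *src) { *(int *) dst = *(int *) src * 10; }
```

Note the aliasing consequence: with fn2 present (or with a deferred task), the body never touches the caller's block at all, only the copy.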
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=46032

--- Comment #10 from vries at gcc dot gnu.org ---
An observation: a patch like this allows vectorization without an alias
check:
...
diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index 8290a65..501d631 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -1241,7 +1241,12 @@ install_var_field (tree var, bool by_ref, int mask,
omp_context *ctx)
       type = build_pointer_type (build_pointer_type (type));
     }
   else if (by_ref)
+#if 0
     type = build_pointer_type (type);
+#else
+    type = build_qualified_type (build_reference_type (type),
+                                 TYPE_QUAL_RESTRICT);
+#endif
   else if ((mask & 3) == 1 && is_reference (var))
     type = TREE_TYPE (type);
...

The problem is that we don't have the information at this point to decide
between a pointer and a restrict reference.  If var were a scalar, we could
use addr_taken to ensure that the var is not aliased.  For arrays that
doesn't work.  If the C frontend would distinguish between
- element read: result[x], and
- alias created: &result, result, &result[x]
and store that in an alias_created property, we could use that property to
decide between pointer and restrict reference.  That would not fix the
problem in general though, since that solution would already no longer work
if the example were rewritten using pointers.

I wonder if postponing omp_expand till after ealias would give us enough
information to update the field reference types with a restrict tag (or not)
at that point.  [ Though I'm not sure if doing that update there would
actually have the desired effect. ]
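A reduced, hand-written illustration of what the patch above is after (names invented, not compiler output): with plain pointer fields the compiler must assume the two arrays may overlap, so vectorizing needs a runtime alias check, while restrict-qualified fields let it assume they are disjoint:

```c
#include <assert.h>

/* Two variants of the outlined body's receiver record (hypothetical). */
struct data_plain    { float *pData; float *results; float coeff; };
struct data_restrict { float *restrict pData; float *restrict results; float coeff; };

void body_plain (struct data_plain *d, int n)
{
  for (int i = 0; i < n; i++)       /* alias check needed to vectorize */
    d->results[i] = d->pData[i] * d->coeff;
}

void body_restrict (struct data_restrict *d, int n)
{
  for (int i = 0; i < n; i++)       /* vectorizable without a check */
    d->results[i] = d->pData[i] * d->coeff;
}
```

Both compute the same thing; the difference is only in what the compiler is allowed to assume, which is exactly why the patch must be sure the restrict promise actually holds.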
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=46032

--- Comment #11 from vries at gcc dot gnu.org ---
The ipa-pta solution no longer works.  In 4.6, we had:
...
  # USE = anything
  # CLB = anything
  GOMP_parallel_startD.1048 (main._omp_fn.0D.1472, &.omp_data_o.1D.1484, 0);
  # USE = anything
  # CLB = anything
  main._omp_fn.0D.1472 (&.omp_data_o.1D.1484);
  # USE = anything
  # CLB = anything
  GOMP_parallel_endD.1049 ();
...
On trunk, we have now:
...
  # USE = anything
  # CLB = anything
  GOMP_parallelD.1345 (main._omp_fn.0D.1844, &.omp_data_o.1D.1856, 0, 0);
...
So there's no longer a path in the call graph from main to main._omp_fn.
Perhaps a dummy body for GOMP_parallel could fix that.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=46032

vries at gcc dot gnu.org changed:

           What    |Removed |Added
           ----------------------------------------------------------------
             CC    |        |vries at gcc dot gnu.org

--- Comment #9 from vries at gcc dot gnu.org ---
Created attachment 33348
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33348&action=edit
patch for 4.6 branch

- patches from comment 1 and 3
- c-common.c patch re-applied to lto/lto-lang.c to fix lto build breaker
- testcases added

Patch applies to the 4.6 branch, a non-bootstrap build succeeds, and the
added test-cases pass.

In the 4.7 branch, this snippet:
...
@@ -5612,6 +5611,12 @@ intra_create_variable_infos (void)
          rhsc.offset = 0;
          process_constraint (new_constraint (lhsc, rhsc));
          vi->is_restrict_var = 1;
+         do
+           {
+             make_constraint_from (vi, nonlocal_id);
+             vi = vi->next;
+           }
+         while (vi);
          continue;
        }
...
conflicts with:
...
          rhsc.offset = 0;
          process_constraint (new_constraint (lhsc, rhsc));
          for (; vi; vi = vi->next)
            if (vi->may_have_pointers)
              {
                if (vi->only_restrict_pointers)
                  make_constraint_from_global_restrict (vi, "GLOBAL_RESTRICT");
                else
                  make_copy_constraint (vi, nonlocal_id);
              }
          continue;
        }
...
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46032

--- Comment #8 from Dominique d'Humieres <dominiq at lps dot ens.fr> ---
See also PR60997.
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46032

Feng Chen <fchen at gmail dot com> changed:

           What    |Removed |Added
           ----------------------------------------------------------------
             CC    |        |fchen at gmail dot com

--- Comment #7 from Feng Chen <fchen at gmail dot com> 2012-07-06 16:17:28 UTC ---
Any update on this?  I do see loops getting slower even for large nx*ny
sometimes after omp on gcc 4.6.2, e.g.:

#pragma omp parallel for
for (int iy = 0; iy < ny; iy++)
  {
    for (int ix = 0; ix < nx; ix++)
      {
        dest[(size_t)iy*nx + ix] = src[(size_t)iy*nx + ix] * 2;
      }
  }

Sometimes gcc won't vectorize the inner loop; I have to put it into an
inline function to force it.  The performance is only marginally better
after that.

ps: I split the loop in two because I noticed previously that omp parallel
inhibits auto-vectorization; forgot which gcc version I used ...  Graphite
did improve the scalability of openmp programs in my experience, so the fix
(with tests) is important ...

(In reply to comment #6)
> Good.  But if Graphite breaks it, let's add Sebastian in CC..
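The inline-function workaround mentioned in the comment above might look like this (a sketch with invented names; whether the vectorizer actually fires still depends on flags and GCC version):

```c
#include <assert.h>
#include <stddef.h>

/* Inner loop moved into its own small function, so the vectorizer can
   analyze it without the surrounding OpenMP machinery. */
static inline void scale_row (float *dest, const float *src, int nx)
{
  for (int ix = 0; ix < nx; ix++)
    dest[ix] = src[ix] * 2;
}

void scale (float *dest, const float *src, int nx, int ny)
{
#pragma omp parallel for
  for (int iy = 0; iy < ny; iy++)
    scale_row (dest + (size_t) iy * nx, src + (size_t) iy * nx, nx);
}
```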
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46032

--- Comment #5 from vincenzo Innocente <vincenzo.innocente at cern dot ch> 2011-07-26 13:00:18 UTC ---
In case anybody is wondering: it seems fixed in

Using built-in specs.
COLLECT_GCC=c++
COLLECT_LTO_WRAPPER=/afs/cern.ch/user/i/innocent/w2/libexec/gcc/x86_64-unknown-linux-gnu/4.7.0/lto-wrapper
Target: x86_64-unknown-linux-gnu
Configured with: ./configure --prefix=/afs/cern.ch/user/i/innocent/w2
--enable-languages=c,c++,fortran --enable-gold=yes --enable-lto
--with-build-config=bootstrap-lto --with-gmp-lib=/usr/local/lib64
--with-mpfr-lib=/usr/local/lib64 --with-mpc-lib=/usr/local/lib64
--enable-cloog-backend=isl --with-cloog=/usr/local
--with-ppl-lib=/usr/local/lib64 CFLAGS='-O2 -ftree-vectorize -fPIC'
CXXFLAGS='-O2 -fPIC -ftree-vectorize -fvisibility-inlines-hidden'
Thread model: posix
gcc version 4.7.0 20110725 (experimental) (GCC)

c++ -std=gnu++0x -DNDEBUG -Wall -Ofast -mavx openmpvector.cpp
-ftree-vectorizer-verbose=7 -fopenmp

openmpvector.cpp:11: note: versioning for alias required: can't determine
dependence between *pretmp.11_32[idx_3] and *pretmp.11_34[idx_3]
openmpvector.cpp:11: note: mark for run-time aliasing test between
*pretmp.11_32[idx_3] and *pretmp.11_34[idx_3]
openmpvector.cpp:11: note: versioning for alias required: can't determine
dependence between .omp_data_i_14(D)->coeff and *pretmp.11_34[idx_3]
openmpvector.cpp:11: note: mark for run-time aliasing test between
.omp_data_i_14(D)->coeff and *pretmp.11_34[idx_3]
openmpvector.cpp:11: note: Unknown alignment for access: *pretmp.11_32
openmpvector.cpp:11: note: Unknown alignment for access: *pretmp.11_34
openmpvector.cpp:11: note: Vectorizing an unaligned access.
openmpvector.cpp:11: note: Vectorizing an unaligned access.
openmpvector.cpp:11: note: Vectorizing an unaligned access.
openmpvector.cpp:11: note: vect_model_load_cost: unaligned supported by
hardware.
openmpvector.cpp:11: note: vect_model_load_cost: inside_cost = 2,
outside_cost = 0 .
openmpvector.cpp:11: note: vect_model_load_cost: unaligned supported by
hardware.
openmpvector.cpp:11: note: vect_model_load_cost: inside_cost = 2,
outside_cost = 0 .
openmpvector.cpp:11: note: vect_model_simple_cost: inside_cost = 1,
outside_cost = 0 .
openmpvector.cpp:11: note: vect_model_store_cost: unaligned supported by
hardware.
openmpvector.cpp:11: note: vect_model_store_cost: inside_cost = 2,
outside_cost = 0 .
openmpvector.cpp:11: note: cost model: Adding cost of checks for loop
versioning aliasing.
openmpvector.cpp:11: note: cost model: epilogue peel iters set to vf/2
because loop iterations are unknown .
openmpvector.cpp:11: note: Cost model analysis:
  Vector inside of loop cost: 7
  Vector outside of loop cost: 19
  Scalar iteration cost: 4
  Scalar outside cost: 1
  prologue iterations: 0
  epilogue iterations: 2
  Calculated minimum iters for profitability: 7
openmpvector.cpp:11: note:   Profitability threshold = 6
openmpvector.cpp:11: note: Profitability threshold is 6 loop iterations.
openmpvector.cpp:11: note: create runtime check for data references
*pretmp.11_32[idx_3] and *pretmp.11_34[idx_3]
openmpvector.cpp:11: note: create runtime check for data references
.omp_data_i_14(D)->coeff and *pretmp.11_34[idx_3]
openmpvector.cpp:11: note: created 2 versioning for alias checks.
openmpvector.cpp:11: note: LOOP VECTORIZED.
openmpvector.cpp:9: note: vectorized 1 loops in function.

Graphite breaks it, though:

c++ -std=gnu++0x -DNDEBUG -Wall -Ofast -mavx openmpvector.cpp
-ftree-vectorizer-verbose=7 -fopenmp -fgraphite -fgraphite-identity
-floop-block -floop-flatten -floop-interchange -floop-strip-mine
-ftree-loop-linear -floop-parallelize-all

openmpvector.cpp:9: note: not vectorized: data ref analysis failed
D.2372_47 = *pretmp.11_32[D.2403_49];
openmpvector.cpp:9: note: vectorized 0 loops in function.
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46032

Paolo Carlini <paolo.carlini at oracle dot com> changed:

           What    |Removed |Added
           ----------------------------------------------------------------
             CC    |        |spop at gcc dot gnu.org

--- Comment #6 from Paolo Carlini <paolo.carlini at oracle dot com> 2011-07-26 13:47:35 UTC ---
Good.  But if Graphite breaks it, let's add Sebastian in CC..
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46032

Richard Guenther <rguenth at gcc dot gnu.org> changed:

           What       |Removed     |Added
           ----------------------------------------------------------------
             Status   |UNCONFIRMED |NEW
   Last reconfirmed   |            |2010.10.15 10:08:39
                 CC   |            |jakub at gcc dot gnu.org,
                      |            |rguenth at gcc dot gnu.org
     Ever Confirmed   |0           |1

--- Comment #1 from Richard Guenther <rguenth at gcc dot gnu.org> 2010-10-15 10:08:39 UTC ---
The problem is that local variables are accessed indirectly via the
.omp_data_i pointer, and alias analysis is unable to hoist the load of
.omp_data_i_12(D)->coeff across the store to *pretmp.5_27[idx_1].  A fix is
to make the argument DECL_BY_REFERENCE and the type restrict qualified.
This will make alias analysis assume that the pointed-to object is not
aliased unless later somebody takes its address.

<bb 3>:
  pretmp.5_23 = .omp_data_i_12(D)->pData;
  pretmp.5_27 = .omp_data_i_12(D)->results;

<bb 4>:
  # idx_1 = PHI <idx_8(3), idx_18(5)>
  D.2142_14 = *pretmp.5_23[idx_1];
  D.2143_15 = .omp_data_i_12(D)->coeff;
  D.2144_16 = D.2142_14 * D.2143_15;
  *pretmp.5_27[idx_1] = D.2144_16;
  idx_18 = idx_1 + 1;
  if (D.2139_10 > idx_18)
    goto <bb 5>;
  else
    goto <bb 6>;

<bb 5>:
  goto <bb 4>;

Not completely enough though, as we consider *.omp_data_i escaped (and thus
reachable by NONLOCAL).  The following fixes that (with unknown
consequences; I think fortran array descriptors are the only other user):

Index: gcc/omp-low.c
===================================================================
--- gcc/omp-low.c       (revision 165474)
+++ gcc/omp-low.c       (working copy)
@@ -1349,7 +1349,8 @@ fixup_child_record_type (omp_context *ct
       layout_type (type);
     }

-  TREE_TYPE (ctx->receiver_decl) = build_pointer_type (type);
+  TREE_TYPE (ctx->receiver_decl) = build_qualified_type (build_pointer_type (type),
+                                                         TYPE_QUAL_RESTRICT);
 }

 /* Instantiate decls as necessary in CTX to satisfy the data sharing
@@ -1584,6 +1585,7 @@ create_omp_child_function (omp_context *
   DECL_NAMELESS (t) = 1;
   DECL_ARG_TYPE (t) = ptr_type_node;
   DECL_CONTEXT (t) = current_function_decl;
+  DECL_BY_REFERENCE (t) = 1;
   TREE_USED (t) = 1;
   DECL_ARGUMENTS (decl) = t;
   if (!task_copy)
Index: gcc/tree-ssa-structalias.c
===================================================================
--- gcc/tree-ssa-structalias.c  (revision 165474)
+++ gcc/tree-ssa-structalias.c  (working copy)
@@ -5575,7 +5575,6 @@ intra_create_variable_infos (void)
          var_ann_t ann;
          heapvar = create_tmp_var_raw (TREE_TYPE (TREE_TYPE (t)),
                                        "PARM_NOALIAS");
-         DECL_EXTERNAL (heapvar) = 1;
          heapvar_insert (t, 0, heapvar);
          ann = get_var_ann (heapvar);
          ann->is_heapvar = 1;
@@ -5590,6 +5589,12 @@ intra_create_variable_infos (void)
          rhsc.offset = 0;
          process_constraint (new_constraint (lhsc, rhsc));
          vi->is_restrict_var = 1;
+         do
+           {
+             make_constraint_from (vi, nonlocal_id);
+             vi = vi->next;
+           }
+         while (vi);
          continue;
        }

It means that stores to *.omp_data_i in the omp fn are considered not
escaping to the caller (and thus can be DSEd).  With the above patch the
loop is vectorized with a runtime alias check, as we can't see that results
and pData do not alias.  Not even with IPA-PTA, as the OMP function escapes
through __builtin_GOMP_parallel_start.
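The aliasing obstacle this comment analyzes can be shown in a reduced, hand-written form (hypothetical names, not the PR's testcase): unless alias analysis proves the store through d->results cannot clobber *d, the load of d->coeff must stay inside the loop and the pData/results dependence stays unknown.

```c
#include <assert.h>

/* Reduced model of the outlined body's accesses through .omp_data_i. */
struct data { float *pData; float *results; float coeff; };

void kernel (struct data *d, int n)
{
  for (int i = 0; i < n; i++)
    /* d->coeff conceptually re-loaded each iteration: the compiler may
       hoist it only if d->results[i] provably never writes into *d. */
    d->results[i] = d->pData[i] * d->coeff;
}
```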
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46032

--- Comment #2 from Richard Guenther <rguenth at gcc dot gnu.org> 2010-10-15 10:30:42 UTC ---
If I hack PTA to make the omp function not escape, IPA-PTA computes

<bb 4>:
  # idx_1 = PHI <idx_11(3), idx_18(6)>
  # PT = { D.2069 }
  D.2112_13 = .omp_data_i_12(D)->pData;
  D.2113_14 = *D.2112_13[idx_1];
  D.2114_15 = .omp_data_i_12(D)->coeff;
  D.2115_16 = D.2113_14 * D.2114_15;
  # PT = { D.2068 }
  D.2116_17 = .omp_data_i_12(D)->results;

thus it knows what the pointers point to, and we vectorize w/o a runtime
alias check (we still have no idea about alignment though, but that's
probably correct).

Thus it might be worth annotating some of the OMP builtins with the fnspec
attribute.
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46032

--- Comment #3 from Richard Guenther <rguenth at gcc dot gnu.org> 2010-10-15 11:51:58 UTC ---
Created attachment 22053
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=22053
fnspec attr test

Like this (ugh).  Fixes the thing with -fipa-pta on trunk.
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46032

--- Comment #4 from Richard Guenther <rguenth at gcc dot gnu.org> 2010-10-15 12:09:38 UTC ---
A few things to consider:

  __builtin_GOMP_parallel_start (main._omp_fn.0, &.omp_data_o.1, 0);
  main._omp_fn.0 (&.omp_data_o.1);
  __builtin_GOMP_parallel_end ();

For PTA purposes we can ignore that __builtin_GOMP_parallel_start calls
main._omp_fn.0, and I suppose the function pointer doesn't escape through
it.  We can't assume that .omp_data_o.1 does not escape through
__builtin_GOMP_parallel_start though, as __builtin_GOMP_parallel_end needs
to be a barrier for optimization (and thus needs to be considered reading
and writing .omp_data_o.1).  As it doesn't take any arguments, the only way
to ensure that is by making .omp_data_o.1 escape.

We could probably arrange for __builtin_GOMP_parallel_end to get
&.omp_data_o.1 as an argument solely for alias-analysis purposes though.
In that case we could use ".xw" for __builtin_GOMP_parallel_start and ".w"
for __builtin_GOMP_parallel_end.