[Bug tree-optimization/46032] openmp inhibits loop vectorization

2015-11-30 Thread vries at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=46032

--- Comment #23 from vries at gcc dot gnu.org ---
Author: vries
Date: Mon Nov 30 16:34:26 2015
New Revision: 231076

URL: https://gcc.gnu.org/viewcvs?rev=231076&root=gcc&view=rev
Log:
Handle BUILT_IN_GOMP_PARALLEL in ipa-pta

2015-11-30  Tom de Vries  

PR tree-optimization/46032
* tree-ssa-structalias.c (find_func_aliases_for_call_arg): New
function,
factored out of ...
(find_func_aliases_for_call): ... here.
(find_func_aliases_for_builtin_call, find_func_clobbers): Handle
BUILT_IN_GOMP_PARALLEL.
(ipa_pta_execute): Same.  Handle node->parallelized_function as a local
function.

* gcc.dg/pr46032.c: New test.

* testsuite/libgomp.c/pr46032.c: New test.

Added:
trunk/gcc/testsuite/gcc.dg/pr46032.c
trunk/libgomp/testsuite/libgomp.c/pr46032.c
Modified:
trunk/gcc/ChangeLog
trunk/gcc/testsuite/ChangeLog
trunk/gcc/tree-ssa-structalias.c
trunk/libgomp/ChangeLog

[Bug tree-optimization/46032] openmp inhibits loop vectorization

2015-11-30 Thread vries at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=46032

vries at gcc dot gnu.org changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED
   Assignee|unassigned at gcc dot gnu.org  |vries at gcc dot gnu.org

--- Comment #24 from vries at gcc dot gnu.org ---
patch with testcase committed, marking resolved-fixed.

[Bug tree-optimization/46032] openmp inhibits loop vectorization

2015-11-30 Thread vries at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=46032

vries at gcc dot gnu.org changed:

   What|Removed |Added

   Keywords||missed-optimization, patch
   Severity|major   |enhancement

--- Comment #22 from vries at gcc dot gnu.org ---
https://gcc.gnu.org/ml/gcc-patches/2015-11/msg03448.html

[Bug tree-optimization/46032] openmp inhibits loop vectorization

2015-11-19 Thread vries at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=46032

--- Comment #20 from vries at gcc dot gnu.org ---
This patch seems to have the desired effect on the original testcase: 
...
diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index 830db75..996756b 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -9361,6 +9361,7 @@ expand_omp_for_static_nochunk (struct omp_region *region,
   if (collapse_bb == NULL)
loop->latch = cont_bb;
   add_loop (loop, body_bb->loop_father);
+  loop->safelen = INT_MAX;
 }
 }
...

AFAIU, adding the omp for to the loop is an assertion that the loop is
independent. It seems reasonable to assume that if the original loop was
independent, the loop operating on a slice of the original iteration space will
be independent as well.

[Bug tree-optimization/46032] openmp inhibits loop vectorization

2015-11-19 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=46032

--- Comment #21 from Jakub Jelinek  ---
(In reply to vries from comment #20)
> This patch seems to have the desired effect on the original testcase: 
> ...
> diff --git a/gcc/omp-low.c b/gcc/omp-low.c
> index 830db75..996756b 100644
> --- a/gcc/omp-low.c
> +++ b/gcc/omp-low.c
> @@ -9361,6 +9361,7 @@ expand_omp_for_static_nochunk (struct omp_region
> *region,
>if (collapse_bb == NULL)
> loop->latch = cont_bb;
>add_loop (loop, body_bb->loop_father);
> +  loop->safelen = INT_MAX;
>  }
>  }
> ...
> 
> AFAIU, adding the omp for to the loop is an assertion that the loop is
> independent. It seems reasonable to assume that if the original loop was
> independent, the loop operating on a slice of the original iteration space
> will be independent as well.

That is very much wrong.  Static scheduling, both nochunk and chunk, doesn't
imply in any way that the iterations are independent, the OpenMP standard says
how the work is split among the threads, with nochunk that threads get
consecutive sets of iterations as one chunk that are approximately the same
size, but even though it is not specified how exactly the iteration
space is divided (for nochunk), if you make the loop iterations independent,
you would break many observable properties (say through threadprivate vars,
omp_get_thread_num etc.).
Note loop->safelen == INT_MAX is actually weaker than independent iterations,
when loop->safelen == INT_MAX, there can be dependencies, but only of certain
kinds, it says that it is equivalent if you run the loop normally and if you
run simultaneously (or emulated) the first statements of all the iterations,
then second statements and so on (so vectorize with any vectorization factor
the compiler wants).
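To illustrate the weaker guarantee (a hypothetical example, not from this PR's testcase): with `safelen(4)`, the iterations may carry a dependence, as long as executing up to 4 of them in lockstep gives the same result as serial execution.

```c
/* Hypothetical illustration of safelen semantics: a[i] depends on a[i-4],
   so the iterations are NOT independent, yet running the loop with any
   vectorization factor up to 4 (lockstep execution of the statements)
   yields the serial result -- which is what safelen(4) asserts.  */
void step4(float *a, int n)
{
  #pragma omp simd safelen(4)
  for (int i = 4; i < n; i++)
    a[i] = a[i - 4] + 1.0f;
}
```

Compiled with -fopenmp-simd, the pragma licenses vectorization factors up to 4; loop->safelen == INT_MAX is the analogous claim for any factor.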

[Bug tree-optimization/46032] openmp inhibits loop vectorization

2015-05-27 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=46032

--- Comment #17 from Richard Biener rguenth at gcc dot gnu.org ---
(In reply to vries from comment #16)
 (In reply to Richard Biener from comment #12)
  (In reply to vries from comment #11)
   The ipa-pta solution no longer works. In 4.6, we had:
   ...
 # USE = anything
 # CLB = anything
 GOMP_parallel_startD.1048 (main._omp_fn.0D.1472, .omp_data_o.1D.1484, 
   0);
 # USE = anything
 # CLB = anything
 main._omp_fn.0D.1472 (.omp_data_o.1D.1484);
 # USE = anything
 # CLB = anything
 GOMP_parallel_endD.1049 ();
   ...
   
   On trunk, we have now:
   ...
 # USE = anything
 # CLB = anything
 GOMP_parallelD.1345 (main._omp_fn.0D.1844, .omp_data_o.1D.1856, 0, 0);
   ...
   
   So there's no longer a path in the call graph from main to main._omp_fn.
   Perhaps a dummy body for GOMP_parallel could fix that.
  
  Hm?  The IPA PTA solution was to tell IPA PTA that the call to
  GOMP_parallel
 
 [ GOMP_parallel_start ]
 
  doesn't make .omp_data_o escape.
  
 
 Right, for 4.6, adding fnspec .rw to GOMP_parallel_start has this effect
 in ipa-pta:
 ...
 D.1505_14 = { ESCAPED NONLOCAL pData }
 D.1509_18 = { ESCAPED NONLOCAL results }
 --
 D.1505_14 = { pData }
 D.1509_18 = { results }
 ...
 
 where _14 and _18 are the omp_data_i relative loads in the split-off
 function:
 ...
   # VUSE .MEMD.1514_20
   # PT = nonlocal
  D.1505_14 = .omp_data_iD.1474_13(D)->pDataD.1477;
 
   # VUSE .MEMD.1514_20
   D.1506_15 = *D.1505_14[idxD.1495_1];
 
   ...
 
   # VUSE .MEMD.1514_20
   # PT = nonlocal
  D.1509_18 = .omp_data_iD.1474_13(D)->resultsD.1479;
 
   # .MEMD.1514_22 = VDEF .MEMD.1514_20
   *D.1509_18[idxD.1495_1] = D.1508_17;
 ...
 
 
  The attached patch doesn't work because it only patches GOMP_parallel_start,
  not GOMP_parallel.
  
 
 [ GOMP_parallel_start is no longer around on trunk. ] Applying the 4.6 patch
 on trunk (and dropping the loop in the hunk for intra_create_variable_infos
 that does not apply cleanly anymore) and applying fnspec .rw on
 GOMP_parallel, gives us in ipa-pta:
 ...
 _17 = { }
 _21 = { }
 ...
 
 where _17 and _21 are the omp_data_i relative loads in the split-off
 function:
 ...
   # VUSE .MEM_4
   # PT = nonlocal escaped
  _17 = MEM[(struct .omp_data_s.0D.1713 *).omp_data_i_16(D) clique 1 base
1].pDataD.1719;
 
   # VUSE .MEM_4
   _18 = *_17[idx_1];
 
   # VUSE .MEM_4
   # PT = nonlocal escaped
  _21 = MEM[(struct .omp_data_s.0D.1713 *).omp_data_i_16(D) clique 1 base
1].resultsD.1721;
 
   # .MEM_22 = VDEF .MEM_4
   *_21[idx_1] = _20;
 ...
 
 It is reasonable to assume that we no longer are able to relate back these
 loads in the split-off function to pData and result in the donor function,
 due to the fact that there's no longer a direct function call to
 main._omp_fn in the donor function.
 
 On 4.6, that direct function call to main._omp_fn still existed. On trunk,
 not anymore.

In fact it even looks like wrong IPA PTA results to me (_17 and _21 point to
nothing).

Index: gcc/tree-ssa-structalias.c
===
--- gcc/tree-ssa-structalias.c  (revision 223737)
+++ gcc/tree-ssa-structalias.c  (working copy)
@@ -7372,7 +7372,8 @@ ipa_pta_execute (void)
 constraints for parameters.  */
   if (node->used_from_other_partition
  || node->externally_visible
- || node->force_output)
+ || node->force_output
+ || node->address_taken)
{
  intra_create_variable_infos (func);

fixes that.  Of course that means a solution handling the OMP builtins
specially is not going to work: if the function has its address taken,
we don't know whether it is called from anywhere else.  The fix is
required for correctness though.

_17 = { ESCAPED NONLOCAL }
_21 = { ESCAPED NONLOCAL }

  _18 = *_17[idx_3];
  *_21[idx_3] = _20;

handling the OMP builtin specially will only get the solution amended
to { ESCAPED NONLOCAL results } and { ESCAPED NONLOCAL pData } and
thus still conflict.


[Bug tree-optimization/46032] openmp inhibits loop vectorization

2015-05-27 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=46032

--- Comment #19 from rguenther at suse dot de rguenther at suse dot de ---
On Wed, 27 May 2015, jakub at gcc dot gnu.org wrote:

 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=46032
 
 --- Comment #18 from Jakub Jelinek jakub at gcc dot gnu.org ---
 The *.omp_fn.* functions indeed, while they necessarily have to be 
 addressable,
 because that is how they are passed to the libgomp entrypoints, are never
 called by anything but the libgomp runtime.  For GOMP_parallel*, they are only
 called before the GOMP_parallel* function exits, for GOMP_task* they could be
 called at some later point.

Ok, so this just means that IPA PTA would need to handle those specially
(and thus the OMP functions should be marked specially in the cgraph
node).  Not that I think IPA PTA is anywhere near production ready
(or I have time to fix it up properly...).  Just testing the addressable
fix now.


[Bug tree-optimization/46032] openmp inhibits loop vectorization

2015-05-27 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=46032

--- Comment #18 from Jakub Jelinek jakub at gcc dot gnu.org ---
The *.omp_fn.* functions indeed, while they necessarily have to be addressable,
because that is how they are passed to the libgomp entrypoints, are never
called by anything but the libgomp runtime.  For GOMP_parallel*, they are only
called before the GOMP_parallel* function exits, for GOMP_task* they could be
called at some later point.


[Bug tree-optimization/46032] openmp inhibits loop vectorization

2015-05-26 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=46032

--- Comment #12 from Richard Biener rguenth at gcc dot gnu.org ---
(In reply to vries from comment #11)
 The ipa-pta solution no longer works. In 4.6, we had:
 ...
   # USE = anything
   # CLB = anything
   GOMP_parallel_startD.1048 (main._omp_fn.0D.1472, .omp_data_o.1D.1484, 0);
   # USE = anything
   # CLB = anything
   main._omp_fn.0D.1472 (.omp_data_o.1D.1484);
   # USE = anything
   # CLB = anything
   GOMP_parallel_endD.1049 ();
 ...
 
 On trunk, we have now:
 ...
   # USE = anything
   # CLB = anything
   GOMP_parallelD.1345 (main._omp_fn.0D.1844, .omp_data_o.1D.1856, 0, 0);
 ...
 
 So there's no longer a path in the call graph from main to main._omp_fn.
 Perhaps a dummy body for GOMP_parallel could fix that.

Hm?  The IPA PTA solution was to tell IPA PTA that the call to GOMP_parallel
doesn't make .omp_data_o escape.

The attached patch doesn't work because it only patches GOMP_parallel_start,
not GOMP_parallel.

Of course it would even better to teach IPA PTA that GOMP_parallel
is really invoking main._omp_fn.0 with a .omp_data_o.1 argument.

How many different ways of IL do we get doing this kind of indirect
function invocations?


[Bug tree-optimization/46032] openmp inhibits loop vectorization

2015-05-26 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=46032

Richard Biener rguenth at gcc dot gnu.org changed:

   What|Removed |Added

 CC||hubicka at gcc dot gnu.org

--- Comment #13 from Richard Biener rguenth at gcc dot gnu.org ---
(In reply to Richard Biener from comment #12)
 (In reply to vries from comment #11)
  The ipa-pta solution no longer works. In 4.6, we had:
  ...
# USE = anything
# CLB = anything
GOMP_parallel_startD.1048 (main._omp_fn.0D.1472, .omp_data_o.1D.1484, 0);
# USE = anything
# CLB = anything
main._omp_fn.0D.1472 (.omp_data_o.1D.1484);
# USE = anything
# CLB = anything
GOMP_parallel_endD.1049 ();
  ...
  
  On trunk, we have now:
  ...
# USE = anything
# CLB = anything
GOMP_parallelD.1345 (main._omp_fn.0D.1844, .omp_data_o.1D.1856, 0, 0);
  ...
  
  So there's no longer a path in the call graph from main to main._omp_fn.
  Perhaps a dummy body for GOMP_parallel could fix that.
 
 Hm?  The IPA PTA solution was to tell IPA PTA that the call to
 GOMP_parallel
 doesn't make .omp_data_o escape.
 
 The attached patch doesn't work because it only patches GOMP_parallel_start,
 not GOMP_parallel.
 
 Of course it would even better to teach IPA PTA that GOMP_parallel
 is really invoking main._omp_fn.0 with a .omp_data_o.1 argument.
 
 How many different ways of IL do we get doing this kind of indirect
 function invocations?

Other IPA propagators like IPA-CP probably also would like to know this.

I see various builtins taking an OMPFN argument in omp-builtins.def.  If we
assume the GOMP runtime itself is transparent then do we know how the
builtins end up calling the actual implementation function?


[Bug tree-optimization/46032] openmp inhibits loop vectorization

2015-05-26 Thread vries at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=46032

--- Comment #16 from vries at gcc dot gnu.org ---
(In reply to Richard Biener from comment #12)
 (In reply to vries from comment #11)
  The ipa-pta solution no longer works. In 4.6, we had:
  ...
# USE = anything
# CLB = anything
GOMP_parallel_startD.1048 (main._omp_fn.0D.1472, .omp_data_o.1D.1484, 0);
# USE = anything
# CLB = anything
main._omp_fn.0D.1472 (.omp_data_o.1D.1484);
# USE = anything
# CLB = anything
GOMP_parallel_endD.1049 ();
  ...
  
  On trunk, we have now:
  ...
# USE = anything
# CLB = anything
GOMP_parallelD.1345 (main._omp_fn.0D.1844, .omp_data_o.1D.1856, 0, 0);
  ...
  
  So there's no longer a path in the call graph from main to main._omp_fn.
  Perhaps a dummy body for GOMP_parallel could fix that.
 
 Hm?  The IPA PTA solution was to tell IPA PTA that the call to
 GOMP_parallel

[ GOMP_parallel_start ]

 doesn't make .omp_data_o escape.
 

Right, for 4.6, adding fnspec .rw to GOMP_parallel_start has this effect in
ipa-pta:
...
D.1505_14 = { ESCAPED NONLOCAL pData }
D.1509_18 = { ESCAPED NONLOCAL results }
--
D.1505_14 = { pData }
D.1509_18 = { results }
...

where _14 and _18 are the omp_data_i relative loads in the split-off function:
...
  # VUSE .MEMD.1514_20
  # PT = nonlocal
  D.1505_14 = .omp_data_iD.1474_13(D)->pDataD.1477;

  # VUSE .MEMD.1514_20
  D.1506_15 = *D.1505_14[idxD.1495_1];

  ...

  # VUSE .MEMD.1514_20
  # PT = nonlocal
  D.1509_18 = .omp_data_iD.1474_13(D)->resultsD.1479;

  # .MEMD.1514_22 = VDEF .MEMD.1514_20
  *D.1509_18[idxD.1495_1] = D.1508_17;
...


 The attached patch doesn't work because it only patches GOMP_parallel_start,
 not GOMP_parallel.
 

[ GOMP_parallel_start is no longer around on trunk. ] Applying the 4.6 patch on
trunk (and dropping the loop in the hunk for intra_create_variable_infos that
does not apply cleanly anymore) and applying fnspec .rw on GOMP_parallel,
gives us in ipa-pta:
...
_17 = { }
_21 = { }
...

where _17 and _21 are the omp_data_i relative loads in the split-off function:
...
  # VUSE .MEM_4
  # PT = nonlocal escaped
  _17 = MEM[(struct .omp_data_s.0D.1713 *).omp_data_i_16(D) clique 1 base
1].pDataD.1719;

  # VUSE .MEM_4
  _18 = *_17[idx_1];

  # VUSE .MEM_4
  # PT = nonlocal escaped
  _21 = MEM[(struct .omp_data_s.0D.1713 *).omp_data_i_16(D) clique 1 base
1].resultsD.1721;

  # .MEM_22 = VDEF .MEM_4
  *_21[idx_1] = _20;
...

It is reasonable to assume that we no longer are able to relate back these
loads in the split-off function to pData and result in the donor function, due
to the fact that there's no longer a direct function call to main._omp_fn in
the donor function.

On 4.6, that direct function call to main._omp_fn still existed. On trunk, not
anymore.


[Bug tree-optimization/46032] openmp inhibits loop vectorization

2015-05-26 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=46032

--- Comment #14 from Jakub Jelinek jakub at gcc dot gnu.org ---
(In reply to Richard Biener from comment #13)
   So there's no longer a path in the call graph from main to main._omp_fn.
   Perhaps a dummy body for GOMP_parallel could fix that.
  
  Hm?  The IPA PTA solution was to tell IPA PTA that the call to
  GOMP_parallel
  doesn't make .omp_data_o escape.
  
  The attached patch doesn't work because it only patches GOMP_parallel_start,
  not GOMP_parallel.
  
  Of course it would even better to teach IPA PTA that GOMP_parallel
  is really invoking main._omp_fn.0 with a .omp_data_o.1 argument.
  
  How many different ways of IL do we get doing this kind of indirect
  function invocations?
 
 Other IPA propagators like IPA-CP probably also would like to know this.
 
 I see various builtins taking an OMPFN argument in omp-builtins.def.  If we
 assume the GOMP runtime itself is transparent then do we know how the
 builtins end up calling the actual implementation function?

GOMP_parallel* call the ompfn function (first argument) with the second
argument (pointer to some structure filled before GOMP_parallel* and dead
(using a clobber) after the call) as the only argument.  The callback function
can be called just once or more times (once in each thread).
Then there is GOMP_task*, where there is one or two callback functions,
if just one (the other one is NULL), then either the first callback function
(1st argument) is called with the second argument as the only argument, or
with a pointer to a memory block that was filled with memcpy from the second
argument.  If the third argument (second callback) is non-NULL, then that
callback is called instead of the memcpy and the pointers can be to two
different structures.
GOMP_target is another case, but there is often a cross-device boundary in
between the two, so it is much harder to model that for IPA-PTA etc. purposes.
So, schematically, GOMP_parallel* (fn1, data1, ...) performs:
if (somecond)
  for (...)
pthread_create (..., fn1, data1);
fn1 (data1);
if (somecond)
  for (...)
pthread_join (...);
and GOMP_task (fn1, data1, fn2, ...) performs:
if (fn2 == 0 && somecond1)
  fn1 (data1);
else
  {
char *buf = malloc (...); // or alloca/vla
if (fn2 == 0)

  }


[Bug tree-optimization/46032] openmp inhibits loop vectorization

2015-05-26 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=46032

--- Comment #15 from Jakub Jelinek jakub at gcc dot gnu.org ---
and GOMP_task (fn1, data1, fn2, ...) performs:
if (somecond)
  {
if (fn2 == 0)
  fn1 (data1);
else
  {
void *buf = alloca (...); // Takes care also about alignment
fn2 (buf, data1);
fn1 (buf);
  }
  }
else
  {
void *buf = malloc (...); // Takes care also about alignment
if (fn2 == 0)
  memcpy (buf, data1, ...);
else
  fn2 (buf, data1);
// Arrange for fn1 (buf); to be called at some point later (like C++
futures)
  }
The purpose of fn2 is to run copy constructors of the vars, for vars that will
be residing within the buf.
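A minimal, self-contained C sketch of the fn1/fn2 contract described above (hypothetical: run_task, demo, body and the immediate "queueing" are stand-ins; the real libgomp machinery handles threads, alignment, and genuinely deferred execution):

```c
#include <stdlib.h>
#include <string.h>

typedef void (*task_fn)(void *);
typedef void (*copy_fn)(void *dst, void *src);

/* Sketch of the GOMP_task contract: fn2 (if present) copy-constructs the
   captured data into the task's buffer, then fn1 runs the body on it.  */
static void run_task(task_fn fn1, void *data1, copy_fn fn2, size_t size,
                     int deferred)
{
  if (!deferred)
    {
      if (fn2 == 0)
        fn1 (data1);                   /* run directly on caller's data */
      else
        {
          void *buf = malloc (size);   /* stand-in for the aligned alloca */
          fn2 (buf, data1);
          fn1 (buf);
          free (buf);
        }
    }
  else
    {
      void *buf = malloc (size);
      if (fn2 == 0)
        memcpy (buf, data1, size);     /* plain bitwise capture */
      else
        fn2 (buf, data1);              /* copy "constructor" */
      fn1 (buf);  /* real runtime: queued and run later, futures-style */
      free (buf);
    }
}

static int result;
static void body(void *p) { result = *(int *)p * 2; }

/* Drive the deferred path: the body sees a captured copy of v.  */
static int demo(int v) { run_task (body, &v, 0, sizeof v, 1); return result; }
```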


[Bug tree-optimization/46032] openmp inhibits loop vectorization

2015-05-24 Thread vries at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=46032

--- Comment #10 from vries at gcc dot gnu.org ---
An observation. A patch like this allows vectorization without alias check:
...
diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index 8290a65..501d631 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -1241,7 +1241,12 @@ install_var_field (tree var, bool by_ref, int mask,
omp_context *ctx)
   type = build_pointer_type (build_pointer_type (type));
 }
   else if (by_ref)
+#if 0
 type = build_pointer_type (type);
+#else
+type = build_qualified_type (build_reference_type (type),
+TYPE_QUAL_RESTRICT);
+#endif
   else if ((mask & 3) == 1 && is_reference (var))
 type = TREE_TYPE (type);

...

The problem is that we don't have information at this point to decide between
pointer and restrict reference. If var were a scalar, we could use
addr_taken to ensure that the var is not aliased. For arrays that doesn't work.
If the C frontend would distinguish between:
- element read: result[x], and
- alias created: &result, result, &result[x]
and store that in an alias_created property, we could use that property to
decide between pointer and restrict reference.

That would not fix the problem in general though, since that solution would
already no longer work if the example was rewritten using pointers.

I wonder if postponing omp_expand till after ealias would give us enough
information to update the field reference types with a restrict tag (or not) at
that point. [ Though I'm not sure if doing that update there would actually
have the desired effect. ]
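In source terms, the restrict-reference variant of install_var_field aims at the situation below (a hedged C analogy, not generated code; the struct and field names are illustrative assumptions):

```c
/* Analogy of the install_var_field change above: the by-reference fields
   of the marshalling struct behave as if restrict-qualified, so loads
   through pData and stores through results cannot alias, and the loop
   can vectorize without a runtime alias check.  */
struct omp_data {                 /* names are illustrative assumptions */
  double *restrict pData;
  double *restrict results;
  double coeff;
};

void omp_fn(struct omp_data *restrict d, int n)
{
  for (int idx = 0; idx < n; idx++)
    d->results[idx] = d->pData[idx] * d->coeff;
}
```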


[Bug tree-optimization/46032] openmp inhibits loop vectorization

2015-05-24 Thread vries at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=46032

--- Comment #11 from vries at gcc dot gnu.org ---
The ipa-pta solution no longer works. In 4.6, we had:
...
  # USE = anything
  # CLB = anything
  GOMP_parallel_startD.1048 (main._omp_fn.0D.1472, &.omp_data_o.1D.1484, 0);
  # USE = anything
  # CLB = anything
  main._omp_fn.0D.1472 (&.omp_data_o.1D.1484);
  # USE = anything
  # CLB = anything
  GOMP_parallel_endD.1049 ();
...

On trunk, we have now:
...
  # USE = anything
  # CLB = anything
  GOMP_parallelD.1345 (main._omp_fn.0D.1844, &.omp_data_o.1D.1856, 0, 0);
...

So there's no longer a path in the call graph from main to main._omp_fn.
Perhaps a dummy body for GOMP_parallel could fix that.


[Bug tree-optimization/46032] openmp inhibits loop vectorization

2014-08-18 Thread vries at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=46032

vries at gcc dot gnu.org changed:

   What|Removed |Added

 CC||vries at gcc dot gnu.org

--- Comment #9 from vries at gcc dot gnu.org ---
Created attachment 33348
  -- https://gcc.gnu.org/bugzilla/attachment.cgi?id=33348&action=edit
patch for 4.6 branch

- patches from comment 1 and 3
- c-common.c patch re-applied to lto/lto-lang.c to fix lto buildbreaker
- testcases added.

Patch applies to 4.6 branch, non-bootstrap build succeeds and added test-cases
pass.

In 4.7 branch, this snippet:
...
@@ -5612,6 +5611,12 @@ intra_create_variable_infos (void)
  rhsc.offset = 0;
  process_constraint (new_constraint (lhsc, rhsc));
  vi->is_restrict_var = 1;
+ do
+   {
+ make_constraint_from (vi, nonlocal_id);
+ vi = vi->next;
+   }
+ while (vi);
  continue;
}
... 

conflicts with: 
...
  rhsc.offset = 0;
  process_constraint (new_constraint (lhsc, rhsc));
  for (; vi; vi = vi->next)
if (vi->may_have_pointers)
  {
if (vi->only_restrict_pointers)
  make_constraint_from_global_restrict (vi, GLOBAL_RESTRICT);
else
  make_copy_constraint (vi, nonlocal_id);
  }
  continue;
}
...


[Bug tree-optimization/46032] openmp inhibits loop vectorization

2014-04-29 Thread dominiq at lps dot ens.fr
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46032

--- Comment #8 from Dominique d'Humieres dominiq at lps dot ens.fr ---
See also pr60997.


[Bug tree-optimization/46032] openmp inhibits loop vectorization

2012-07-06 Thread fchen0000 at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46032

Feng Chen fchen at gmail dot com changed:

   What|Removed |Added

 CC||fchen at gmail dot com

--- Comment #7 from Feng Chen fchen at gmail dot com 2012-07-06 16:17:28 
UTC ---
Any update on this? I do see loops getting slower even for large nx*ny
sometimes after omp on gcc 4.6.2, e.g.,

#pragma omp parallel for
for(int iy=0; iy<ny; iy++) {
  for(int ix=0; ix<nx; ix++) {
dest[(size_t)iy*nx + ix] = src[(size_t)iy*nx + ix] * 2;
  }
}

Sometimes gcc won't vectorize the inner loop, i have to put it into an inline
function to force it.  The performance is only marginally better after that.
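The inline-function workaround can be sketched as follows (a generic reconstruction, not the reporter's actual code; `restrict` is an extra assumption that the rows do not overlap):

```c
#include <stddef.h>

/* Splitting the inner loop into its own function gives the vectorizer a
   simple, self-contained loop; restrict asserts dst and src don't alias.  */
static inline void scale_row(double *restrict dst,
                             const double *restrict src, int nx)
{
  for (int ix = 0; ix < nx; ix++)
    dst[ix] = src[ix] * 2;
}

void scale_all(double *dest, const double *src, int ny, int nx)
{
  /* #pragma omp parallel for  -- parallelize over rows as in the report */
  for (int iy = 0; iy < ny; iy++)
    scale_row(dest + (size_t)iy * nx, src + (size_t)iy * nx, nx);
}
```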
ps: I break the loop because I noticed previously that omp parallel inhibits
auto-vectorization, forgot which gcc version I used ...

Graphite did improve the scalability of openmp programs from my experience, so
the fix (with tests) is important ...

(In reply to comment #6)
 Good. But if Graphite breaks it, let's add Sebastian in CC..


[Bug tree-optimization/46032] openmp inhibits loop vectorization

2011-07-26 Thread vincenzo.innocente at cern dot ch
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46032

--- Comment #5 from vincenzo Innocente vincenzo.innocente at cern dot ch 
2011-07-26 13:00:18 UTC ---
In case anybody is wondering,
it seems fixed in
Using built-in specs.
COLLECT_GCC=c++
COLLECT_LTO_WRAPPER=/afs/cern.ch/user/i/innocent/w2/libexec/gcc/x86_64-unknown-linux-gnu/4.7.0/lto-wrapper
Target: x86_64-unknown-linux-gnu
Configured with: ./configure --prefix=/afs/cern.ch/user/i/innocent/w2
--enable-languages=c,c++,fortran -enable-gold=yes --enable-lto
--with-build-config=bootstrap-lto --with-gmp-lib=/usr/local/lib64
--with-mpfr-lib=/usr/local/lib64 -with-mpc-lib=/usr/local/lib64
--enable-cloog-backend=isl --with-cloog=/usr/local
--with-ppl-lib=/usr/local/lib64 CFLAGS='-O2 -ftree-vectorize -fPIC'
CXXFLAGS='-O2 -fPIC -ftree-vectorize -fvisibility-inlines-hidden'
Thread model: posix
gcc version 4.7.0 20110725 (experimental) (GCC) 


c++ -std=gnu++0x -DNDEBUG -Wall -Ofast -mavx openmpvector.cpp
-ftree-vectorizer-verbose=7 -fopenmp
openmpvector.cpp:11: note: versioning for alias required: can't determine
dependence between *pretmp.11_32[idx_3] and *pretmp.11_34[idx_3]
openmpvector.cpp:11: note: mark for run-time aliasing test between
*pretmp.11_32[idx_3] and *pretmp.11_34[idx_3]
openmpvector.cpp:11: note: versioning for alias required: can't determine
dependence between .omp_data_i_14(D)->coeff and *pretmp.11_34[idx_3]
openmpvector.cpp:11: note: mark for run-time aliasing test between
.omp_data_i_14(D)->coeff and *pretmp.11_34[idx_3]
openmpvector.cpp:11: note: Unknown alignment for access: *pretmp.11_32
openmpvector.cpp:11: note: Unknown alignment for access: *pretmp.11_34
openmpvector.cpp:11: note: Vectorizing an unaligned access.
openmpvector.cpp:11: note: Vectorizing an unaligned access.
openmpvector.cpp:11: note: Vectorizing an unaligned access.
openmpvector.cpp:11: note: vect_model_load_cost: unaligned supported by
hardware.
openmpvector.cpp:11: note: vect_model_load_cost: inside_cost = 2, outside_cost
= 0 .
openmpvector.cpp:11: note: vect_model_load_cost: unaligned supported by
hardware.
openmpvector.cpp:11: note: vect_model_load_cost: inside_cost = 2, outside_cost
= 0 .
openmpvector.cpp:11: note: vect_model_simple_cost: inside_cost = 1,
outside_cost = 0 .
openmpvector.cpp:11: note: vect_model_store_cost: unaligned supported by
hardware.
openmpvector.cpp:11: note: vect_model_store_cost: inside_cost = 2, outside_cost
= 0 .
openmpvector.cpp:11: note: cost model: Adding cost of checks for loop
versioning aliasing.

openmpvector.cpp:11: note: cost model: epilogue peel iters set to vf/2 because
loop iterations are unknown .
openmpvector.cpp:11: note: Cost model analysis: 
  Vector inside of loop cost: 7
  Vector outside of loop cost: 19
  Scalar iteration cost: 4
  Scalar outside cost: 1
  prologue iterations: 0
  epilogue iterations: 2
  Calculated minimum iters for profitability: 7

openmpvector.cpp:11: note:   Profitability threshold = 6

openmpvector.cpp:11: note: Profitability threshold is 6 loop iterations.
openmpvector.cpp:11: note: create runtime check for data references
*pretmp.11_32[idx_3] and *pretmp.11_34[idx_3]
openmpvector.cpp:11: note: create runtime check for data references
.omp_data_i_14(D)-coeff and *pretmp.11_34[idx_3]
openmpvector.cpp:11: note: created 2 versioning for alias checks.

openmpvector.cpp:11: note: LOOP VECTORIZED.
openmpvector.cpp:9: note: vectorized 1 loops in function.



graphite breaks it…. 
c++ -std=gnu++0x -DNDEBUG -Wall -Ofast -mavx openmpvector.cpp
-ftree-vectorizer-verbose=7 -fopenmp -fgraphite -fgraphite-identity
-floop-block -floop-flatten -floop-interchange -floop-strip-mine
-ftree-loop-linear -floop-parallelize-all

openmpvector.cpp:9: note: not vectorized: data ref analysis failed D.2372_47 =
*pretmp.11_32[D.2403_49];

openmpvector.cpp:9: note: vectorized 0 loops in function.


[Bug tree-optimization/46032] openmp inhibits loop vectorization

2011-07-26 Thread paolo.carlini at oracle dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46032

Paolo Carlini paolo.carlini at oracle dot com changed:

   What|Removed |Added

 CC||spop at gcc dot gnu.org

--- Comment #6 from Paolo Carlini paolo.carlini at oracle dot com 2011-07-26 
13:47:35 UTC ---
Good. But if Graphite breaks it, let's add Sebastian in CC..


[Bug tree-optimization/46032] openmp inhibits loop vectorization

2010-10-15 Thread rguenth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46032

Richard Guenther rguenth at gcc dot gnu.org changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
   Last reconfirmed||2010.10.15 10:08:39
 CC||jakub at gcc dot gnu.org,
   ||rguenth at gcc dot gnu.org
 Ever Confirmed|0   |1

--- Comment #1 from Richard Guenther rguenth at gcc dot gnu.org 2010-10-15 
10:08:39 UTC ---
The problem is that local variables are accessed indirectly via the
.omp_data_i pointer and alias analysis is unable to hoist the load of
.omp_data_i_12(D)->coeff across the store to *pretmp.5_27[idx_1].

A fix is to make the argument DECL_BY_REFERENCE and the
type restrict qualified.  This will  make alias analysis assume that
the pointed-to object is not aliased unless later somebody takes its
address.
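For reference, a loop of roughly this shape (a hypothetical reconstruction from the field names pData, results and coeff; the actual testcase is not shown in this thread) produces the GIMPLE below:

```c
/* Hypothetical reconstruction of the kind of loop under discussion;
   pData, results and coeff match the field names in the GIMPLE dump.  */
void compute(int n, const double *pData, double *results, double coeff)
{
  #pragma omp parallel for
  for (int idx = 0; idx < n; idx++)
    results[idx] = pData[idx] * coeff;
}
```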

<bb 3>:
  pretmp.5_23 = .omp_data_i_12(D)->pData;
  pretmp.5_27 = .omp_data_i_12(D)->results;

<bb 4>:
  # idx_1 = PHI <idx_8(3), idx_18(5)>
  D.2142_14 = *pretmp.5_23[idx_1];
  D.2143_15 = .omp_data_i_12(D)->coeff;
  D.2144_16 = D.2142_14 * D.2143_15;
  *pretmp.5_27[idx_1] = D.2144_16;
  idx_18 = idx_1 + 1;
  if (D.2139_10 > idx_18)
    goto <bb 5>;
  else
    goto <bb 6>;

<bb 5>:
  goto <bb 4>;


Not completely enough though, as we consider *.omp_data_i escaped
(and thus reachable by NONLOCAL).

The following fixes that (with unknown consequences, I think fortran
array descriptors are the only other user):

Index: gcc/omp-low.c
===
--- gcc/omp-low.c   (revision 165474)
+++ gcc/omp-low.c   (working copy)
@@ -1349,7 +1349,8 @@ fixup_child_record_type (omp_context *ct
   layout_type (type);
 }

-  TREE_TYPE (ctx->receiver_decl) = build_pointer_type (type);
+  TREE_TYPE (ctx->receiver_decl)
+    = build_qualified_type (build_pointer_type (type), TYPE_QUAL_RESTRICT);
 }

 /* Instantiate decls as necessary in CTX to satisfy the data sharing
@@ -1584,6 +1585,7 @@ create_omp_child_function (omp_context *
   DECL_NAMELESS (t) = 1;
   DECL_ARG_TYPE (t) = ptr_type_node;
   DECL_CONTEXT (t) = current_function_decl;
+  DECL_BY_REFERENCE (t) = 1;
   TREE_USED (t) = 1;
   DECL_ARGUMENTS (decl) = t;
   if (!task_copy)
Index: gcc/tree-ssa-structalias.c
===
--- gcc/tree-ssa-structalias.c  (revision 165474)
+++ gcc/tree-ssa-structalias.c  (working copy)
@@ -5575,7 +5575,6 @@ intra_create_variable_infos (void)
  var_ann_t ann;
  heapvar = create_tmp_var_raw (TREE_TYPE (TREE_TYPE (t)),
PARM_NOALIAS);
- DECL_EXTERNAL (heapvar) = 1;
  heapvar_insert (t, 0, heapvar);
  ann = get_var_ann (heapvar);
  ann->is_heapvar = 1;
@@ -5590,6 +5589,12 @@ intra_create_variable_infos (void)
  rhsc.offset = 0;
  process_constraint (new_constraint (lhsc, rhsc));
  vi->is_restrict_var = 1;
+ do
+   {
+ make_constraint_from (vi, nonlocal_id);
+ vi = vi->next;
+   }
+ while (vi);
  continue;
}


it means that stores to *.omp_data_i in the omp fn are considered not
escaping to the caller (and thus can be DSEd).  With the above patch
the loop is vectorized with a runtime alias check, as we can't
see that results and pData do not alias.  Not even with IPA-PTA as
the OMP function escapes through __builtin_GOMP_parallel_start.


[Bug tree-optimization/46032] openmp inhibits loop vectorization

2010-10-15 Thread rguenth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46032

--- Comment #2 from Richard Guenther rguenth at gcc dot gnu.org 2010-10-15 
10:30:42 UTC ---
If I hack PTA to make the omp function not escape IPA-PTA computes

<bb 4>:
  # idx_1 = PHI <idx_11(3), idx_18(6)>
  # PT = { D.2069 }
  D.2112_13 = .omp_data_i_12(D)->pData;
  D.2113_14 = *D.2112_13[idx_1];
  D.2114_15 = .omp_data_i_12(D)->coeff;
  D.2115_16 = D.2113_14 * D.2114_15;
  # PT = { D.2068 }
  D.2116_17 = .omp_data_i_12(D)->results;

thus knows what the pointers point to and we vectorize w/o a runtime
alias check (we still have no idea about alignment though, but that's
probably correct).

Thus it might be worth annotating some of the OMP builtins with
the fnspec attribute.


[Bug tree-optimization/46032] openmp inhibits loop vectorization

2010-10-15 Thread rguenth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46032

--- Comment #3 from Richard Guenther rguenth at gcc dot gnu.org 2010-10-15 
11:51:58 UTC ---
Created attachment 22053
  -- http://gcc.gnu.org/bugzilla/attachment.cgi?id=22053
fnspec attr test

Like this (ugh).  Fixes the thing with -fipa-pta on trunk.


[Bug tree-optimization/46032] openmp inhibits loop vectorization

2010-10-15 Thread rguenth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46032

--- Comment #4 from Richard Guenther rguenth at gcc dot gnu.org 2010-10-15 
12:09:38 UTC ---
A few things to consider:

  __builtin_GOMP_parallel_start (main._omp_fn.0, &.omp_data_o.1, 0);
  main._omp_fn.0 (&.omp_data_o.1);
  __builtin_GOMP_parallel_end ();

for PTA purposes we can ignore that __builtin_GOMP_parallel_start calls
main._omp_fn.0 and I suppose the function pointer doesn't escape through
it.  We can't assume that .omp_data_o.1 does not escape through
__builtin_GOMP_parallel_start though, as __builtin_GOMP_parallel_end needs
to be a barrier for optimization for it (and thus needs to be considered
reading and writing .omp_data_o.1).  As it doesn't take any arguments
the only way to ensure that is by making .omp_data_o.1 escape.  We could
probably arrange for __builtin_GOMP_parallel_end to get .omp_data_o.1
as argument solely for alias-analysis purposes though.  In that case
we could use .xw for __builtin_GOMP_parallel_start and .w for
__builtin_GOMP_parallel_end.