[Bug fortran/69368] [6 Regression] spec2006 test case 416.gamess fails with the g++ 6.0 compiler starting with r232508

2016-03-11 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69368

alalaw01 at gcc dot gnu.org changed:

   What|Removed |Added

   Assignee|alalaw01 at gcc dot gnu.org|unassigned at gcc dot gnu.org

--- Comment #88 from alalaw01 at gcc dot gnu.org ---
Can this now be closed, or should I leave it open for possible Fortran FE
warnings?

[Bug target/63679] [5/6 Regression][AArch64] Failure to constant fold.

2016-03-11 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63679

alalaw01 at gcc dot gnu.org changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #43 from alalaw01 at gcc dot gnu.org ---
I think this can be closed now? I've raised PR/70189 for the followup
enhancement.

[Bug middle-end/70189] New: Combine constant-pool logic from gimplify + SRA

2016-03-11 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70189

    Bug ID: 70189
   Summary: Combine constant-pool logic from gimplify + SRA
   Product: gcc
   Version: 6.0
    Status: UNCONFIRMED
  Severity: enhancement
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: alalaw01 at gcc dot gnu.org
  Target Milestone: ---

Following PR/63679 (r232506), gimplify.c (gimplify_init_constructor) uses lots
of heuristics to choose between pushing initializers out to the constant pool
(by calling tree_output_constant_def) or outputting many elementwise
statements. Then, in tree-sra.c (analyze_all_variable_accesses), we use more
heuristics to decide which constant-pool loads to completely_scalarize,
turning those back into elementwise statements. (These get pulled back in from
the constant pool and the constant-pool entry is deleted.) Both of these sets of
heuristics are platform dependent (gimplify.c uses can_move_by_pieces,
CLEAR_RATIO; tree-sra.c uses get_move_ratio).

Instead we should put all this logic in one place; this would make it clearer,
and we'd probably get better overall decisions. The suggestion is for
gimplify.c to always push out to the constant pool, as this makes the initial
tree the same on all platforms, and for all the logic/heuristics to go into SRA
(as, being later, it has more information available and may be able to make
better decisions in the future).
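
To make the two alternatives concrete, here is a hand-written sketch in C of
what the two lowerings amount to. The struct, the values and the function names
are made up for illustration; which form GCC actually picks for a given
initializer depends on the target heuristics mentioned above.

```c
#include <string.h>

struct point3 { int x, y, z; };

/* What the source says: a local aggregate with a constant initializer.  */
int
sum_init (void)
{
  struct point3 p = { 1, 2, 3 };
  return p.x + p.y + p.z;
}

/* Hand-written stand-in for the "constant pool" form: the initializer
   lives in read-only data and is block-copied into the local.  */
static const struct point3 pool_entry = { 1, 2, 3 };

int
sum_pool (void)
{
  struct point3 p;
  memcpy (&p, &pool_entry, sizeof p);
  return p.x + p.y + p.z;
}

/* Hand-written stand-in for the "elementwise" form: one store per field.  */
int
sum_elementwise (void)
{
  struct point3 p;
  p.x = 1;
  p.y = 2;
  p.z = 3;
  return p.x + p.y + p.z;
}
```

All three functions compute the same value; the difference is only in how the
initial contents of p get there, which is what the two sets of heuristics are
deciding.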

[Bug tree-optimization/70013] [6 Regression] packed structure tree-sra loses initialization

2016-03-11 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70013

--- Comment #13 from alalaw01 at gcc dot gnu.org ---
Author: alalaw01
Date: Fri Mar 11 12:08:01 2016
New Revision: 234138

URL: https://gcc.gnu.org/viewcvs?rev=234138&root=gcc&view=rev
Log:
Fix PR/70013

gcc:

PR tree-optimization/70013
* tree-sra.c (analyze_access_subtree): Also set grp_unscalarized_data
for constant-pool entries.

gcc/testsuite:

* gcc.dg/tree-ssa/sra-20.c: New.

Added:
trunk/gcc/testsuite/gcc.dg/tree-ssa/sra-20.c
Modified:
trunk/gcc/ChangeLog
trunk/gcc/testsuite/ChangeLog
trunk/gcc/tree-sra.c

[Bug tree-optimization/70013] [6 Regression] packed structure tree-sra loses initialization

2016-03-11 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70013

--- Comment #12 from alalaw01 at gcc dot gnu.org ---
Thanks, Martin - yes, I see.

Patch posted at https://gcc.gnu.org/ml/gcc-patches/2016-03/msg00680.html after
full regtest.

[Bug tree-optimization/67681] Missed vectorization: induction variable used after loop

2016-03-10 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67681

--- Comment #8 from alalaw01 at gcc dot gnu.org ---
Indeed, the -DFOO=1 case vectorizes with -fno-tree-dominator-opts.

[Bug tree-optimization/67681] Missed vectorization: induction variable used after loop

2016-03-10 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67681

--- Comment #7 from alalaw01 at gcc dot gnu.org ---
Looking at where the peeling happens. In both -DFOO=0 and -DFOO=1 cases,
107.ch2 peels the inner loop header, so there is an i<=max test in the outer
loop before the inner loop. However, in the -DFOO=1 case, this is dominated by
the extra i>max test (that breaks out of the outer loop), so 110.dom2 removes
the peeled i<=max.

Thus, just before sccp, in the -DFOO=0 case, we have:

  :
  # i_25 = PHI 
  # j_26 = PHI 
  max_7 = 1 << j_26;
  if (max_7 >= i_25)
goto ;
  else
goto ; //skip inner loop

  : //inner loop header
  # i_2 = PHI 
  _8 = (long unsigned int) i_2;
  _9 = _8 * 4;
  _11 = data_10(D) + _9;
  _12 = *_11;
  _13 = _12 + j_26;
  *_11 = _13;
  i_15 = i_2 + 1;
  if (max_7 >= i_15)
goto ; //cleaned, actually via latch
  else
goto ;

Note the inner loop exits if !(max_7 >= i_15), and when we hit the inner loop,
we know that (max_7 >= i_25). Whereas in the -DFOO=1 case:
  :
  goto ;

  : //in outer loop
  max_7 = 1 << j_17;
  if (max_7 < i_32)
goto ;
  else
goto ;

  : //outer loop header
  # max_24 = PHI 
  # i_22 = PHI 
  # j_23 = PHI 

  : //inner loop header
  # i_27 = PHI 
  _8 = (long unsigned int) i_27;
  _9 = _8 * 4;
  _11 = data_10(D) + _9;
  _13 = *_11;
  _14 = _13 + j_23;
  *_11 = _14;
  i_16 = i_27 + 1;
  if (i_16 <= max_24)
goto ; //cleaned, actually via latch
  else
goto ;

the inner loop exits if !(max_24 >= i_16), but max_24 is defined as PHI, and we only have that max_7max) break" out of the loop, such that the outer loop now executes
"if (i>max) break" after the inner loop (rather than testing "if (i>max) break"
before the inner loop, as it still did following 107.ch2). So as an
alternative, possibly tweaking the jump-threading/loop-peeling heuristics might
help (?).

[Bug tree-optimization/67681] Missed vectorization: induction variable used after loop

2016-03-09 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67681

--- Comment #5 from alalaw01 at gcc dot gnu.org ---
In the -DFOO=0 case, we have peeled an extra copy of the inner loop condition,
i <= max_7, above the loop. scalar evolution (final_value_replacement_loop)
works, because it sees the inner loop goes round niter = (unsigned int) max_7 -
(unsigned int) i_25 iterations, and compute_overall_effect_of_inner_loop gives
us

(int) (((unsigned int) i_25 + ((unsigned int) max_7 - (unsigned int) i_25)) +
1)

which is not expression_expensive_p, so we do it. Hence the add/subtract above.

When -DFOO=1, we have not done that peeling, so niter = i_22 <= max_24 ?
(unsigned int) max_24 - (unsigned int) i_22 : 0, and
compute_overall_effect_of_inner_loop gives us

(i_22 + 1) + (i_22 <= max_24 ? (int) ((unsigned int) max_24 - (unsigned int)
i_22) : 0)

which is expression_expensive_p, so we don't do the final value replacement.
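
The two closed forms above can be checked against a plain loop. The following
is a small sketch in C, assuming the inner loop has the do-while shape seen in
the dumps (body first, then the i <= max test); the function names are
invented for this illustration and are not GCC internals.

```c
/* Reference: run the (empty-bodied) inner loop and return i afterwards.
   The loop is in do-while form, as in the dumps.  */
int
final_by_loop (int i, int max)
{
  do
    i++;
  while (i <= max);
  return i;
}

/* -DFOO=0 shape: only valid when i0 <= max on entry, which the peeled
   test above the loop guarantees.  */
int
final_peeled (int i0, int max)
{
  unsigned int niter = (unsigned int) max - (unsigned int) i0;
  return (int) (((unsigned int) i0 + niter) + 1);
}

/* -DFOO=1 shape: no peeled test, so the iteration count itself carries
   the condition.  */
int
final_guarded (int i0, int max)
{
  return (i0 + 1)
	 + (i0 <= max ? (int) ((unsigned int) max - (unsigned int) i0) : 0);
}
```

The conditional inside final_guarded is the part that makes the expression
count as expensive; final_peeled has no conditional because the guard has
already been tested outside the loop.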

[Bug tree-optimization/70013] [6 Regression] packed structure tree-sra loses initialization

2016-03-09 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70013

--- Comment #10 from alalaw01 at gcc dot gnu.org ---
Hmmm, so this fixes the ICE, generating:

  SR.5_12 = MEM[(struct S0[2] *)&*.LC0].f0;
  MEM[(struct S0[2] *)&*.LC0].f0 = SR.5_12;
  d = *.LC0;
  d$3$f0_14 = MEM[(struct S0[2] *)&*.LC0 + 3B].f0;
  d$0$f0_7 = SR.5_12;
  e$f0_9 = d$3$f0_14;
  _3 = (int) d$0$f0_7;
  c = _3;
  _5 = (int) e$f0_9;
  __builtin_printf ("%x\n", _5);
  d ={v} {CLOBBER};
  return 0;

which in -fdump-tree-optimized (at -O1) looks like:

  SR.5_12 = MEM[(struct S0[2] *)&*.LC0].f0;
  d$3$f0_14 = MEM[(struct S0[2] *)&*.LC0 + 3B].f0;
  _3 = (int) SR.5_12;
  c = _3;
  _5 = (int) d$3$f0_14;
  __builtin_printf ("%x\n", _5);
  return 0;

which is much saner. But I don't really understand why the PARM_DECL case that
I'm adding to here is that way (since r147980 "New implementation of SRA" in
2009, https://gcc.gnu.org/ml/gcc-patches/2009-04/msg02218.html)...

Bootstrapped+regtest on AArch64 (c,c++) and ARM (c,c++,ada), no regressions.
(Constants don't get pushed into the pool on x86.)

diff --git a/gcc/tree-sra.c b/gcc/tree-sra.c
index 72157edd02e3235e57b786bbf460c94b0c52b2c5..24eac6ae7c4dcd41358b1a020047076afe1a8106 100644
--- a/gcc/tree-sra.c
+++ b/gcc/tree-sra.c
@@ -2427,7 +2427,8 @@ analyze_access_subtree (struct access *root, struct access *parent,

   if (!hole || root->grp_total_scalarization)
 root->grp_covered = 1;
-  else if (root->grp_write || TREE_CODE (root->base) == PARM_DECL)
+  else if (root->grp_write || TREE_CODE (root->base) == PARM_DECL
+  || constant_decl_p (root->base))
 root->grp_unscalarized_data = 1; /* not covered and written to */
   return sth_created;
 }
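
For readers without the tree-sra.c source to hand, the effect of the changed
condition can be modelled in a few lines of standalone C. This is a toy model
only: the struct, the enum and toy_mark are invented here, and the real
struct access has many more fields.

```c
#include <stdbool.h>

/* Toy stand-ins for the kinds of base the condition distinguishes.  */
enum toy_base { TOY_VAR_DECL, TOY_PARM_DECL, TOY_CONST_POOL };

struct toy_access
{
  bool hole;			/* children do not cover the whole access */
  bool grp_total_scalarization;
  bool grp_write;
  enum toy_base base;
  bool grp_covered;
  bool grp_unscalarized_data;
};

/* Model of the tail of analyze_access_subtree; with_fix selects the
   condition after the patch.  */
void
toy_mark (struct toy_access *root, bool with_fix)
{
  if (!root->hole || root->grp_total_scalarization)
    root->grp_covered = true;
  else if (root->grp_write || root->base == TOY_PARM_DECL
	   || (with_fix && root->base == TOY_CONST_POOL))
    root->grp_unscalarized_data = true;
}
```

In this model, a partially covered, never-written constant-pool entry is left
unmarked without the fix and gets grp_unscalarized_data with it, which is the
same treatment a PARM_DECL already received.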

[Bug tree-optimization/67681] Missed vectorization: induction variable used after loop

2016-03-09 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67681

--- Comment #4 from alalaw01 at gcc dot gnu.org ---
loopinit introduces the exit phi in much the same way for both -DFOO=0 and
-DFOO=1, so the difference is in sccp.

In the -DFOO=0 case, sccp does this (removing TODO_cleanup_cfg from
pass_data_scev_cprop to make the diff easier, still vectorizes):

 ;; Function addlog2 (addlog2, funcdef_no=0, decl_uid=2749, cgraph_uid=0,
symbol_order=0)

+
+final value replacement:
+  i_21 = PHI 
+  with
+  i_21 = (int) _3;
+
...[snip]...
   :
-  # i_21 = PHI 
+  _19 = (unsigned int) i_25;
+  _18 = (unsigned int) max_7;
+  _17 = (unsigned int) i_25;
+  _5 = _18 - _17;
+  _4 = _5 + _19;
+  _3 = _4 + 1;
+  i_21 = (int) _3;

In the -DFOO=1 case, sccp doesn't do anything; and adding -fno-tree-scev-cprop
prevents vectorization of the -DFOO=0 case.

[Bug tree-optimization/67681] Missed vectorization: induction variable used after loop

2016-03-09 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67681

--- Comment #3 from alalaw01 at gcc dot gnu.org ---
So in the not-vectorized case (-DFOO=1), we get for the inner loop:

:
  # i_27 = PHI 
  _8 = (long unsigned int) i_27;
  _9 = _8 * 4;
  _11 = data_10(D) + _9;
  _13 = *_11;
  _14 = _13 + j_23;
  *_11 = _14;
  i_16 = i_27 + 1;
  if (i_16 <= max_24)
goto ;
  else
goto ;

  :
  goto ;

  :
  # i_32 = PHI 

the loop exit phi, i_32=PHI, makes i_16=i_27+1 relevant
(vec_stmt_relevant_p: used out of loop.), so we go through that on the worklist
and then i_27=PHI, marking the phi as STMT_VINFO_LIVE_P, and
hence "not vectorized: value used after loop". Kind of as expected, FORNOW.

In the -DFOO=0 case, a bunch of loop peeling, header-copying, and other
transforms end up with this input to vectorization:

  : //header of inner loop
  # i_2 = PHI 
  _8 = (long unsigned int) i_2;
  _9 = _8 * 4;
  _11 = data_10(D) + _9;
  _12 = *_11;
  _13 = _12 + j_26;
  *_11 = _13;
  i_15 = i_2 + 1;
  if (max_7 >= i_15)
goto ;
  else
goto ;

  :
  goto ;

  : //bb 5 is only predecessor
  _19 = (unsigned int) i_25;
  _18 = (unsigned int) max_7;
  _17 = (unsigned int) i_25;
  _5 = _18 - _17;
  _4 = _5 + _19;
  _3 = _4 + 1;
  i_21 = (int) _3;

  :
  # i_23 = PHI 
  //tests outer loop

Note bb7 uses i_25, not i_2; so neither i_15 nor i_2 escapes the loop, and we
don't have the problem from above. (Yes, bb7 is taking i_25 away from max_7 and
then adding it back on again, before adding 1, to give the value of i after the
inner loop.)

This arrangement of multiple i's live at the same time is not present in
107t.ch2. 130t.loopinit introduces i_21, computed by an exit phi on leaving the
inner loop. 135t.sccp then changes this to the max_7-i_25+i_25 sequence, which
removes the dependency on i_15 and allows vectorization.
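
For orientation, here is a guess at the shape of the source being discussed,
reconstructed from the dumps (max = 1 << j, data[i] += j, i carried across
outer iterations, and an extra "if (i > max) break" under -DFOO=1). It is not
the PR's actual attachment: the signature, the nbits parameter and the return
value are assumptions added so the sketch stands alone.

```c
#ifndef FOO
#define FOO 0
#endif

/* Hypothetical reconstruction: i is live across the outer loop, so its
   value is needed after each run of the inner loop.  */
int
addlog2 (int *data, int nbits)
{
  int i = 0;
  for (int j = 0; j < nbits; j++)
    {
      int max = 1 << j;
#if FOO
      /* The extra test that dominates the peeled i <= max check.  */
      if (i > max)
	break;
#endif
      for (; i <= max; i++)
	data[i] += j;
    }
  return i;
}
```

With nbits = 4 the inner loop touches data[0] through data[8], so the caller
needs an array of at least (1 << (nbits - 1)) + 1 elements.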

[Bug tree-optimization/70013] [6 Regression] packed structure tree-sra loses initialization

2016-03-07 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70013

--- Comment #9 from alalaw01 at gcc dot gnu.org ---
In analyze_access_subtree (since r147980, "New implementation of SRA", 2009):

  else if (root->grp_write || TREE_CODE (root->base) == PARM_DECL)
root->grp_unscalarized_data = 1; /* not covered and written to */

Adding a case for constant_decl_p alongside the PARM_DECL case fixes the ICE;
AArch64 bootstrap in progress.

[Bug tree-optimization/70013] [6 Regression] packed structure tree-sra loses initialization

2016-03-07 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70013

--- Comment #7 from alalaw01 at gcc dot gnu.org ---
*second* half, sorry. grp_to_be_replaced is here true, but
grp_unscalarized_data is false, so handle_unscalarized_data_in_subtree sets
sad->refreshed=UDH_LEFT and we build the access to the LHS. (Then,
load_assign_lhs_subreplacements exits, and the caller sees UDH_LEFT and removes
the original block move statement.)

In contrast, on a similar testcase using a parameter rather than *.LC0,
grp_unscalarized_data is true, handle_unscalarized_data_in_subtree sets
sad->refreshed=UDH_RIGHT and we build an access to the RHS, which is OK; and
leave the block move statement in place, hence correctness.

[Bug tree-optimization/70013] [6 Regression] packed structure tree-sra loses initialization

2016-03-07 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70013

--- Comment #6 from alalaw01 at gcc dot gnu.org ---
Ugh, initializing the scalar replacement for the first half of d, with a value
read from the first half of d (should be from the first half of *.LC0).

[Bug tree-optimization/70013] [6 Regression] packed structure tree-sra loses initialization

2016-03-07 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70013

--- Comment #5 from alalaw01 at gcc dot gnu.org ---
Prior to SRA, we have
  d = *.LC0;
  d$0$f0_7 = MEM[(struct S0[2] *)&*.LC0].f0;
  e$f0_9 = MEM[(struct S0[2] *)&d + 3B].f0;
  _3 = (int) d$0$f0_7;
  c = _3;
  _5 = (int) e$f0_9;
  __builtin_printf ("%x\n", _5);

sra_modify_assign for d=*.LC0 ends up in load_assign_lhs_subreplacements, where
d has two children; the second is grp_to_be_replaced, but because we did not 
completely_scalarize LC0, there is an access to only the first half of *.LC0,
and no corresponding RHS for the second half of d ('racc =
find_access_in_subtree (sad->top_racc, offset, lacc->size' returns null). So we
generate the bad

d$3$f0_14 = MEM[(struct S0[2] *)&d + 3B].f0;

that is, initializing the scalar replacement for the second half of d, with a
value read from the first half of d.
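
A source of roughly the following shape would give a dump like the one above.
This is a reconstruction from the dump, not the testcase from the PR or the
committed sra-20.c: the 17-bit width is an assumption (the dump only shows
that each element is 3 bytes), and the function wrapper replaces the printf
call so the sketch can be checked directly.

```c
/* Packed so that each element occupies 3 bytes, matching the "+ 3B"
   offset in the dump.  The exact bit-field width is assumed.  */
#pragma pack(1)
struct S0
{
  unsigned f0 : 17;
};
#pragma pack()

int c;

unsigned
second_element (void)
{
  /* On targets that push this initializer to the constant pool, this
     becomes "d = *.LC0".  */
  struct S0 d[] = { { 1 }, { 2 } };
  struct S0 e = d[1];
  c = d[0].f0;
  return e.f0;
}
```

A correct compile returns 2 here and leaves c equal to 1; the bad access
described above reads the second element from d before d has been
initialized.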

[Bug tree-optimization/70013] [6 Regression] packed structure tree-sra loses initialization

2016-03-07 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70013

alalaw01 at gcc dot gnu.org changed:

   What|Removed |Added

 CC||alalaw01 at gcc dot gnu.org

--- Comment #4 from alalaw01 at gcc dot gnu.org ---
Hmmm. First thing I notice is that the type of d (struct S0[2]) is not
scalarizable_type_p, but passes type_internals_preclude_sra_p. Changing the
latter to bail out on DECL_BIT_FIELD (as the former does) fixes the ICE, but
I'm not yet sure we want to do that.

[Bug fortran/69368] [6 Regression] spec2006 test case 416.gamess fails with the g++ 6.0 compiler starting with r232508

2016-03-03 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69368

--- Comment #87 from alalaw01 at gcc dot gnu.org ---
Great, many thanks for the tests, I was worried that we had hit another distinct
issue. (Of course this would be better on gcc-patches!)

[Bug fortran/69368] [6 Regression] spec2006 test case 416.gamess fails with the g++ 6.0 compiler starting with r232508

2016-03-03 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69368

--- Comment #84 from alalaw01 at gcc dot gnu.org ---
Bah. Do you normally use -fno-aggressive-loop-optimizations? With
-funknown-commons, did you try with/out aggressive loop opts?
Powerpc{,64}{be,le} ?

The unknown-commons testcase I included in that patch looks to pass on
powerpc64le-unknown-linux-gnu.

Does HJ Lu's spec source-patching work on powerpc following r232559?

I am not a lawyer...but I don't think the SPEC2006 license allows me to upload
onto the GCC Compile Farm and runspec. So if you could narrow down to an object
file that's broken with a recent compiler and -funknown-commons, with the rest
compiled with a gcc prior to r232508, that'd be very helpful - then I could see
what assembly I'm changing (and what expressions equal_mem_array_ref is falsely
declaring equivalent)...?

[Bug bootstrap/60632] ICE in regcprop.c (copyprop_hardreg_forward_1)

2016-03-03 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60632

alalaw01 at gcc dot gnu.org changed:

   What|Removed |Added

 Status|WAITING |RESOLVED
 CC||alalaw01 at gcc dot gnu.org
 Resolution|--- |WORKSFORME

--- Comment #2 from alalaw01 at gcc dot gnu.org ---
Sorry, no idea...

[Bug fortran/69368] [6 Regression] spec2006 test case 416.gamess fails with the g++ 6.0 compiler starting with r232508

2016-03-02 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69368

--- Comment #82 from alalaw01 at gcc dot gnu.org ---
For those who haven't seen it, I've put forward this patch on the mailing list:
https://gcc.gnu.org/ml/gcc-patches/2016-02/msg01746.html based on a suggestion
from Jakub. (Unlike Richi's comment #72 patch, this fixes 416.gamess on AArch64.)

[Bug tree-optimization/65963] Missed vectorization of loads strided with << when equivalent * succeeds

2016-02-23 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65963

alalaw01 at gcc dot gnu.org changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #5 from alalaw01 at gcc dot gnu.org ---
Can I class this as fixed?

[Bug middle-end/66877] [6 Regression] FAIL: gcc.dg/vect/vect-over-widen-3-big-array.c -flto -ffat-lto-objects scan-tree-dump-times vect "vect_recog_over_widening_pattern: detected" 2

2016-02-23 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66877

alalaw01 at gcc dot gnu.org changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #8 from alalaw01 at gcc dot gnu.org ---
Fix committed r232720.

[Bug fortran/69368] [6 Regression] spec2006 test case 416.gamess fails with the g++ 6.0 compiler starting with r232508

2016-02-23 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69368

--- Comment #79 from alalaw01 at gcc dot gnu.org ---
(In reply to rguent...@suse.de from comment #78)
>
> That would pessimize it too much IMHO.

I'm not sure how to evaluate the pessimization, given it's thought to be a
widespread pseudo-FORTRAN construct; so I probably have to defer to your
judgement here. However...

Given maxsize of an array as two elements, say, would the compiler not be
entitled to optimize an index selection down to, say, computing only the LSBit
of the actual index?  Whereas 'unknown' means, well, exactly what is the case.
So I fear this is storing problems up for the future.

Is the concern that we can't hide this behind an option, as that would "drive
people away from gfortran" ? If that's the case, can we hide it behind an
option that defaults to pessimization (?? at least for fortran)??

[Bug fortran/69368] [6 Regression] spec2006 test case 416.gamess fails with the g++ 6.0 compiler starting with r232508

2016-02-23 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69368

--- Comment #77 from alalaw01 at gcc dot gnu.org ---
(In reply to rguent...@suse.de from comment #72)
> 
> Patch as posted passed bootstrap & regtest.  Adjusted according to 
> comments but not tested otherwise - please somebody throw at
> unpatched 416.gamess.

Still miscompares on aarch64, I'm afraid. (Both with and without
-fno-aggressive-loop-optimizations.)

Also where Jakub wrote:
> If you want to go this way, I'd at least key it off DECL_COMMON on the decl.
> And instead of multiplying max_size by 2 perhaps just add BITS_PER_UNIT?

I wonder why you prefer setting such an arbitrary guess at max_size rather than
going with -1 which is defined as "unknown" ?

[Bug fortran/69368] [6 Regression] spec2006 test case 416.gamess fails with the g++ 6.0 compiler starting with r232508

2016-02-18 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69368

--- Comment #53 from alalaw01 at gcc dot gnu.org ---
(In reply to Thomas Koenig from comment #44)
> I don't have access to SPEC, so I can only guess... Is there maybe an 
> equivalence involved, something like

Turns out the COMMON is accessed via a MEM_REF in a loop, or as a VAR_DECL
inside. Go figure! :)

(In reply to Dominique d'Humieres from comment #49)
> I don't see the point to add yet another option just because "SPEC does not
> want to change the invalid Fortran". I think SPEC should be run with the
> option(s) causing the problem disabled.

Anecdotally I hear from Fortran-using colleagues this may occur in other places
too. Moreover, the list of phases using get_ref_base_and_extent is long; we
could end up compiling with an ever-growing -fno-this -fno-that as more and
more phases make use of the "bad" analysis results (which are correct by the
language spec, after all). In this case, there are a few other equivalences
found due to the tree-ssa-scopedtables.c changes that we'd lose with
-fno-tree-dominator-opts, too.

(In reply to H.J. Lu from comment #52)
>
> So, there is nothing to fix in GCC? Why isn't this bug closed as invalid?

Not everyone wants to patch SPEC sources.

[Bug fortran/69368] [6 Regression] spec2006 test case 416.gamess fails with the g++ 6.0 compiler starting with r232508

2016-02-18 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69368

--- Comment #43 from alalaw01 at gcc dot gnu.org ---
Yeah, I plan to add a fortran-specific option for this, it's easy enough, but I
can't run the gfortran testsuite with that, because there are lots of C files
in there too, for which the compiler doesn't accept the option...

I'm having trouble writing a testcase though. My subroutine with

IMPLICIT DOUBLE PRECISION (X)
COMMON /MYCOMMON / X(1)

produces "mycommon.x" a COMPONENT_REF, but with "mycommon" being a MEM_REF,
which requires only the hunk to tree-dfa.c to handle correctly; whereas in
SPEC2006, what looks to me to be equivalent FORTRAN, ends up with "mycommon"
being a VAR_DECL, which requires the much-bigger patch to the fortran FE...

I've very little fortran experience here, any tips?

Thanks, Alan

[Bug fortran/69368] [6 Regression] spec2006 test case 416.gamess fails with the g++ 6.0 compiler starting with r232508

2016-02-17 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69368

--- Comment #39 from alalaw01 at gcc dot gnu.org ---
Created attachment 37726
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=37726&action=edit
Proposed patch (without flag).

Here's a prototype patch, that sets TYPE_SIZE to NULL_TREE but leaves DECL_SIZE
intact.  For the moment I'm applying this universally, rather than gating under
a flag, to ease testing check-fortran.  Only
gfortran.dg/gomp/appendix-a/a.24.1.f90 fails; in practice I think it's OK just
to not use the new code in conjunction with -fopenmp.

On AArch64, it fixes the 416.gamess issue, and allows compiling 416.gamess
without the -fno-aggressive-loop-optimizations previously required.

Also bootstraps and passes check-gcc check-fortran check-g++, on aarch64 and
x86_64, except as noted above. I expect to add a Fortran-only flag to gate the
trans-common.c changes before taking this to gcc-patches@ .

The worry is that, while many cases in the mid-end were happy with a null
TYPE_SIZE, I still had to patch up a couple, so I might not have got them all.
(Indeed, omp-low.c had too many!) I'm not sure this is any worse than adding a
new flag to the decl (indicating that the DECL_SIZE is not to be trusted) and
then trying to find all the cases where the DECL_SIZE is wrongly relied upon:
with the latter approach, the compiler would generate invalid code, rather than
"failing fast".

Thoughts welcome!

[Bug fortran/69368] [6 Regression] spec2006 test case 416.gamess fails with the g++ 6.0 compiler starting with r232508

2016-02-09 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69368

--- Comment #37 from alalaw01 at gcc dot gnu.org ---
(In reply to Jakub Jelinek from comment #36)
> As Richard said, you can do similar (invalid too) stuff in C too, say:
> struct S { int a[1]; } s;
> in one TU and
> struct S { int a[1]; } s;
> 
> int
> foo (int x)
> {
>   return s.a[x];
> }
> 
> int
> bar (int x)
> {
>   return s.a[1 + x] + s.a[0] + s.a[x];
> }
> 
> GCC 5 would compile it to what the author might have meant, while GCC 6 will
> optimize bar into s.a[0] * 3;

Yes, this was what I meant in comment #33. The question is, do we care? (Or, do
we only care in the FORTRAN case?)

If so, then we presumably want a -fbroken-common-blocks (or something!) that is
not FE-specific.

[Bug fortran/69368] [6 Regression] spec2006 test case 416.gamess fails with the g++ 6.0 compiler starting with r232508

2016-02-08 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69368

alalaw01 at gcc dot gnu.org changed:

   What|Removed |Added

 Status|RESOLVED|REOPENED
 Resolution|FIXED   |---

--- Comment #33 from alalaw01 at gcc dot gnu.org ---
(In reply to rguent...@suse.de from comment #31)
> 
> Thus a "fix" for the case where treating a[i] as a[0] is the issue
> would be
> 
> Index: gcc/tree-dfa.c
> ===
> --- gcc/tree-dfa.c  (revision 233172)
> +++ gcc/tree-dfa.c  (working copy)
> @@ -617,7 +617,11 @@ get_ref_base_and_extent (tree exp, HOST_
>if (maxsize == -1
>   && DECL_SIZE (exp)
>   && TREE_CODE (DECL_SIZE (exp)) == INTEGER_CST)
> -   maxsize = wi::to_offset (DECL_SIZE (exp)) - bit_offset;
> +   {
> + maxsize = wi::to_offset (DECL_SIZE (exp)) - bit_offset;
> + if (maxsize == size)
> +   maxsize = -1;
> +   }
>  }
>else if (CONSTANT_CLASS_P (exp))
>  {

So is there a case where we want this for C ?

If I declare a struct with a VLA, and access it through a pointer - GCC
recognizes the VLA idiom and keeps the accesses. If I access it from a decl,
yes we optimize away the out-of-bounds accesses (in FRE, long before we reach
the tree-ssa-scopedtables changes). So OK, if I access it from an extern or
__attribute__((weak)) decl, which I then get the linker to replace with a bigger
decl, then I get "wrong" code (it ignores the extra elements in the bigger
decl) - but I'd say that was invalid code.

So if this is Fortran-only, we probably have to hook off --std=legacy, right?
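
The through-a-pointer case is, as I read it, the trailing-array idiom. A
minimal sketch in C follows, using the C99 flexible array member as the
well-defined spelling of it (an assumption on my part; the text above may have
had a[1] in mind). The type and function names are invented for illustration.

```c
#include <stdlib.h>

/* Trailing array whose real length is only known at allocation time.  */
struct vec
{
  int n;
  int a[];
};

/* Allocate room for n trailing elements and fill them with 0..n-1.  */
struct vec *
vec_new (int n)
{
  struct vec *v = malloc (sizeof *v + (size_t) n * sizeof v->a[0]);
  if (v != NULL)
    {
      v->n = n;
      for (int i = 0; i < n; i++)
	v->a[i] = i;
    }
  return v;
}

/* Every access goes through the pointer, so the compiler cannot assume
   a fixed bound from a declaration.  */
int
vec_sum (const struct vec *v)
{
  int sum = 0;
  for (int i = 0; i < v->n; i++)
    sum += v->a[i];
  return sum;
}
```

The contrast with the decl case is that here there is no declared object whose
size bounds the accesses.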

[Bug fortran/69368] [6 Regression] spec2006 test case 416.gamess fails with the g++ 6.0 compiler starting with r232508

2016-02-05 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69368

--- Comment #32 from alalaw01 at gcc dot gnu.org ---
(In reply to rguent...@suse.de from comment #31)
>
> Thus a "fix" for the case where treating a[i] as a[0] is the issue
> would be
> 
> Index: gcc/tree-dfa.c
> ===
> --- gcc/tree-dfa.c  (revision 233172)
> +++ gcc/tree-dfa.c  (working copy)
> @@ -617,7 +617,11 @@ get_ref_base_and_extent (tree exp, HOST_
>if (maxsize == -1
>   && DECL_SIZE (exp)
>   && TREE_CODE (DECL_SIZE (exp)) == INTEGER_CST)
> -   maxsize = wi::to_offset (DECL_SIZE (exp)) - bit_offset;
> +   {
> + maxsize = wi::to_offset (DECL_SIZE (exp)) - bit_offset;
> + if (maxsize == size)
> +   maxsize = -1;
> +   }
>  }
>else if (CONSTANT_CLASS_P (exp))
>  {

Maybe if we only did that for DECL_COMMONs if -std=legacy was in force?

Tho as you say:

> but that wouldn't fix the aggressive-loop optimization issue as that is
> _not_ looking at DECL_SIZE but at the array types domain.

I wonder if we can't get both places looking at the same thing (DECL_SIZE or
array type domain), but I haven't looked into that at all.
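
To spell out what the quoted hunk does to the numbers, here is a toy model in
C of just that computation. It is not the real get_ref_base_and_extent, which
works on trees and offset_int values; plain long values in bits are used
instead, and toy_max_size and its parameters are names made up for this
sketch.

```c
/* Toy model: all sizes are in bits and -1 means "unknown".
   with_patch selects the behaviour of the quoted hunk.  */
long
toy_max_size (long maxsize, long decl_size, long bit_offset, long size,
	      int with_patch)
{
  if (maxsize == -1 && decl_size >= 0)
    {
      /* Clamp the unknown extent to what is left of the decl.  */
      maxsize = decl_size - bit_offset;
      /* The patch: if that leaves exactly one element, do not claim a
	 fixed-size access; report "unknown" again.  */
      if (with_patch && maxsize == size)
	maxsize = -1;
    }
  return maxsize;
}
```

For a one-element COMMON array of doubles (decl size 64, offset 0, access size
64) the unpatched model reports 64 and the patched one reports -1; for a
larger array the clamp is unchanged.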

[Bug fortran/69368] [6 Regression] spec2006 test case 416.gamess fails with the g++ 6.0 compiler starting with r232508

2016-02-05 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69368

alalaw01 at gcc dot gnu.org changed:

   What|Removed |Added

 Resolution|INVALID |FIXED

--- Comment #27 from alalaw01 at gcc dot gnu.org ---
(In reply to Richard Biener from comment #25)
> (In reply to alalaw01 from comment #23)
> > Well, this one is not fixed by -fno-aggressive-loop-optimizations.
> 
> No, that just disabled one symptom of the issue at that point in time. 
> Fixing the issue also fixes this occurance (well, I hope so ;))

So by "fixing the issue" - we mean, making --std=legacy prevent this (as
although against the SPEC, colleagues with more FORTRAN knowledge than I
suggest this is common)? SPEC seem to be saying they will not change the
source: https://www.spec.org/cpu2006/Docs/faq.html#Run.05


As Jakub suggested in comment #13:

> So, perhaps we want some flag on the Fortran COMMON decls that would be set
> on COMMON that ends with an array and would tell get_ref_base_and_extent (and
> other spots?) that accesses can be beyond end of the decl?

but only if --std=legacy ? ? ?

Should I raise a new bug for this, as both this and 53068 are CLOSED?

[Bug fortran/69368] [6 Regression] spec2006 test case 416.gamess fails with the g++ 6.0 compiler starting with r232508

2016-02-04 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69368

alalaw01 at gcc dot gnu.org changed:

   What|Removed |Added

 Resolution|DUPLICATE   |FIXED

--- Comment #23 from alalaw01 at gcc dot gnu.org ---
Well, this one is not fixed by -fno-aggressive-loop-optimizations.

[Bug fortran/69368] [6 Regression] spec2006 test case 416.gamess fails with the g++ 6.0 compiler starting with r232508

2016-02-04 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69368

--- Comment #20 from alalaw01 at gcc dot gnu.org ---
Hmmm, hang on. In unport.fppized.f, shouldn't we be using the 'F2C/GCC COMPILER
ON PC RUNNING UNIX (LINUX,BSD386,ETC)' version? In which case X has size (1)
everywhere?

[Bug tree-optimization/69368] [6 Regression] spec2006 test case 416.gamess fails with the g++ 6.0 compiler starting with r232508

2016-02-03 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69368

--- Comment #10 from alalaw01 at gcc dot gnu.org ---
The stores are getting optimized out because equal_mem_array_ref_p treats as
equal pairs of MEM_REFs like

fmcom.x[_168] and fmcom.x[_208]

That is, an ARRAY_REF whose first operand is a COMPONENT_REF fmcom.x (of a
VAR_DECL and a FIELD_DECL), and whose second operand is an SSA_NAME, _168 or
_208. (I don't see anything obvious to suggest that they should be equal.)

get_ref_base_and_extent then returns base=fmcom, size=64, max_size=64 (so not a
variable-sized access), and offset 0 :-(.
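
Why that result makes the two references compare equal can be shown with a toy
model in C. This is only a sketch of the idea of comparing references by
(base, offset, size, max_size); the struct and function below are invented
here and are much simpler than the real equal_mem_array_ref_p.

```c
#include <stdbool.h>

/* Toy version of what get_ref_base_and_extent reports; sizes in bits,
   max_size of -1 meaning "unknown".  */
struct toy_extent
{
  const void *base;
  long offset;
  long size;
  long max_size;
};

/* Treat two references as the same location only if both are
   fixed-size accesses (size == max_size) to the same base and offset.  */
bool
toy_same_location (const struct toy_extent *a, const struct toy_extent *b)
{
  return a->size == a->max_size
	 && b->size == b->max_size
	 && a->base == b->base
	 && a->offset == b->offset
	 && a->size == b->size;
}
```

If both fmcom.x[_168] and fmcom.x[_208] come back as (fmcom, 0, 64, 64), this
model calls them the same location even though the indices differ; had either
come back with max_size -1, it would not.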

[Bug middle-end/66877] [6 Regression] FAIL: gcc.dg/vect/vect-over-widen-3-big-array.c -flto -ffat-lto-objects scan-tree-dump-times vect "vect_recog_over_widening_pattern: detected" 2

2016-01-22 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66877

alalaw01 at gcc dot gnu.org changed:

   What|Removed |Added

 Status|WAITING |ASSIGNED
 CC||alalaw01 at gcc dot gnu.org
   Assignee|rguenth at gcc dot gnu.org |alalaw01 at gcc dot gnu.org

--- Comment #7 from alalaw01 at gcc dot gnu.org ---
I can test on ARM ;), so taken -
https://gcc.gnu.org/ml/gcc-patches/2016-01/msg01727.html.

[Bug testsuite/69380] [6 Regression] FAIL: g++.dg/tree-ssa/pr69336.C scan-tree-dump-not optimized "cmap"

2016-01-21 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69380

alalaw01 at gcc dot gnu.org changed:

   What|Removed |Added

 Target|arm-none-eabi powerpc*-*-*  |arm-none-eabi powerpc*-*-*
   ||aarch64*-*-*
 CC||alalaw01 at gcc dot gnu.org

--- Comment #2 from alalaw01 at gcc dot gnu.org ---
Adding "--param max-sra-scalarization-size-Ospeed=72" makes the testcase pass;
or we can XFAIL on arm, AArch64, powerpc (presumably also hppa, alpha, and
others). 72 is quite large; thoughts?

(This suggests that when we move the logic in gimplify_init_constructor, for
pushing stuff out to the constant pool, into tree-sra.c, we may also want to
refine it a bit. But in the meantime...)

[Bug tree-optimization/69352] [6 Regression] profiledbootstrap failure with --with-build-config=bootstrap-lto

2016-01-19 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69352

--- Comment #9 from alalaw01 at gcc dot gnu.org ---
(In reply to Jakub Jelinek from comment #7)
> There are various bugs in the r232508 change.
> The
>   gcc_assert (sz0 == sz1);
>   gcc_assert (max0 == max1);
>   gcc_assert (rev0 == rev1);
> asserts are clearly bogus, while for compatible type I bet size will be
> always the same, maximum size can be arbitrary (it will be either equal to
> size, then it is a fixed access, or it will be larger, then it is a variable
> access), and the reverse stuff looks weird (e.g. I think the lack of
> REF_REVERSE_STORAGE_ORDER testing in operand_equal_p is a bug).  For a
> variable access, even if you remove the above max{0,1} assert, I think you
> would happily equate say a[i] and a[j] ARRAY_REFs, because they have the
> same off (likely 0) and max (likely size of array in bits).  Another problem
> I see in the
>   return equal_mem_array_ref_p (expr0->ops.single.rhs,
> expr1->ops.single.rhs)
>  || operand_equal_p (expr0->ops.single.rhs,
>  expr1->ops.single.rhs, 0);
> case; under some conditions you decide to hash the MEM_REF/ARRAY_REFs as
> MEM_REF , hash of base, offset and size, so you should use the same
> conditions to decide if you use equal_mem_array_ref_p or operand_equal_p,

Agreed, yes. That would fix the bogus asserts, right - we would then only use
equal_mem_array_ref_p if size==max_size.

> Plus, I'm not sure in what places this hashing
> is used, I'm worried you might hash MEM_REFs with different alias types for
> the accesses as equal, which for some uses might be fine, if you are not
> trying to replace one with another etc., but for other cases it might lead
> to wrong-code.

I think it should be OK to ignore differences in alias type for DOM
optimizations etc., indeed, which is where this was intended.
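
The size==max_size condition discussed above can be illustrated with a small,
self-contained C sketch. This is not the code in tree-ssa-scopedtables.c; the
struct toy_ref and the function toy_extent_equal_p are hypothetical, and the
base is reduced to an integer id for brevity.

```c
#include <stdbool.h>

/* Hypothetical summary of a memory reference; offset and sizes in bits. */
struct toy_ref {
  int base_id;    /* identifies the underlying declaration */
  long offset;
  long size;
  long max_size;  /* larger than size for a variable access */
};

/* Compare by extent only when both references are fixed accesses;
   otherwise report "not known equal" so that the caller can fall back
   to a structural comparison. */
static bool
toy_extent_equal_p (struct toy_ref a, struct toy_ref b)
{
  if (a.size != a.max_size || b.size != b.max_size)
    return false;
  return a.base_id == b.base_id && a.offset == b.offset && a.size == b.size;
}
```

Under this guard, a[i] and a[j] on an 8-element int array (both summarized as
offset 0, size 32, max_size 256) are no longer equated, while two fixed
accesses to the same offset still are.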

[Bug tree-optimization/69336] Constant value not detected

2016-01-18 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69336

alalaw01 at gcc dot gnu.org changed:

   What|Removed |Added

 CC||alalaw01 at gcc dot gnu.org

--- Comment #4 from alalaw01 at gcc dot gnu.org ---
That looks reasonable, AFAICT get_ref_base_and_extent will deal with anything
that is handled_component_p. The same patch enables the optimization on
aarch64, with appropriate --param sra-max-scalarization-size-Ospeed to pull the
constant-pool entry in.

[Bug target/63679] [5/6 Regression][AArch64] Failure to constant fold.

2016-01-18 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63679

--- Comment #40 from alalaw01 at gcc dot gnu.org ---
Author: alalaw01
Date: Mon Jan 18 12:40:43 2016
New Revision: 232508

URL: https://gcc.gnu.org/viewcvs?rev=232508&root=gcc&view=rev
Log:
Equate MEM_REFs and ARRAY_REFs in tree-ssa-scopedtables.c

PR target/63679

gcc/:

* tree-ssa-scopedtables.c (avail_expr_hash): Hash MEM_REF and ARRAY_REF
using get_ref_base_and_extent.
(equal_mem_array_ref_p): New.
(hashable_expr_equal_p): Add call to previous.

gcc/testsuite/:

* gcc.dg/tree-ssa/ssa-dom-cse-5.c: New.
* gcc.dg/tree-ssa/ssa-dom-cse-6.c: New.
* gcc.dg/tree-ssa/ssa-dom-cse-7.c: New.

Added:
trunk/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-cse-5.c
trunk/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-cse-6.c
trunk/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-cse-7.c
Modified:
trunk/gcc/ChangeLog
trunk/gcc/testsuite/ChangeLog
trunk/gcc/tree-ssa-scopedtables.c

[Bug target/63679] [5/6 Regression][AArch64] Failure to constant fold.

2016-01-18 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63679

--- Comment #39 from alalaw01 at gcc dot gnu.org ---
Author: alalaw01
Date: Mon Jan 18 12:29:02 2016
New Revision: 232506

URL: https://gcc.gnu.org/viewcvs?rev=232506&root=gcc&view=rev
Log:
Make SRA scalarize constant-pool loads

PR target/63679

gcc/ChangeLog:

* tree-sra.c (disqualified_constants, constant_decl_p): New.
(sra_initialize): Allocate disqualified_constants.
(sra_deinitialize): Free disqualified_constants.
(disqualify_candidate): Update disqualified_constants when appropriate.
(create_access): Scan for constant-pool entries as we go along.
(scalarizable_type_p): Add check against type_contains_placeholder_p.
(maybe_add_sra_candidate): Allow constant-pool entries.
(load_assign_lhs_subreplacements): Bind debug for constant pool vars.
(initialize_constant_pool_replacements): New.
(sra_modify_assign): Avoid mangling assignments created by previous,
and don't generate writes into constant pool.
(sra_modify_function_body): Call initialize_constant_pool_replacements.

gcc/testsuite/:

* gcc.dg/tree-ssa/sra-17.c: New.
* gcc.dg/tree-ssa/sra-18.c: New.

Added:
trunk/gcc/testsuite/gcc.dg/tree-ssa/sra-17.c
trunk/gcc/testsuite/gcc.dg/tree-ssa/sra-18.c
Modified:
trunk/gcc/ChangeLog
trunk/gcc/testsuite/ChangeLog
trunk/gcc/tree-sra.c

[Bug middle-end/68112] [6 Regression] FAIL: gcc.target/i386/avx512ifma-vpmaddhuq-2.c (test for excess errors)

2016-01-13 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68112

alalaw01 at gcc dot gnu.org changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #5 from alalaw01 at gcc dot gnu.org ---
Sorry - I believe this was fixed by r229660 (a reversion of the originating
r229437), and should still be fixed following the alternative r229825. Can you
(HJ?) please reopen if that is not the case.

[Bug target/69053] [6 Regression] ICE in build_vector_from_val

2016-01-12 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69053

--- Comment #9 from alalaw01 at gcc dot gnu.org ---
I can confirm that both Richi's patch in comment 6 and my patchlet in comment
3 pass bootstrap + check-gcc on ARM and AArch64, and fix the ICE observed on
ARM. (ICE never observed on AArch64.)

[Bug tree-optimization/67682] Missed vectorization: (another) straight-line memcpy/memset not vectorized when equivalent loop is

2016-01-11 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67682

alalaw01 at gcc dot gnu.org changed:

   What|Removed |Added

 Status|WAITING |RESOLVED
 Resolution|--- |FIXED

--- Comment #3 from alalaw01 at gcc dot gnu.org ---
Yes, r230330.

[Bug tree-optimization/69166] [6 Regression] ICE in get_initial_def_for_reduction, at tree-vect-loop.c:4188

2016-01-08 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69166

alalaw01 at gcc dot gnu.org changed:

   What|Removed |Added

 Status|RESOLVED|REOPENED
   Last reconfirmed||2016-01-08
 CC||alalaw01 at gcc dot gnu.org
 Resolution|DUPLICATE   |---
 Ever confirmed|0   |1

--- Comment #2 from alalaw01 at gcc dot gnu.org ---
No, not a dup - 69053 results from a type mismatch/missing conversion building
the initial value for a COND_EXPR; this PR is because the 'reduction' is an
RDIV_EXPR, which get_initial_def_for_reduction doesn't handle.

The testcase invokes undefined behaviour, too (e is not initialized). Moving
'double *e' to be a parameter to fn2 avoids the ICE.

[Bug target/69053] [6 Regression] ICE in build_vector_from_val

2016-01-06 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69053

--- Comment #3 from alalaw01 at gcc dot gnu.org ---
Well, this fixes it, but I'm not sure it fixes it in the right place...

diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index ee32166..bd66aa5 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -4178,7 +4178,9 @@ get_initial_def_for_reduction (gimple *stmt, tree init_val
break;
  }
  }
-   init_def = build_vector_from_val (vectype, init_value);
+   init_def = build_vector_from_val (vectype,
+ fold_convert (TREE_TYPE (vectype),
+   init_value));
break;

   default:

[Bug target/69053] [6 Regression] ICE in build_vector_from_val

2016-01-05 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69053

--- Comment #2 from alalaw01 at gcc dot gnu.org ---
build_vector_from_val then gets called to build a vector (4) unsigned long,
from an int* (which is the right signedness and size, but being a pointer it is
not types_compatible_p).

[Bug target/69053] [6 Regression] ICE in build_vector_from_val

2016-01-05 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69053

alalaw01 at gcc dot gnu.org changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
   Last reconfirmed||2016-01-05
 CC||alahay01 at gcc dot gnu.org
 Ever confirmed|0   |1

--- Comment #1 from alalaw01 at gcc dot gnu.org ---
Yes. The r230423 change means x86 now has a reduc_umax_scal_optab for V4DI,
causing the loop to be vectorized as a COND_REDUCTION. (It is not vectorized on
e.g. AArch64, as that platform has reduc_umax_scal_optabs only for vector modes
with smaller elements, not V2DI).
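
As a conceptual illustration of why an unsigned-max reduction is involved, here
is a small C model of a conditional reduction. It is not the testcase from this
PR and not the vectorizer's implementation; the function names, the 4-lane
width and the position encoding are all assumptions made for the sketch. Each
lane remembers the value and position of its most recent match, and a final
unsigned max over the positions selects the overall last match.

```c
#define TOY_LANES 4

/* Scalar loop: keep the last element that is below the threshold. */
static int
toy_last_below_scalar (const int *a, int n, int threshold, int init)
{
  int last = init;
  for (int i = 0; i < n; i++)
    if (a[i] < threshold)
      last = a[i];
  return last;
}

/* Lane-wise model; n is assumed to be a multiple of TOY_LANES. */
static int
toy_last_below_lanes (const int *a, int n, int threshold, int init)
{
  int val[TOY_LANES];
  unsigned pos[TOY_LANES];  /* 1-based position; 0 means "no match yet" */

  for (int l = 0; l < TOY_LANES; l++)
    {
      val[l] = init;
      pos[l] = 0;
    }

  for (int i = 0; i < n; i += TOY_LANES)
    for (int l = 0; l < TOY_LANES; l++)
      if (a[i + l] < threshold)
        {
          val[l] = a[i + l];
          pos[l] = (unsigned) (i + l) + 1;
        }

  /* Final reduction: unsigned max over positions, then pick that lane. */
  unsigned best = 0;
  int result = init;
  for (int l = 0; l < TOY_LANES; l++)
    if (pos[l] > best)
      {
        best = pos[l];
        result = val[l];
      }
  return result;
}
```

Both functions return the same value for any input whose length is a multiple
of TOY_LANES; the second form only needs element-wise selects inside the loop
and one max reduction after it.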

[Bug tree-optimization/68707] [6 Regression] testcase gcc.dg/vect/O3-pr36098.c vectorized using VEC_PERM_EXPR rather than VEC_LOAD_LANES

2015-12-22 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68707

--- Comment #23 from alalaw01 at gcc dot gnu.org ---
Yes, difficult. I'm conscious that this is stage 3, and worried about adding
too much complexity, especially if we're writing code that we'd eventually drop
in favour of a more complete framework later (i.e. in gcc7).

I'm inclined against

> (I wondered
> if load-lanes would require more unrolling we should prefer SLP anyway?).

As we've seen cases where load-lanes requires more unrolling but the code is
still much better. Likewise your argument against

> to query whether _all_ loads need to be permuted with SLP
...
> thus if there is a load node which is not permuted then retain the SLP.

seems convincing. I think the heuristic in comment 16 handles permutation well
enough, and beyond that, sharing (rather than the permutation) then appears to
be the critical factor. Unfortunately as you say SLP doesn't really handle
sharing yet...so

> I fear that to get a better heuristic
> than what is proposed we need to push this for example to
> vect_make_slp_decision where all instances are built

Might be reasonable, but I fear it'd be of dubious benefit without:

> and we'd need to gather some sharing data therein.

I guess if that were a useful step towards

> But then there is only a small step to the point where we could actually
> compare SLP vs. non-SLP costs.

then there is some justification, but the former feels like too much complexity
at this stage - especially to do it well; how much do we really want to gather
data on the sharing that exists at present, rather than looking at removing
that sharing entirely? I'm thinking of e.g. SLP nodes that are performing the
same computations but with different permutations too - shouldn't we be aiming
at making permutations into first class citizens/operations, and making SLP
trees into DAGs? Longer-term goals, sure...

So my instinct is to go with the comment 16 patch, and accept that we take the
hit in that last testcase (i.e. the one with the sharing).

[Bug tree-optimization/68707] [6 Regression] testcase gcc.dg/vect/O3-pr36098.c vectorized using VEC_PERM_EXPR rather than VEC_LOAD_LANES

2015-12-17 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68707

--- Comment #21 from alalaw01 at gcc dot gnu.org ---
Here's the smallest testcase I could come up with (where SLP gets cancelled,
but we end up with fewer st2's than before)...the key seems to be things being
used in multiple places.

#define N 1024

int in1[N], in2[N];
int out1[N], out2[N];
int w[N];

void foo() {
  for (int i = 0; i < N; i+=2)
    {
      int a = in1[i] & in2[i];
      int b = in1[i+1] & in2[i+1];
      out1[i] = a;
      out1[i+1] = b;
      out2[i] = (a + w[i]) ^ (b+w[i+1]);
      out2[i+1] = (b + w[i]) ^ (a+w[i+1]);
    }
}

[Bug tree-optimization/68707] [6 Regression] testcase gcc.dg/vect/O3-pr36098.c vectorized using VEC_PERM_EXPR rather than VEC_LOAD_LANES

2015-12-17 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68707

--- Comment #20 from alalaw01 at gcc dot gnu.org ---
> Would be nice to have a reduced testcase for this one.

Working on it. Sadly it's fortran :(

The SLP tree that gets cancelled, is quite big (and quite untreelike, if we
could see that - a large portion, 7 nodes, is repeated but with the 2 stmts in
each SLP node reversed). "Decided to SLP 2 instances" indeed becomes "Decided
to SLP 1 instances", with Unrolling factor 2 both times. In the case where the
SLP gets cancelled, several more stmts that would have featured in that tree
are marked hybrid. The 'vector inside of loop cost' increases from 180 (with
SLP) to 308 (if cancelled), but minimum iters for profitability stays at 3.
However, the SLP-cancelled case, outputs a whole extra section 

note: === scheduling SLP instances ===
...
note: -->vectorizing SLP node starting from: (one of the loads in the
cancelled tree) * 4
...
note: vectorizing stmts using SLP.

(Tho I suspect that's a red herring.)

Whereas later the non-cancelled case, clearly has an extra 'note: add new stmt:
MEM[...] = STORE_LANES'...sounding as if perhaps the SLP finds it can use ST2
opportunistically (??).

[Bug tree-optimization/68707] [6 Regression] testcase gcc.dg/vect/O3-pr36098.c vectorized using VEC_PERM_EXPR rather than VEC_LOAD_LANES

2015-12-16 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68707

--- Comment #18 from alalaw01 at gcc dot gnu.org ---
Well, we've seen this patch fix some of the vectorizer performance regressions
we've had on some benchmarks.

On SPEC...the "SLP cancelled" case triggers all over the place, but in most of
those cases, doesn't lead to any codegen difference. (Presumably SLP would have
failed anyway for some other reason, e.g. costs, and either we generate
load/store-lanes either way, or we still *can't* generate load/store-lanes...).
The only sub-benchmark where codegen changes is facerec, where we seem to
*lose* st2 rather than gain... this needs more analysis.

[Bug tree-optimization/68707] [6 Regression] testcase gcc.dg/vect/O3-pr36098.c vectorized using VEC_PERM_EXPR rather than VEC_LOAD_LANES

2015-12-14 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68707

--- Comment #13 from alalaw01 at gcc dot gnu.org ---
Hmmm, I realize a "definite" codegen improvement was maybe a bad choice of
wording. A "substantial" (albeit uncertain!) improvement, may have been more
accurate...

However, yes it looks like we want that patch (indeed, it still helps even when
we up the cost of permute operations and drop the -fno-vect-cost-model) - so
thanks, Richard. We'll clean up the testisms in due course.

In the longer term, the issue here is that we aren't comparing the costs of SLP
vs load-lanes, right? We merely compare the cost of whichever of those
vectorization strategies we favour, permutes et al, vs leaving it in scalar
code?

[Bug tree-optimization/68707] [6 Regression] testcase gcc.dg/vect/O3-pr36098.c vectorized using VEC_PERM_EXPR rather than VEC_LOAD_LANES

2015-12-11 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68707

--- Comment #10 from alalaw01 at gcc dot gnu.org ---
This causes the scan-tree-dump-times 'vectorizing stmts using SLP' checks to
FAIL in slp-perm-{1,2,3,5,6,7,8,11}.c. Looking at the assembler before and
after...

slp-perm-1.c: this looks a big win; several st3's are generated instead of many
stp's, we lose all the tbl's, and many constant-pool entries consisting of
'byte's are removed, with the corresponding ADRP's. The loop is fully unrolled
in both cases, and the new code is much shorter (48 instructions rather than
95).

slp-perm-2.c: less clear, but looks like an overall win. Loop gets unrolled by
factor of 2; each "half" loses a TRN1 and a TRN2 but gains an ORR (move).

slp-perm-3.c: Again we lose a load of constants and ADRPs (outside the
4-iteration loop), gaining some MOVIs. With the patchlet, the loop gets fully
unrolled, and loses 4*tbl per iteration (!). Still executing 8*mul, 8*mla,
4*add, but dropping the TBLs again makes for a win.

slp-perm-5.c: less clear, but again looks like an overall win. Both loops have
been fully unrolled, and the combining of stores doesn't help much (we seem to
gain as many moves as we lose stores!). But with the patch, we lose several
TBLs and TRNs. Also an MLA becomes a MUL.

A side comment would be that if we could 'fix' the register allocation here, to
put things into the right place ready for the stN rather than moving it there
later, we'd have quite a big win...but that's another issue.

Also a recurring theme is that the vec_(load/store)_lanes approach seems to
make much better use of movi, rather than pushing things into the constant
pool. I haven't really looked into this, it may be fundamental, or just a
limitation of our current code for loading immediates.

slp-perm-6.c: some wins from constants, and dropping 8 tbls.

slp-perm-7.c: Similarly.

slp-perm-8.c: Loop here iterates 4 times, and the ld3/st3 manages to lose us
4*move and 9*tbl per iteration (!); huge improvement.

slp-perm-11.c: a 16-iteration loop gets unrolled *2, and now uses an st2, but
no load_lanes, just a bunch of ldr's: 10 rather than the original 3(*2). 3 strs
become 4 stp's (+st2). Doesn't look like an improvement!

However, 7 out of 8 cases look better, in some cases much better. So I'd say
that was a definite codegen improvement :).

[Bug tree-optimization/68707] [6 Regression] testcase gcc.dg/vect/O3-pr36098.c vectorized using VEC_PERM_EXPR rather than VEC_LOAD_LANES

2015-12-08 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68707

--- Comment #8 from alalaw01 at gcc dot gnu.org ---
Adding a check against BB SLP avoids some regressions caused by bailing out of
BB SLP when we can't then do a load/store-lanes.

[Bug tree-optimization/68707] [6 Regression] testcase gcc.dg/vect/O3-pr36098.c vectorized using VEC_PERM_EXPR rather than VEC_LOAD_LANES

2015-12-08 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68707

--- Comment #6 from alalaw01 at gcc dot gnu.org ---
Well, I can confirm that the patch generates load-lanes/store-lanes instead of
SLP, all over the (vect) testsuite. All execution tests are passing :) so it
*may* just be a case of updating a lot of scan-tree-dump tests but we'll need
to do at least some performance evaluation, watch this space.

[Bug tree-optimization/68707] testcase gcc.dg/vect/O3-pr36098.c vectorized using VEC_PERM_EXPR rather than VEC_LOAD_LANES

2015-12-04 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68707

--- Comment #1 from alalaw01 at gcc dot gnu.org ---
Created attachment 36929
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36929&action=edit
tree-vect-details dump (after patch, with SLP)

[Bug tree-optimization/68707] New: testcase gcc.dg/vect/O3-pr36098.c vectorized using VEC_PERM_EXPR rather than VEC_LOAD_LANES

2015-12-04 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68707

Bug ID: 68707
   Summary: testcase gcc.dg/vect/O3-pr36098.c vectorized using
VEC_PERM_EXPR rather than VEC_LOAD_LANES
   Product: gcc
   Version: 6.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: alalaw01 at gcc dot gnu.org
  Target Milestone: ---
Target: aarch64, arm

Created attachment 36928
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36928&action=edit
tree-vect-details dump (before patch, with LOAD_LANES)

Prior to r230993, O3-pr36098.c (at -O3) was vectorized using a LOAD_LANES /
STORE_LANES, resulting in:

.L5:
ld4 {v4.4s - v7.4s}, [x7], 64
add w4, w4, 1
cmp w3, w4
orr v1.16b, v4.16b, v4.16b
orr v2.16b, v5.16b, v5.16b
orr v3.16b, v6.16b, v6.16b
st3 {v1.4s - v3.4s}, [x6], 48
bhi .L5

Each iteration of the outer loop processes a struct of 4 ints, of which the
first 3 are copied to a destination. The ld4 nicely gets us four structs with
all the elements we want in three registers row-wise (and the elements we don't
want in a fourth):
struct1 struct2 struct3 struct4
v4.s[0] v4.s[1] v4.s[2] v4.s[3]
v5.s[0] v5.s[1] v5.s[2] v5.s[3]
v6.s[0] v6.s[1] v6.s[2] v6.s[3]
v7.s[0] v7.s[1] v7.s[2] v7.s[3]
and st3 stores the desired rows (only) to the right locations.
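
The de-interleaving described above can be modelled in plain C. This is only a
scalar sketch of what ld4 and st3 do to the data layout, with hypothetical
function names and a fixed width of four 32-bit lanes; it says nothing about
how the vectorizer chooses these instructions.

```c
#define TOY_LANES 4

/* Model of ld4: read TOY_LANES structures of four ints and de-interleave
   them so that regs[f][k] holds field f of structure k. */
static void
toy_ld4 (const int *src, int regs[4][TOY_LANES])
{
  for (int k = 0; k < TOY_LANES; k++)
    for (int f = 0; f < 4; f++)
      regs[f][k] = src[4 * k + f];
}

/* Model of st3: interleave the first three registers back to memory as
   TOY_LANES structures of three ints; regs[3] is simply not stored. */
static void
toy_st3 (int *dst, int regs[4][TOY_LANES])
{
  for (int k = 0; k < TOY_LANES; k++)
    for (int f = 0; f < 3; f++)
      dst[3 * k + f] = regs[f][k];
}

/* One loop iteration: copy the first three fields of four structures. */
static void
toy_copy_first_three (const int *src, int *dst)
{
  int regs[4][TOY_LANES];
  toy_ld4 (src, regs);
  toy_st3 (dst, regs);
}
```

Calling toy_copy_first_three on 16 consecutive ints writes 12 ints, skipping
every fourth input element, which is the copy performed by one iteration of
the ld4/st3 loop.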

Following r230993, instead the loop gets unrolled four times, four vectors are
loaded sequentially, and then permuted by SLP:

.L5:
ldr q0, [x5, 16]
add x4, x4, 48
ldr q1, [x5, 32]
add w6, w6, 1
ldr q4, [x5, 48]
cmp w3, w6
ldr q2, [x5], 64
orr v3.16b, v0.16b, v0.16b
orr v5.16b, v4.16b, v4.16b
orr v4.16b, v1.16b, v1.16b
tbl v0.16b, {v0.16b - v1.16b}, v6.16b
tbl v2.16b, {v2.16b - v3.16b}, v7.16b
tbl v4.16b, {v4.16b - v5.16b}, v16.16b
str q0, [x4, -32]
str q2, [x4, -48]
str q4, [x4, -16]
bhi .L5

that is, we load

struct1 struct2 struct3 struct4
v2.s[0] v0.s[0] v1.s[0] v4.s[0]
v2.s[1] v0.s[1] v1.s[1] v4.s[1]
v2.s[2] v0.s[2] v1.s[2] v4.s[2]
v2.s[3] v0.s[3] v1.s[3] v4.s[3]

and then permute

struct1 struct2 struct3 struct4
v2.s[0] v2.s[3] v0.s[2] v4.s[1]
v2.s[1] v0.s[0] v0.s[3] v4.s[2]
v2.s[2] v0.s[1] v4.s[0] v4.s[3]

so we then have the data 'columnwise' and store each sequentially.

[Bug tree-optimization/68681] New: testcase gcc.dg/vect/pr45752.c fails on AArch64

2015-12-03 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68681

Bug ID: 68681
   Summary: testcase gcc.dg/vect/pr45752.c fails on AArch64
   Product: gcc
   Version: 6.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: alalaw01 at gcc dot gnu.org
  Target Milestone: ---
Target: aarch64

Created attachment 36900
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36900&action=edit
tree-vect-details dump

Since r231015 (https://gcc.gnu.org/ml/gcc-patches/2015-11/msg03371.html), on
AArch64 we have

FAIL: gcc.dg/vect/pr45752.c scan-tree-dump-times vect "gaps requires scalar
epilogue loop" 0
FAIL: gcc.dg/vect/pr45752.c -flto -ffat-lto-objects  scan-tree-dump-times vect
"gaps requires scalar epilogue loop" 0

I attach -fdump-tree-vect-details from the non-lto case (line 5379:
gcc/testsuite/gcc.dg/vect/pr45752.c:45:3: note: Data access with gaps requires
scalar epilogue loop)

[Bug tree-optimization/68549] [6 Regression] ICE: in verify_loop_structure, at cfgloop.c:1669

2015-11-26 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68549

alalaw01 at gcc dot gnu.org changed:

   What|Removed |Added

 CC||alalaw01 at gcc dot gnu.org

--- Comment #8 from alalaw01 at gcc dot gnu.org ---
Here's another testcase, reduced from value.c in gdb - ICEs at -O2 on (at
least) x86_64 and AArch64:

typedef long unsigned int size_t;
extern void *xmalloc (size_t) __attribute__ ((__malloc__)) __attribute__
((__returns_nonnull__));
struct __jmp_buf_tag
  {
  };
extern int __sigsetjmp (struct __jmp_buf_tag __env[1], int __savemask)
__attribute__ ((__nothrow__));
typedef struct __jmp_buf_tag sigjmp_buf[1];
extern sigjmp_buf *exceptions_state_mc_init (void);
extern int exceptions_state_mc_action_iter (void);
extern void printf_unfiltered (const char *, ...)
;
extern struct gdbarch *get_current_arch (void);
struct internalvar
{
  struct internalvar *next;
};
static struct internalvar *internalvars;
struct internalvar *
create_internalvar (const char *name)
{
  struct internalvar *var = ((struct internalvar *) xmalloc (sizeof (struct
internalvar)));
  internalvars = var;
}
void
show_convenience ()
{
  struct gdbarch *gdbarch = get_current_arch ();
  int varseen = 0;
  for (struct internalvar *var = internalvars; var; var = var->next)
    {
      if (!varseen)
        varseen = 1;
      sigjmp_buf *buf = exceptions_state_mc_init ();
      __sigsetjmp ( (*buf), 1);
      while (exceptions_state_mc_action_iter ())
        while (exceptions_state_mc_action_iter ())
          ;
    }
  if (!varseen)
    printf_unfiltered ( "" );
}

[Bug c/68385] New: ICE building libstdc++ on arm-none-eabi

2015-11-17 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68385

Bug ID: 68385
   Summary: ICE building libstdc++ on arm-none-eabi
   Product: gcc
   Version: 6.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: alalaw01 at gcc dot gnu.org
  Target Milestone: ---
Target: arm-none-eabi

Created attachment 36738
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36738&action=edit
Reduced testcase

Starting with r230365, building gcc for arm-none-eabi falls over in libstdc++
with:

/work/alalaw01/build-arm-none-eabi/obj/gcc2/./gcc/xgcc -shared-libgcc
-B/work/alalaw01/build-arm-none-eabi/obj/gcc2/./gcc -nostdinc++
-L/work/alalaw01/build-arm-none-eabi/obj/gcc2/arm-none-eabi/libstdc++-v3/src
-L/work/alalaw01/build-arm-none-eabi/obj/gcc2/arm-none-eabi/libstdc++-v3/src/.libs
-L/work/alalaw01/build-arm-none-eabi/obj/gcc2/arm-none-eabi/libstdc++-v3/libsupc++/.libs
-B/work/alalaw01/build-arm-none-eabi/install/arm-none-eabi/bin/
-B/work/alalaw01/build-arm-none-eabi/install/arm-none-eabi/lib/ -isystem
/work/alalaw01/build-arm-none-eabi/install/arm-none-eabi/include -isystem
/work/alalaw01/build-arm-none-eabi/install/arm-none-eabi/sys-include
-I/work/alalaw01/src/gcc/libstdc++-v3/../libgcc
-I/work/alalaw01/build-arm-none-eabi/obj/gcc2/arm-none-eabi/libstdc++-v3/include/arm-none-eabi
-I/work/alalaw01/build-arm-none-eabi/obj/gcc2/arm-none-eabi/libstdc++-v3/include
-I/work/alalaw01/src/gcc/libstdc++-v3/libsupc++ -fno-implicit-templates -Wall
-Wextra -Wwrite-strings -Wcast-qual -Wabi -fdiagnostics-show-location=once
-ffunction-sections -fdata-sections -frandom-seed=eh_personality.lo -O2 -g -c
/work/alalaw01/src/gcc/libstdc++-v3/libsupc++/eh_personality.cc -o
eh_personality.o
/work/alalaw01/src/gcc/libstdc++-v3/libsupc++/eh_personality.cc: In function
'_Unwind_Reason_Code __cxxabiv1::__gxx_personality_v0(_Unwind_State,
_Unwind_Control_Block*, _Unwind_Context*)':
/work/alalaw01/src/gcc/libstdc++-v3/libsupc++/eh_personality.cc:394:26:
internal compiler error: tree check: expected integer_cst, have nop_expr in
decompose, at tree.h:5123
  UNWIND_STACK_REG))
  ^

0xf8d589 tree_check_failed(tree_node const*, char const*, int, char const*,
...)
/work/alalaw01/src/gcc/gcc/tree.c:9587
0x10df3fd tree_check
/work/alalaw01/src/gcc/gcc/tree.h:3212
0x10df3fd wi::int_traits::decompose(long*, unsigned int,
tree_node const*)
/work/alalaw01/src/gcc/gcc/tree.h:5123
0x10df3fd wide_int_ref_storage
/work/alalaw01/src/gcc/gcc/wide-int.h:936
0x10df3fd generic_wide_int
/work/alalaw01/src/gcc/gcc/wide-int.h:714
0x10df3fd generic_simplify_172
/work/alalaw01/build-arm-none-eabi/obj/gcc2/gcc/generic-match.c:6142
0x1113507 generic_simplify_EQ_EXPR
/work/alalaw01/build-arm-none-eabi/obj/gcc2/gcc/generic-match.c:22841
0x111d719 generic_simplify(unsigned int, tree_code, tree_node*, tree_node*,
tree_node*)
/work/alalaw01/build-arm-none-eabi/obj/gcc2/gcc/generic-match.c:25312
0xa182c8 fold_binary_loc(unsigned int, tree_code, tree_node*, tree_node*,
tree_node*)
/work/alalaw01/src/gcc/gcc/fold-const.c:9138
0xa227b2 fold_build2_stat_loc(unsigned int, tree_code, tree_node*, tree_node*,
tree_node*)
/work/alalaw01/src/gcc/gcc/fold-const.c:12333
0x10e00cd generic_simplify_46
/work/alalaw01/build-arm-none-eabi/obj/gcc2/gcc/generic-match.c:2014
0x1112b27 generic_simplify_EQ_EXPR
/work/alalaw01/build-arm-none-eabi/obj/gcc2/gcc/generic-match.c:22441
0x111d719 generic_simplify(unsigned int, tree_code, tree_node*, tree_node*,
tree_node*)
/work/alalaw01/build-arm-none-eabi/obj/gcc2/gcc/generic-match.c:25312
0xa182c8 fold_binary_loc(unsigned int, tree_code, tree_node*, tree_node*,
tree_node*)
/work/alalaw01/src/gcc/gcc/fold-const.c:9138
0xa3ec75 fold(tree_node*)
/work/alalaw01/src/gcc/gcc/fold-const.c:11973
0x5bdff3 build_new_op_1
/work/alalaw01/src/gcc/gcc/cp/call.c:5730
0x5be299 build_new_op(unsigned int, tree_code, int, tree_node*, tree_node*,
tree_node*, tree_node**, int)
/work/alalaw01/src/gcc/gcc/cp/call.c:5803
0x70f42f build_x_binary_op(unsigned int, tree_code, tree_node*, tree_code,
tree_node*, tree_code, tree_node**, int)
/work/alalaw01/src/gcc/gcc/cp/typeck.c:3828
0x6e3b39 cp_parser_binary_expression
/work/alalaw01/src/gcc/gcc/cp/parser.c:8621
0x6e3cdc cp_parser_assignment_expression
/work/alalaw01/src/gcc/gcc/cp/parser.c:8742
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See <http://gcc.gnu.org/bugs.html> for instructions.

Reduced testcase attached:

$ arm-none-eabi-gcc -c reduced.cc
reduced.cc: In function 'bool __gxx_personality_v0(_Unwind_State,
_Unwind_Control_Block*, _Unwind_Context*)':
re

[Bug tree-optimization/65963] Missed vectorization of loads strided with << when equivalent * succeeds

2015-11-06 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65963

--- Comment #4 from alalaw01 at gcc dot gnu.org ---
I confirm the testcase fails execution on armeb-none-eabi (also at -O0), but it
does so both with and without the patch to tree-scalar-evolution.c, which did
not change codegen (at -O2 -ftree-vectorize; the loop was not vectorized). So
this looks to be exposing a different, pre-existing, bug.

[Bug tree-optimization/65963] Missed vectorization of loads strided with << when equivalent * succeeds

2015-11-05 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65963

--- Comment #2 from alalaw01 at gcc dot gnu.org ---
Author: alalaw01
Date: Thu Nov  5 18:39:38 2015
New Revision: 229825

URL: https://gcc.gnu.org/viewcvs?rev=229825&root=gcc&view=rev
Log:
[PATCH] tree-scalar-evolution.c: Handle LSHIFT by constant

gcc/:

PR tree-optimization/65963
* tree-scalar-evolution.c (interpret_rhs_expr): Try to handle
LSHIFT_EXPRs as equivalent unsigned MULT_EXPRs.

gcc/testsuite/:

* gcc.dg/pr68112.c: New.
* gcc.dg/vect/vect-strided-shift-1.c: New.

Added:
trunk/gcc/testsuite/gcc.dg/pr68112.c
trunk/gcc/testsuite/gcc.dg/vect/vect-strided-shift-1.c
Modified:
trunk/gcc/ChangeLog
trunk/gcc/testsuite/ChangeLog
trunk/gcc/tree-scalar-evolution.c
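
The equivalence named in the log above (a left shift by a constant treated as
an unsigned multiplication) can be checked with a few lines of C. The function
names are hypothetical and this is not the code added to
tree-scalar-evolution.c; it only shows that, for unsigned operands and a shift
count below the type width, i << c and i * (1u << c) give the same index, so a
strided access written either way visits the same elements.

```c
/* Index computed with a left shift by a constant. */
static unsigned
toy_index_shift (unsigned i, unsigned c)
{
  return i << c;
}

/* The same index written as an unsigned multiplication by 2^c. */
static unsigned
toy_index_mult (unsigned i, unsigned c)
{
  return i * (1u << c);
}

/* Sum every (1 << c)-th element using either form of the index. */
static int
toy_strided_sum (const int *a, unsigned count, unsigned c, int use_shift)
{
  int sum = 0;
  for (unsigned i = 0; i < count; i++)
    sum += a[use_shift ? toy_index_shift (i, c) : toy_index_mult (i, c)];
  return sum;
}
```

Unsigned arithmetic wraps in C, so the two index functions agree even when the
result does not fit in the type; the sketch assumes c is smaller than the
width of unsigned.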

[Bug rtl-optimization/68182] ICE in reorder_basic_blocks_simple building libitm/beginend.cc

2015-11-02 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68182

--- Comment #1 from alalaw01 at gcc dot gnu.org ---
Created attachment 36636
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36636&action=edit
Preprocessed source (compressed)

[Bug rtl-optimization/68182] New: ICE in reorder_basic_blocks_simple building libitm/beginend.cc

2015-11-02 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68182

Bug ID: 68182
   Summary: ICE in reorder_basic_blocks_simple building
libitm/beginend.cc
   Product: gcc
   Version: 6.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: alalaw01 at gcc dot gnu.org
  Target Milestone: ---
  Host: x86_64
Target: x86_64

Preprocessed source attached; command-line

$ /work/alalaw01/build/./gcc/xg++ -B/work/alalaw01/build/./gcc/ -mrtm -O1 -g
-m32 -c temp.ii
/work/alalaw01/src/gcc/libitm/beginend.cc: In static member function ‘static
uint32_t GTM::gtm_thread::begin_transaction(uint32_t, const gtm_jmpbuf*)’:
/work/alalaw01/src/gcc/libitm/beginend.cc:400:1: internal compiler error: in
operator[], at vec.h:714
 }
 ^
0x1310783 vec::operator[](unsigned int)
/work/alalaw01/src/gcc/gcc/vec.h:714
0x1310783 reorder_basic_blocks_simple
/work/alalaw01/src/gcc/gcc/bb-reorder.c:2322
0x1310783 reorder_basic_blocks
/work/alalaw01/src/gcc/gcc/bb-reorder.c:2450
0x1310783 execute
/work/alalaw01/src/gcc/gcc/bb-reorder.c:2551

[Bug tree-optimization/56118] Piecewise vector / complex initialization from constants not combined

2015-11-02 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56118

alalaw01 at gcc dot gnu.org changed:

   What|Removed |Added

 CC||alalaw01 at gcc dot gnu.org

--- Comment #5 from alalaw01 at gcc dot gnu.org ---
*** Bug 68165 has been marked as a duplicate of this bug. ***

[Bug tree-optimization/68165] Not constant-folding setting vector element

2015-11-02 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68165

alalaw01 at gcc dot gnu.org changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |DUPLICATE

--- Comment #3 from alalaw01 at gcc dot gnu.org ---
Seems like a duplicate of 56118 to me.

*** This bug has been marked as a duplicate of bug 56118 ***

[Bug tree-optimization/68165] New: Not constant-folding setting vector element

2015-10-30 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68165

Bug ID: 68165
   Summary: Not constant-folding setting vector element
   Product: gcc
   Version: 6.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: alalaw01 at gcc dot gnu.org
  Target Milestone: ---

I believe these two C functions are equivalent:
  typedef float __attribute__((__vector_size__ (2 * sizeof(float)))) float32x2_t;

  float32x2_t
  test_cprop ()
  {
    float32x2_t vec = {0.0, 0.0};
    vec[0] = 3.14f;
    vec[1] = 2.71f;
    return vec * ((float32x2_t) { 1.5f, 4.5f });
  }

  float32x2_t
  test_cprop2 ()
  {
    float32x2_t vec = {3.14f, 2.71f};
    return vec * ((float32x2_t) { 1.5f, 4.5f });
  }

at -O3 -fdump-tree-optimized, on AArch64:
=
;; Function test_cprop (test_cprop, funcdef_no=0, decl_uid=2603, cgraph_uid=0,
symbol_order=0)

test_cprop ()
{
  float32x2_t vec;
  vector(2) float vec.0_5;
  float32x2_t _6;

  :
  vec = { 0.0, 0.0 };
  BIT_FIELD_REF  = 3.141049041748046875e+0;
  BIT_FIELD_REF  = 2.7103814697265625e+0;
  vec.0_5 = vec;
  _6 = vec.0_5 * { 1.5e+0, 4.5e+0 };
  vec ={v} {CLOBBER};
  return _6;

}



;; Function test_cprop2 (test_cprop2, funcdef_no=1, decl_uid=2607,
cgraph_uid=1, symbol_order=1)

test_cprop2 ()
{
  :
  return { 4.7103814697265625e+0, 1.219499969482421875e+1 };

}
=
x86 is identical for test_cprop2, worse in test_cprop:
=
test_cprop ()
{
  float32x2_t vec;
  vector(2) float vec.0_5;
  float32x2_t _6;
  float _8;
  float _9;
  float _10;
  float _11;

  :
  vec = { 0.0, 0.0 };
  BIT_FIELD_REF  = 3.141049041748046875e+0;
  BIT_FIELD_REF  = 2.7103814697265625e+0;
  vec.0_5 = vec;
  _8 = BIT_FIELD_REF ;
  _9 = _8 * 1.5e+0;
  _10 = BIT_FIELD_REF ;
  _11 = _10 * 4.5e+0;
  _6 = {_9, _11};
  vec ={v} {CLOBBER};
  return _6;

}
=
i.e. we are not understanding the result of assigning to the BIT_FIELD_REF on
the whole vector, although we can resolve individual elements:
  float32x2_t
  test_cprop3 ()
  {
    float32x2_t vec = {0.0, 0.0};
    vec[0] = 3.14f;
    vec[1] = 2.71f;
    return (float32x2_t) {vec[0], vec[1]} * ((float32x2_t) { 1.5f, 4.5f });
  }

produces
=
test_cprop3 ()
{
  :
  return { 4.7103814697265625e+0, 1.219499969482421875e+1 };

}


[Bug middle-end/68112] [6 Regression] FAIL: gcc.target/i386/avx512ifma-vpmaddhuq-2.c (test for excess errors)

2015-10-29 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68112

--- Comment #4 from alalaw01 at gcc dot gnu.org ---
Sure, but gcc exploits undefinedness of multiply, so rewriting shift to
multiply is not equivalent in the general case :(.

One way forward might be to make definedness of overflow a bit finer-grained
(either on types, i.e. TYPE_OVERFLOW_DEFINED, or maybe as a property of
chrecs?)
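
To make the distinction concrete: on unsigned operands, where wraparound is
defined, a left shift by a constant and a multiply by the matching power of two
always agree. The problem is only with signed multiplies, whose overflow is
undefined. A minimal sketch (function names are hypothetical, nothing here is
GCC-internal code):

```c
#include <assert.h>
#include <limits.h>

/* Left shift written directly.  Valid for k smaller than the bit width.  */
unsigned int
shift_left (unsigned int a, unsigned int k)
{
  return a << k;
}

/* The same value as an unsigned multiply; the product wraps modulo 2^N,
   which is exactly what the shift does when it drops the high bits.  */
unsigned int
shift_as_umul (unsigned int a, unsigned int k)
{
  return a * (1u << k);
}

/* Compare the two forms for every legal shift count, using an operand
   with all bits set so that every shift discards high bits.  */
void
check_shift_equivalence (void)
{
  unsigned int width = sizeof (unsigned int) * CHAR_BIT;
  for (unsigned int k = 0; k < width; k++)
    assert (shift_left (UINT_MAX, k) == shift_as_umul (UINT_MAX, k));
}
```

Writing the multiply in a signed type instead would make the overflowing cases
undefined, which is the non-equivalence described above.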


[Bug middle-end/68112] [6 Regression] FAIL: gcc.target/i386/avx512ifma-vpmaddhuq-2.c (test for excess errors)

2015-10-28 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68112

--- Comment #2 from alalaw01 at gcc dot gnu.org ---
So (a << CONSTANT) is not equivalent to a * (1<

[Bug tree-optimization/67683] Missed vectorization: shifts of an induction variable

2015-10-06 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67683

alalaw01 at gcc dot gnu.org changed:

   What|Removed |Added

   See Also||https://gcc.gnu.org/bugzilla/show_bug.cgi?id=35226

--- Comment #2 from alalaw01 at gcc dot gnu.org ---
Is there a way to do this kind of thing other than extending polynomial_chrec's
to understand operations other than addition ? Whilst beneficial, that looks to
be quite a large task.


[Bug tree-optimization/57558] Loop not vectorized if iteration count could be infinite

2015-09-25 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57558

--- Comment #4 from alalaw01 at gcc dot gnu.org ---
Here's another example, extracted from another benchmark - it vectorizes if
INDEX is defined to 'long' but not if INDEX is 'short':

#include <stdlib.h>

unsigned char *t_run_test(unsigned char *in, int N)
{
  unsigned char *out = malloc (N);

  for (unsigned INDEX i = 1; i < (N - 1); i++)
    out[i] = ((3 * in[i]) - in[i - 1] - in[i + 1]);

  return out;
}

However, -Wunsafe-loop-optimizations doesn't tell us anything useful here:

(successful case, warning printed)
$ aarch64-none-elf-gcc -O3 bmark2.c -DINDEX=long -S -Wunsafe-loop-optimizations
-fdump-tree-vect-details=stdout | grep vectorized
bmark2.c:7:3: note: === vect_mark_stmts_to_be_vectorized ===
bmark2.c:7:3: note: loop vectorized
bmark2.c:3:16: note: vectorized 1 loops in function.
bmark2.c: In function 't_run_test':
bmark2.c:3:16: warning: cannot optimize loop, the loop counter may overflow
[-Wunsafe-loop-optimizations]
 unsigned char *t_run_test(unsigned char *in, int N)

(unsuccessful case, no warning)
$ aarch64-none-elf-gcc -O3 bmark2.c -DINDEX=short -S
-Wunsafe-loop-optimizations -fdump-tree-vect-details=stdout | grep vectorized
bmark2.c:7:3: note: not vectorized: number of iterations cannot be computed.
bmark2.c:3:16: note: vectorized 0 loops in function.
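
Presumably the 'short' case fails because the counter can wrap before it
reaches the bound: the comparison is done in int, so if N - 1 exceeds
USHRT_MAX the loop never exits. A sketch of that arithmetic, assuming int is
wider than short (the helper names are hypothetical):

```c
#include <assert.h>
#include <limits.h>

/* One step of the counter: i + 1 is computed in int, then converted back
   to unsigned short, which wraps USHRT_MAX round to 0.  */
unsigned short
next_u16 (unsigned short i)
{
  return (unsigned short) (i + 1);
}

/* Nonzero if 'for (unsigned short i = 1; i < n - 1; i++)' terminates.
   The largest value i ever holds is USHRT_MAX, so the exit test can only
   become false when n - 1 <= USHRT_MAX.  */
int
u16_loop_terminates (int n)
{
  return (long long) n - 1 <= (long long) USHRT_MAX;
}

void
check_u16_wrap (void)
{
  assert (next_u16 (USHRT_MAX) == 0);
  assert (u16_loop_terminates ((int) USHRT_MAX + 1));
  assert (!u16_loop_terminates ((int) USHRT_MAX + 2));
}
```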


[Bug tree-optimization/67681] Missed vectorization: induction variable used after loop

2015-09-23 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67681

--- Comment #2 from alalaw01 at gcc dot gnu.org ---
Being stupid here, but why does the outer loop having multiple exits matter -
it's the inner loop that should be vectorized?

FOO was a macro used to selectively make the test i>max disappear (enabling
vectorization) - the two commandlines had -DFOO=0 (vectorizes) and -DFOO=1
(doesn't).


[Bug tree-optimization/67683] New: Missed vectorization: shifts of an induction variable

2015-09-22 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67683

Bug ID: 67683
   Summary: Missed vectorization: shifts of an induction variable
   Product: gcc
   Version: 6.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: alalaw01 at gcc dot gnu.org
Blocks: 53947
  Target Milestone: ---

This testcase:

void test (unsigned char *data, int max)
{
  unsigned short val = 0xcdef;
  for(int i = 0; i < max; i++) { 
  data[i] = (unsigned char)(val & 0xff);
  val >>= 1; 
  }
}

does not vectorize on AArch64 or x86_64 at -O3. (I haven't yet looked at
whether it's a mid-end deficiency or both back-ends are missing patterns.)
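
For illustration, the value stored at iteration i depends only on i, so the
loop-carried shift can be rewritten by hand as below. This is only a sketch of
what an analysis would need to derive, not a claim about how the vectorizer
should do it; the function names are hypothetical.

```c
#include <assert.h>

/* The loop from the testcase, unchanged in behaviour.  */
void
fill_serial (unsigned char *data, int max)
{
  unsigned short val = 0xcdef;
  for (int i = 0; i < max; i++)
    {
      data[i] = (unsigned char) (val & 0xff);
      val >>= 1;
    }
}

/* Same stores with no dependence between iterations.  The constant has
   16 significant bits, so it is zero once i reaches 16; the guard also
   keeps the shift count below the width of unsigned int.  */
void
fill_closed_form (unsigned char *data, int max)
{
  for (int i = 0; i < max; i++)
    data[i] = (unsigned char) (i < 16 ? (0xcdefu >> i) & 0xff : 0);
}

void
check_fill (void)
{
  unsigned char a[40], b[40];
  fill_serial (a, 40);
  fill_closed_form (b, 40);
  for (int i = 0; i < 40; i++)
    assert (a[i] == b[i]);
}
```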


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations


[Bug tree-optimization/67682] New: Missed vectorization: (another) straight-line memcpy/memset not vectorized when equivalent loop is

2015-09-22 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67682

Bug ID: 67682
   Summary: Missed vectorization: (another) straight-line
memcpy/memset not vectorized when equivalent loop is
   Product: gcc
   Version: 6.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: alalaw01 at gcc dot gnu.org
  Target Milestone: ---
Target: aarch64

This code:

void
test (int*__restrict a, int*__restrict b)
{
a[0] = b[0];
a[1] = b[1];
a[2] = b[2];
a[3] = b[3];
a[4] = 0;
a[5] = 0;
a[6] = 0;
a[7] = 0;
}

is not vectorized; -fdump-tree-slp-details reveals

test.c:4:13: note: Build SLP failed: different operation in stmt MEM[(int
*)a_4(D) + 28B] = 0;
test.c:4:13: note: original stmt *a_4(D) = _3;
test.c:4:13: note: === vect_slp_analyze_data_ref_dependences ===
test.c:4:13: note: === vect_slp_analyze_operations ===
test.c:4:13: note: not vectorized: bad operation in basic block.
test.c:4:13: note: * Re-trying analysis with vector size 8
...
test.c:4:13: note: Build SLP failed: different operation in stmt MEM[(int
*)a_4(D) + 28B] = 0;
test.c:4:13: note: original stmt *a_4(D) = _3;
test.c:4:13: note: === vect_slp_analyze_data_ref_dependences ===
test.c:4:13: note: === vect_slp_analyze_operations ===
test.c:4:13: note: not vectorized: bad operation in basic block.

(the failure with vector size 8 is expected, but vector size 4 should succeed)

Output is:
test:
ldp w4, w3, [x1]
ldp w2, w1, [x1, 8]
stp w4, w3, [x0]
stp w2, w1, [x0, 8]
stp wzr, wzr, [x0, 16]
stp wzr, wzr, [x0, 24]
ret

Curiously, similar code that writes elements a[0..3] and a[5..8] (skipping
a[4]) is SLP'd, producing superior code:

test:
ldr q0, [x1]
movi v1.4s, 0
str q1, [x0, 20]
str q0, [x0]
ret

And similarly for (equivalent to the first):

void
test (int*__restrict a, int*__restrict b)
{
  for (int i = 0; i < 4; i++)
    a[i] = b[i];
  for (int i = 4; i < 8; i++)
    a[i] = 0;
}

producing:

test:
movi v0.4s, 0
ldp x2, x3, [x1]
stp x2, x3, [x0]
str q0, [x0, 16]
ret
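
For reference, the "memcpy/memset" in the summary refers to the fact that the
eight stores are equivalent to one copy plus one clear. A sketch, with
hypothetical function names, checking that the two forms store the same values:

```c
#include <assert.h>
#include <string.h>

/* The straight-line form from the report.  */
void
test_straight_line (int *__restrict a, int *__restrict b)
{
  a[0] = b[0];
  a[1] = b[1];
  a[2] = b[2];
  a[3] = b[3];
  a[4] = 0;
  a[5] = 0;
  a[6] = 0;
  a[7] = 0;
}

/* The same stores expressed as a 4-element copy and a 4-element clear.  */
void
test_mem_calls (int *__restrict a, int *__restrict b)
{
  memcpy (a, b, 4 * sizeof (int));
  memset (a + 4, 0, 4 * sizeof (int));
}

void
check_same_stores (void)
{
  int src[4] = { 1, 2, 3, 4 };
  int x[8], y[8];
  memset (x, 0xff, sizeof x);
  memset (y, 0xff, sizeof y);
  test_straight_line (x, src);
  test_mem_calls (y, src);
  assert (memcmp (x, y, sizeof x) == 0);
}
```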


[Bug tree-optimization/67681] New: Missed vectorization: induction variable used after loop

2015-09-22 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67681

Bug ID: 67681
   Summary: Missed vectorization: induction variable used after
loop
   Product: gcc
   Version: 6.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: alalaw01 at gcc dot gnu.org
  Target Milestone: ---

The inner loop here:
void addlog2 (int *data)
{
  int i = 1;
  for (int j=0; j<=30; j++) {
    int max = 1 << j;
    if (FOO && i>max) break;

    for (; i <= max; i++)
      data[i] += j;
  }
}

does not vectorize if the if(FOO...) is present:
$ /work/alalaw01/build-aarch64-none-elf/install/bin/aarch64-none-elf-gcc -S -O2
-ftree-vectorize -fdump-tree-vect-details=stdout loop9b.c -DFOO=1 | grep
vectorized
loop9b.c:1:6: note: not vectorized: inner-loop count not invariant.
loop9b.c:8:5: note: === vect_mark_stmts_to_be_vectorized ===
loop9b.c:8:5: note: not vectorized: value used after loop.
loop9b.c:8:5: note: === vect_mark_stmts_to_be_vectorized ===
loop9b.c:8:5: note: not vectorized: value used after loop.
loop9b.c:1:6: note: vectorized 0 loops in function.


$ aarch64-none-elf-gcc -S -O2 -ftree-vectorize -fdump-tree-vect-details=stdout
loop9b.c -DFOO=0 | grep vectorized
loop9b.c:4:3: note: not vectorized: inner-loop count not invariant.
loop9b.c:8:5: note: === vect_mark_stmts_to_be_vectorized ===
loop9b.c:8:5: note: loop vectorized
loop9b.c:1:6: note: vectorized 1 loops in function.

Same with -O3. Of course clever analysis could figure out that i>max is never
true, but even without that, we should be able to get 'i' back afterwards.
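
The claim that i>max is never true can be checked by tracking only i and max,
without touching data. A sketch assuming 32-bit int; the function name is
hypothetical and the inner loop is replaced by its net effect on i:

```c
#include <assert.h>

/* Count how often the early-exit test 'i > max' would fire for
   j = 0 .. last_j.  Rather than breaking, keep going and count.  */
int
count_early_breaks (int last_j)
{
  int i = 1;
  int breaks = 0;
  for (int j = 0; j <= last_j; j++)
    {
      int max = 1 << j;
      if (i > max)
        breaks++;
      /* 'for (; i <= max; i++)' leaves i == max + 1 whenever it runs
         at least once, and leaves i untouched otherwise.  */
      if (i <= max)
        i = max + 1;
    }
  return breaks;
}

void
check_no_early_break (void)
{
  assert (count_early_breaks (30) == 0);
}
```

On entry to iteration j >= 1, i is 2^(j-1) + 1, which never exceeds 2^j, so
the count stays at zero.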


[Bug middle-end/65965] Straight-line memcpy/memset not vectorized when equivalent loop is

2015-09-22 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65965

--- Comment #4 from alalaw01 at gcc dot gnu.org ---
(In reply to Richard Biener from comment #3)
> Fixed for GCC 6.

Indeed. I note that the same testcase does _not_ SLP/vectorize if I use
consecutive indices:

void
test (int*__restrict a, int*__restrict b)
{
a[0] = b[0];
a[1] = b[1];
a[2] = b[2];
a[3] = b[3];
a[4] = 0;
a[5] = 0;
a[6] = 0;
a[7] = 0;
}

loop26a.c:6:13: note: Build SLP failed: different operation in stmt MEM[(int
*)a_4(D) + 28B] = 0;
loop26a.c:6:13: note: original stmt *a_4(D) = _3;
loop26a.c:6:13: note: === vect_slp_analyze_data_ref_dependences ===
loop26a.c:6:13: note: === vect_slp_analyze_operations ===
loop26a.c:6:13: note: not vectorized: bad operation in basic block.

Worth another bug?


[Bug tree-optimization/67283] GCC regression over inlining of returned structures

2015-09-18 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67283

--- Comment #13 from alalaw01 at gcc dot gnu.org ---
Author: alalaw01
Date: Fri Sep 18 10:55:11 2015
New Revision: 227901

URL: https://gcc.gnu.org/viewcvs?rev=227901&root=gcc&view=rev
Log:
completely_scalarize arrays as well as records.

gcc/:

PR tree-optimization/67283
* tree-sra.c (type_consists_of_records_p): Rename to...
(scalarizable_type_p): ...this, add case for ARRAY_TYPE.
(completely_scalarize_record): Rename to...
(completely_scalarize): ...this, add ARRAY_TYPE case, move some code
to:
(scalarize_elem): New.
(analyze_all_variable_accesses): Follow renamings.

gcc/testsuite/:

* gcc.dg/tree-ssa/sra-15.c: New.
* gcc.dg/tree-ssa/sra-16.c: New.


Added:
trunk/gcc/testsuite/gcc.dg/tree-ssa/sra-15.c
trunk/gcc/testsuite/gcc.dg/tree-ssa/sra-16.c
Modified:
trunk/gcc/ChangeLog
trunk/gcc/testsuite/ChangeLog
trunk/gcc/tree-sra.c


[Bug target/63870] [Aarch64] [ARM] Errors in use of NEON intrinsics are reported incorrectly

2015-09-08 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63870

--- Comment #10 from alalaw01 at gcc dot gnu.org ---
Author: alalaw01
Date: Tue Sep  8 19:43:39 2015
New Revision: 227557

URL: https://gcc.gnu.org/viewcvs?rev=227557&root=gcc&view=rev
Log:
[ARM/AArch64 Testsuite] Add float16 lane_f16_indices tests

PR target/63870
* gcc.target/aarch64/advsimd-intrinsics/vld2_lane_f16_indices_1.c: New.
* gcc.target/aarch64/advsimd-intrinsics/vld2q_lane_f16_indices_1.c:
New.
* gcc.target/aarch64/advsimd-intrinsics/vld3_lane_f16_indices_1.c: New.
* gcc.target/aarch64/advsimd-intrinsics/vld3q_lane_f16_indices_1.c:
New.
* gcc.target/aarch64/advsimd-intrinsics/vld4_lane_f16_indices_1.c: New.
* gcc.target/aarch64/advsimd-intrinsics/vld4q_lane_f16_indices_1.c:
New.
* gcc.target/aarch64/advsimd-intrinsics/vst2_lane_f16_indices_1.c: New.
* gcc.target/aarch64/advsimd-intrinsics/vst2q_lane_f16_indices_1.c:
New.
* gcc.target/aarch64/advsimd-intrinsics/vst3_lane_f16_indices_1.c: New.
* gcc.target/aarch64/advsimd-intrinsics/vst3q_lane_f16_indices_1.c:
New.
* gcc.target/aarch64/advsimd-intrinsics/vst4_lane_f16_indices_1.c: New.
* gcc.target/aarch64/advsimd-intrinsics/vst4q_lane_f16_indices_1.c:
New.

Added:
   
trunk/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vld2_lane_f16_indices_1.c
   
trunk/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vld2q_lane_f16_indices_1.c
   
trunk/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vld3_lane_f16_indices_1.c
   
trunk/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vld3q_lane_f16_indices_1.c
   
trunk/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vld4_lane_f16_indices_1.c
   
trunk/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vld4q_lane_f16_indices_1.c
   
trunk/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vst2_lane_f16_indices_1.c
   
trunk/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vst2q_lane_f16_indices_1.c
   
trunk/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vst3_lane_f16_indices_1.c
   
trunk/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vst3q_lane_f16_indices_1.c
   
trunk/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vst4_lane_f16_indices_1.c
   
trunk/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vst4q_lane_f16_indices_1.c
Modified:
trunk/gcc/testsuite/ChangeLog


[Bug target/67439] ICE: unrecognizable insn compiling arm-fp16 testcases with -march=armv7-a and -mrestrict-it

2015-09-03 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67439

alalaw01 at gcc dot gnu.org changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
   Last reconfirmed||2015-09-03
 CC||alalaw01 at gcc dot gnu.org
 Ever confirmed|0   |1

--- Comment #2 from alalaw01 at gcc dot gnu.org ---
I can reproduce the ICE with -mthumb, both "-mfloat-abi=hard -mfpu=neon" and
"-mfloat-abi=soft", but only with -mrestrict-it in both cases.
"-mfloat-abi=hard -mfpu=neon-fp16" is OK with and without -mrestrict-it.

I note the movhf patterns in vfp.md are only usable with neon-fp16; in other
cases, we appear to be using arm32_movhf in arm.md.


[Bug tree-optimization/67283] GCC regression over inlining of returned structures

2015-08-28 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67283

--- Comment #12 from alalaw01 at gcc dot gnu.org ---
Author: alalaw01
Date: Fri Aug 28 15:04:17 2015
New Revision: 227303

URL: https://gcc.gnu.org/viewcvs?rev=227303&root=gcc&view=rev
Log:
Revert: completely_scalarize arrays as well as records

gcc/:
Revert:
2015-08-27  Alan Lawrence  
PR tree-optimization/67283
* tree-sra.c (type_consists_of_records_p): Rename to...
(scalarizable_type_p): ...this, add case for ARRAY_TYPE.

(completely_scalarize_record): Rename to...
(completely_scalarize): ...this, add ARRAY_TYPE case, move some
 code to:
(scalarize_elem): New.

gcc/testsuite/:

Revert:
2015-08-27  Alan Lawrence  
* gcc.dg/tree-ssa/sra-15.c: New.

Removed:
trunk/gcc/testsuite/gcc.dg/tree-ssa/sra-15.c
Modified:
trunk/gcc/ChangeLog
trunk/gcc/testsuite/ChangeLog
trunk/gcc/tree-sra.c


[Bug tree-optimization/67283] GCC regression over inlining of returned structures

2015-08-27 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67283

alalaw01 at gcc dot gnu.org changed:

   What|Removed |Added

 CC||alalaw01 at gcc dot gnu.org

--- Comment #8 from alalaw01 at gcc dot gnu.org ---
I believe this should now be fixed. Do we want a testcase, and if so is there a
good way to scan for the stack usage pattern (as observed in the assembler)?
One can scan-assembler times for addq.*%rsp, but fixing the constant 72 seems
rather fragile, and I don't see a dejagnu way to scan for the constant being
the same in each demoN()...

And the case of unions is still not handled!!


[Bug tree-optimization/67283] GCC regression over inlining of returned structures

2015-08-27 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67283

--- Comment #7 from alalaw01 at gcc dot gnu.org ---
Author: alalaw01
Date: Thu Aug 27 15:40:10 2015
New Revision: 227265

URL: https://gcc.gnu.org/viewcvs?rev=227265&root=gcc&view=rev
Log:
completely_scalarize arrays as well as records

gcc/:

PR tree-optimization/67283
* tree-sra.c (type_consists_of_records_p): Rename to...
(scalarizable_type_p): ...this, add case for ARRAY_TYPE.

(completely_scalarize_record): Rename to...
(completely_scalarize): ...this, add ARRAY_TYPE case, move some code
to:
(scalarize_elem): New.

gcc/testsuite/:

* gcc.dg/tree-ssa/sra-15.c: New.

Added:
trunk/gcc/testsuite/gcc.dg/tree-ssa/sra-15.c
Modified:
trunk/gcc/ChangeLog
trunk/gcc/testsuite/ChangeLog
trunk/gcc/tree-sra.c


[Bug target/63679] [5/6 Regression][AArch64] Failure to constant fold.

2015-08-03 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63679

--- Comment #37 from alalaw01 at gcc dot gnu.org ---
Hmmm, no it's not the hashing - that pretty much ignores all types. It's the
comparison in hashable_expr_equal_p, which just uses operand_equal_p,
specifically this part (in fold-const.c):

case MEM_REF:
  /* Require equal access sizes, and similar pointer types.
 We can have incomplete types for array references of
 variable-sized arrays from the Fortran frontend
 though.  Also verify the types are compatible.  */
  if (!((TYPE_SIZE (TREE_TYPE (arg0)) == TYPE_SIZE (TREE_TYPE (arg1))
   || (TYPE_SIZE (TREE_TYPE (arg0))
   && TYPE_SIZE (TREE_TYPE (arg1))
   && operand_equal_p (TYPE_SIZE (TREE_TYPE (arg0)),
   TYPE_SIZE (TREE_TYPE (arg1)),
flags)))
  && types_compatible_p (TREE_TYPE (arg0), TREE_TYPE (arg1))
  && ((flags & OEP_ADDRESS_OF)
  || (alias_ptr_types_compatible_p
(TREE_TYPE (TREE_OPERAND (arg0, 1)),
 TREE_TYPE (TREE_OPERAND (arg1, 1)))
  && (MR_DEPENDENCE_CLIQUE (arg0)
  == MR_DEPENDENCE_CLIQUE (arg1))
  && (MR_DEPENDENCE_BASE (arg0)
  == MR_DEPENDENCE_BASE (arg1))
  && (TYPE_ALIGN (TREE_TYPE (arg0))
== TYPE_ALIGN (TREE_TYPE (arg1)))

specifically, a pointer to int, and a pointer to an array of int, are not
alias_ptr_types_compatible_p. (I'm not clear that they should be, either!?)
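
As a C-level analogy only (this is not GIMPLE and does not call any GCC
internals), the two MEM_REFs correspond to two differently typed pointers to
the same location. The function name is hypothetical:

```c
#include <assert.h>

/* Returns nonzero if a pointer to the first element and a pointer to the
   whole array refer to the same address.  The pointer types differ
   (int * versus int (*)[8]) even though the address is identical.  */
int
same_location (void)
{
  int a[8] = { 0 };
  int *as_int = a;           /* analogous to MEM[(int *)&a] */
  int (*as_array)[8] = &a;   /* analogous to MEM[(int[8] *)&a] */
  return (void *) as_int == (void *) as_array;
}

void
check_same_location (void)
{
  assert (same_location ());
}
```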


[Bug target/63679] [5/6 Regression][AArch64] Failure to constant fold.

2015-07-29 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63679

--- Comment #35 from alalaw01 at gcc dot gnu.org ---
So it should be happening in dom2. On x86, input to dom2 is

  vect_cst_.9_31 = { 0, 1, 2, 3 };
[...]MEM[(int *)&a] = vect_cst_.9_31;
[...]vect__13.3_20 = MEM[(int *)&a];

resulting in:

Optimizing statement vect_cst_.9_31 = { 0, 1, 2, 3 };
LKUP STMT vect_cst_.9_31 = { 0, 1, 2, 3 }
 ASGN vect_cst_.9_31 = { 0, 1, 2, 3 }
...
Optimizing statement MEM[(int *)&a] = vect_cst_.9_31;
  Replaced 'vect_cst_.9_31' with constant '{ 0, 1, 2, 3 }'
LKUP STMT MEM[(int *)&a] = { 0, 1, 2, 3 } with .MEM_3(D)
LKUP STMT { 0, 1, 2, 3 } = MEM[(int *)&a] with .MEM_3(D)
LKUP STMT { 0, 1, 2, 3 } = MEM[(int *)&a] with .MEM_17
2>>> STMT { 0, 1, 2, 3 } = MEM[(int *)&a] with .MEM_17
...
Optimizing statement vect__13.3_20 = MEM[(int *)&a];
LKUP STMT vect__13.3_20 = MEM[(int *)&a] with .MEM_21
FIND: { 0, 1, 2, 3 }
  Replaced redundant expr 'MEM[(int *)&a]' with '{ 0, 1, 2, 3 }'

My version has input to dom2:

  vect_cst_.8_27 = { 0, 1, 2, 3 };
[...]MEM[(int[8] *)&a] = vect_cst_.8_27;
[...]vect__8.3_20 = MEM[(int *)&a];

Optimizing statement vect_cst_.8_27 = { 0, 1, 2, 3 };
LKUP STMT vect_cst_.8_27 = { 0, 1, 2, 3 }
 ASGN vect_cst_.8_27 = { 0, 1, 2, 3 }
...
Optimizing statement MEM[(int[8] *)&a] = vect_cst_.8_27;
  Replaced 'vect_cst_.8_27' with constant '{ 0, 1, 2, 3 }'
LKUP STMT MEM[(int[8] *)&a] = { 0, 1, 2, 3 } with .MEM_3(D)
LKUP STMT { 0, 1, 2, 3 } = MEM[(int[8] *)&a] with .MEM_3(D)
LKUP STMT { 0, 1, 2, 3 } = MEM[(int[8] *)&a] with .MEM_17
2>>> STMT { 0, 1, 2, 3 } = MEM[(int[8] *)&a] with .MEM_17
...
Optimizing statement vect__8.3_20 = MEM[(int *)&a];
LKUP STMT vect__8.3_20 = MEM[(int *)&a] with .MEM_21
2>>> STMT vect__8.3_20 = MEM[(int *)&a] with .MEM_21

Which looks like MEM[(int *)&a] and MEM[(int[8] *)&a] are hashing differently
and hence dom2 is not finding it.

Could be that I need my SRA to output something closer to
  a[1] = 1;
where I currently have
  MEM[(int[8] *)&a + 4B] = 1;
but also feel that those two statements hashing differently is not really
helpful!


[Bug target/63679] [5/6 Regression][AArch64] Failure to constant fold.

2015-07-28 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63679

alalaw01 at gcc dot gnu.org changed:

   What|Removed |Added

 CC||alalaw01 at gcc dot gnu.org

--- Comment #32 from alalaw01 at gcc dot gnu.org ---
Is the SRA approach going to work? I have hacked up my SRA so that it generates
this:

foo ()
{
  int sum;
  int i;
  const int a[8];
  unsigned int i.0_7;
  int _8;
  unsigned int i.0_19;

  :
  MEM[(int[8] *)&a] = 0;
  MEM[(int[8] *)&a + 4B] = 1;
  MEM[(int[8] *)&a + 8B] = 2;
  MEM[(int[8] *)&a + 12B] = 3;
  MEM[(int[8] *)&a + 16B] = 4;
  MEM[(int[8] *)&a + 20B] = 5;
  MEM[(int[8] *)&a + 24B] = 6;
  MEM[(int[8] *)&a + 28B] = 7;
  i.0_19 = 0;
  if (i.0_19 != 8)
goto ;
  else
goto ;

  :
  # i_20 = PHI 
  # sum_21 = PHI 
  _8 = a[i_20];
  sum_9 = sum_21 + _8;
  i_10 = i_20 + 1;
  i.0_7 = (unsigned int) i_10;
  if (i.0_7 != 8)
goto ;
  else
goto ;

  :
  # sum_22 = PHI 
  a ={v} {CLOBBER};
  return sum_22;
}

the vectorizer then transforms to:
...
  :
  MEM[(int[8] *)&a] = 0;
  MEM[(int[8] *)&a + 4B] = 1;
  MEM[(int[8] *)&a + 8B] = 2;
  MEM[(int[8] *)&a + 12B] = 3;
  MEM[(int[8] *)&a + 16B] = 4;
  MEM[(int[8] *)&a + 20B] = 5;
  MEM[(int[8] *)&a + 24B] = 6;
  MEM[(int[8] *)&a + 28B] = 7;

  :
  # i_20 = PHI <0(2), i_10(4)>
  # sum_21 = PHI <0(2), sum_9(4)>
  # ivtmp_19 = PHI <8(2), ivtmp_22(4)>
  # vectp_a.1_1 = PHI <&a(2), vectp_a.1_2(4)>
  # vect_sum_9.4_17 = PHI <{ 0, 0, 0, 0 }(2), vect_sum_9.4_23(4)>
  # ivtmp_27 = PHI <0(2), ivtmp_28(4)>
  vect__8.3_18 = MEM[(int *)vectp_a.1_1];
  _8 = a[i_20];
  vect_sum_9.4_23 = vect__8.3_18 + vect_sum_9.4_17;
  sum_9 = _8 + sum_21;
  i_10 = i_20 + 1;
  ivtmp_22 = ivtmp_19 - 1;
  vectp_a.1_2 = vectp_a.1_1 + 16;
  ivtmp_28 = ivtmp_27 + 1;
  if (ivtmp_28 < 2)
goto ;
  else
goto ;

  :
  goto ;

  :
  # sum_7 = PHI 
  # vect_sum_9.4_24 = PHI 
  stmp_sum_9.5_25 = [reduc_plus_expr] vect_sum_9.4_24;
  vect_sum_9.6_26 = stmp_sum_9.5_25 + 0;
  a ={v} {CLOBBER};
  return vect_sum_9.6_26;

}

and the optimized tree is:

foo ()
{
  int vect_sum_9.6;
  int stmp_sum_9.5;
  vector(4) int vect_sum_9.4;
  const vector(4) int vect__8.3;
  const int a[8];

  :
  MEM[(int[8] *)&a] = { 0, 1, 2, 3 };
  MEM[(int[8] *)&a + 16B] = { 4, 5, 6, 7 };
  vect__8.3_20 = MEM[(int *)&a];
  vect__8.3_18 = MEM[(int *)&a + 16B];
  vect_sum_9.4_23 = vect__8.3_18 + vect__8.3_20;
  stmp_sum_9.5_25 = [reduc_plus_expr] vect_sum_9.4_23;
  vect_sum_9.6_26 = stmp_sum_9.5_25;
  a ={v} {CLOBBER};
  return vect_sum_9.6_26;
}

final assembly is:
ldr q1, .LC1
sub sp, sp, #32
ldr q0, .LC2
add sp, sp, 32
add v0.4s, v0.4s, v1.4s
addv s0, v0.4s
umov w0, v0.s[0]
ret
which is a slight improvement, but not really what we are looking for...
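
For reference, a C source consistent with the dumps above. This is a
reconstruction from the GIMPLE, not necessarily the exact testcase from the PR.
Every input is a compile-time constant, so presumably the goal is for the whole
function to fold to a single constant return:

```c
#include <assert.h>

/* Sum a constant-initialized local array.  */
int
foo (void)
{
  const int a[8] = { 0, 1, 2, 3, 4, 5, 6, 7 };
  int sum = 0;
  for (int i = 0; i < 8; i++)
    sum += a[i];
  return sum;
}

void
check_foo (void)
{
  /* 0 + 1 + ... + 7 */
  assert (foo () == 28);
}
```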


[Bug target/66964] Assembler error during ARM cross compile

2015-07-23 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66964

--- Comment #7 from alalaw01 at gcc dot gnu.org ---
No new regressions bootstrapping that path on gcc-5-branch (--with-arch=armv7-a
--with-fpu=neon-fp16 --with-float=hard). However, compiling the testcase with
-dp reveals the bad strd's are actually coming from the *movdf_vfp pattern in
vfp.md.


[Bug target/66964] Assembler error during ARM cross compile

2015-07-22 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66964

--- Comment #6 from alalaw01 at gcc dot gnu.org ---
Bootstrap+test in progress FYI. However, that patch *does not* fix this
failure; there must be some other route.


[Bug target/66791] New: Replace builtins with gcc vector extensions code

2015-07-07 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66791

Bug ID: 66791
   Summary: Replace builtins with gcc vector extensions code
   Product: gcc
   Version: 6.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: alalaw01 at gcc dot gnu.org
Blocks: 47562
  Target Milestone: ---
Target: arm

Lots of ARM neon intrinsics are implemented using builtins backing onto
patterns in neon.md. These are opaque to the midend, but we could rewrite them
using equivalent gcc vector operations, that would be transparent to the midend
but would still eventually be turned into the same instructions. This would
enable more optimization in the midend.

Many of the AArch64 intrinsics have been implemented in this way so AArch64
arm_neon.h may provide some useful templates.
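
A minimal sketch of the idea using the generic GCC vector extensions. The type
and function names are hypothetical stand-ins, not the real arm_neon.h
definitions, and 32-bit int is assumed:

```c
#include <assert.h>

/* Four ints in one vector, declared with the vector_size attribute.  */
typedef int v4si_sketch __attribute__ ((vector_size (4 * sizeof (int))));

/* An elementwise add written as a plain vector expression.  The midend
   sees an ordinary addition here, rather than an opaque builtin call.  */
v4si_sketch
add_v4si_sketch (v4si_sketch a, v4si_sketch b)
{
  return a + b;
}

void
check_add_v4si_sketch (void)
{
  v4si_sketch x = { 1, 2, 3, 4 };
  v4si_sketch y = { 10, 20, 30, 40 };
  v4si_sketch r = add_v4si_sketch (x, y);
  assert (r[0] == 11 && r[1] == 22 && r[2] == 33 && r[3] == 44);
}
```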


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=47562
[Bug 47562] [meta-bug] keep track of Neon Intrinsics enhancements


[Bug target/65956] [5/6 Regression] Another ARM overaligned arg passing issue

2015-07-06 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65956

--- Comment #6 from alalaw01 at gcc dot gnu.org ---
Author: alalaw01
Date: Mon Jul  6 17:37:50 2015
New Revision: 225470

URL: https://gcc.gnu.org/viewcvs?rev=225470&root=gcc&view=rev
Log:
Backport r225466: tests from 'Fix eipa_src AAPCS issue (PR target/65956)'

2015-05-05  Jakub Jelinek  

PR target/65956
* gcc.c-torture/execute/pr65956.c: New test.

Added:
branches/gcc-5-branch/gcc/testsuite/gcc.c-torture/execute/pr65956.c
Modified:
branches/gcc-5-branch/gcc/testsuite/ChangeLog


[Bug target/65956] [5/6 Regression] Another ARM overaligned arg passing issue

2015-07-06 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65956

--- Comment #5 from alalaw01 at gcc dot gnu.org ---
Author: alalaw01
Date: Mon Jul  6 17:32:07 2015
New Revision: 225469

URL: https://gcc.gnu.org/viewcvs?rev=225469&root=gcc&view=rev
Log:
2015-07-06  Alan Lawrence  

Backport from mainline r225465
2015-07-06  Alan Lawrence  

gcc/:

PR target/65956
* config/arm/arm.c (arm_needs_doubleword_align): Drop any outer
alignment attribute, exploring one level down for records and arrays.

gcc/testsuite/:

* gcc.target/arm/aapcs/align1.c: New.
* gcc.target/arm/aapcs/align_rec1.c: New.
* gcc.target/arm/aapcs/align2.c: New.
* gcc.target/arm/aapcs/align_rec2.c: New.
* gcc.target/arm/aapcs/align3.c: New.
* gcc.target/arm/aapcs/align_rec3.c: New.
* gcc.target/arm/aapcs/align4.c: New.
* gcc.target/arm/aapcs/align_rec4.c: New.
* gcc.target/arm/aapcs/align_vararg1.c: New.
* gcc.target/arm/aapcs/align_vararg2.c: New.


Added:
branches/gcc-5-branch/gcc/testsuite/gcc.target/arm/aapcs/align1.c
branches/gcc-5-branch/gcc/testsuite/gcc.target/arm/aapcs/align2.c
branches/gcc-5-branch/gcc/testsuite/gcc.target/arm/aapcs/align3.c
branches/gcc-5-branch/gcc/testsuite/gcc.target/arm/aapcs/align4.c
branches/gcc-5-branch/gcc/testsuite/gcc.target/arm/aapcs/align_rec1.c
branches/gcc-5-branch/gcc/testsuite/gcc.target/arm/aapcs/align_rec2.c
branches/gcc-5-branch/gcc/testsuite/gcc.target/arm/aapcs/align_rec3.c
branches/gcc-5-branch/gcc/testsuite/gcc.target/arm/aapcs/align_rec4.c
branches/gcc-5-branch/gcc/testsuite/gcc.target/arm/aapcs/align_vaarg1.c
branches/gcc-5-branch/gcc/testsuite/gcc.target/arm/aapcs/align_vaarg2.c
Modified:
branches/gcc-5-branch/gcc/ChangeLog
branches/gcc-5-branch/gcc/config/arm/arm.c
branches/gcc-5-branch/gcc/testsuite/ChangeLog


[Bug target/65956] [5/6 Regression] Another ARM overaligned arg passing issue

2015-07-06 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65956

--- Comment #4 from alalaw01 at gcc dot gnu.org ---
Author: alalaw01
Date: Mon Jul  6 17:06:00 2015
New Revision: 225466

URL: https://gcc.gnu.org/viewcvs?rev=225466&root=gcc&view=rev
Log:
Fix eipa_src AAPCS issue (PR target/65956)

2015-05-05  Jakub Jelinek  

PR target/65956
* gcc.c-torture/execute/pr65956.c: New test.


Added:
trunk/gcc/testsuite/gcc.c-torture/execute/pr65956.c
Modified:
trunk/gcc/testsuite/ChangeLog


[Bug target/65956] [5/6 Regression] Another ARM overaligned arg passing issue

2015-07-06 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65956

--- Comment #3 from alalaw01 at gcc dot gnu.org ---
Author: alalaw01
Date: Mon Jul  6 16:58:16 2015
New Revision: 225465

URL: https://gcc.gnu.org/viewcvs?rev=225465&root=gcc&view=rev
Log:
[ARM] PR/65956 AAPCS update for alignment attribute

gcc/:
PR target/65956
* config/arm/arm.c (arm_needs_doubleword_align): Drop any outer
alignment attribute, exploring one level down for records and arrays.

gcc/testsuite/:

* gcc.target/arm/aapcs/align1.c: New.
* gcc.target/arm/aapcs/align_rec1.c: New.
* gcc.target/arm/aapcs/align2.c: New.
* gcc.target/arm/aapcs/align_rec2.c: New.
* gcc.target/arm/aapcs/align3.c: New.
* gcc.target/arm/aapcs/align_rec3.c: New.
* gcc.target/arm/aapcs/align4.c: New.
* gcc.target/arm/aapcs/align_rec4.c: New.
* gcc.target/arm/aapcs/align_vararg1.c: New.
* gcc.target/arm/aapcs/align_vararg2.c: New.


Added:
trunk/gcc/testsuite/gcc.target/arm/aapcs/align1.c
trunk/gcc/testsuite/gcc.target/arm/aapcs/align2.c
trunk/gcc/testsuite/gcc.target/arm/aapcs/align3.c
trunk/gcc/testsuite/gcc.target/arm/aapcs/align4.c
trunk/gcc/testsuite/gcc.target/arm/aapcs/align_rec1.c
trunk/gcc/testsuite/gcc.target/arm/aapcs/align_rec2.c
trunk/gcc/testsuite/gcc.target/arm/aapcs/align_rec3.c
trunk/gcc/testsuite/gcc.target/arm/aapcs/align_rec4.c
trunk/gcc/testsuite/gcc.target/arm/aapcs/align_vaarg1.c
trunk/gcc/testsuite/gcc.target/arm/aapcs/align_vaarg2.c
Modified:
trunk/gcc/ChangeLog
trunk/gcc/config/arm/arm.c
trunk/gcc/testsuite/ChangeLog
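
As a rough illustration of the distinction the ChangeLog entry draws (an alignment attribute on the outermost type versus alignment found "one level down" in a record), the following sketch declares the kinds of types involved. The type and function names are hypothetical and are not taken from the new tests, and the sketch only reports what the compiler says about alignment; it does not model the AAPCS argument-passing rules themselves.

```c
/* Over-aligned scalar: the attribute sits on the outermost type.
   (Hypothetical name, for illustration only.)  */
typedef int overaligned_int __attribute__ ((aligned (8)));

/* Record whose member carries the alignment, i.e. one level down.  */
struct rec_aligned_member { int a; overaligned_int b; };

/* Record with only naturally aligned members.  */
struct rec_plain { int a; int b; };

/* Helpers reporting the alignment the compiler assigns to each type.  */
unsigned outer_alignment (void)  { return _Alignof (overaligned_int); }
unsigned member_alignment (void) { return _Alignof (struct rec_aligned_member); }
unsigned plain_alignment (void)  { return _Alignof (struct rec_plain); }
```

Going by the ChangeLog wording, the first case is the one where the outer attribute would be dropped, while the second is where the record's contents are examined.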


[Bug middle-end/65946] Simple loop with if-statement not vectorized

2015-07-02 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65946

alalaw01 at gcc dot gnu.org changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #2 from alalaw01 at gcc dot gnu.org ---
Author: alalaw01
Date: Thu Jul 2 12:47:31 2015
New Revision: 225311

URL: https://gcc.gnu.org/viewcvs?rev=225311&root=gcc&view=rev
Log:
gcc/:

* tree-pass.h (make_pass_ch_vect): New.
* passes.def: Add pass_ch_vect just before pass_if_conversion.

* tree-ssa-loop-ch.c (ch_base, pass_ch_vect, pass_data_ch_vect,
pass_ch::process_loop_p, pass_ch_vect::process_loop_p,
make_pass_ch_vect): New.
(pass_ch): Extend ch_base.

(pass_ch::execute): Move all but loop_optimizer_init/finalize to...
(ch_base::copy_headers): ...here.

gcc/testsuite/:

* gcc.dg/vect/vect-strided-a-u16-i4.c (main1): Narrow scope of x,y,z,w.
* gcc.dg/vect/vect-ifcvt-11.c: New testcase.
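
For readers unfamiliar with loop header copying (the transformation performed by tree-ssa-loop-ch.c), here is a hand-written sketch of its effect on a simple loop containing an if-statement. The function names and the loop body are made up for illustration; this is not the testcase added by the commit.

```c
#include <stddef.h>

/* Loop as written: the exit test sits in the header, before the body.  */
long sum_while (const int *a, size_t n)
{
  long s = 0;
  size_t i = 0;
  while (i < n)
    {
      if (a[i] > 0)
        s += a[i];
      i++;
    }
  return s;
}

/* Hand-written equivalent after header copying: one copy of the test
   guards entry, and the loop proper becomes a do-while.  */
long sum_copied_header (const int *a, size_t n)
{
  long s = 0;
  size_t i = 0;
  if (i < n)
    do
      {
        if (a[i] > 0)
          s += a[i];
        i++;
      }
    while (i < n);
  return s;
}
```

Both functions compute the same result, including for n == 0, where the guarding copy of the test skips the loop entirely.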


[Bug tree-optimization/53947] [meta-bug] vectorizer missed-optimizations

2015-07-02 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
Bug 53947 depends on bug 65946, which changed state.

Bug 65946 Summary: Simple loop with if-statement not vectorized
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65946

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED


[Bug target/64134] (vector float){0, 0, b, a} Uses stores when it does not need to

2015-06-26 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64134

alalaw01 at gcc dot gnu.org changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #4 from alalaw01 at gcc dot gnu.org ---
Fixed by r29.


[Bug tree-optimization/57600] Turn 2 comparisons into 1 with the min

2015-06-19 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57600

alalaw01 at gcc dot gnu.org changed:

   What|Removed |Added

 CC||alalaw01 at gcc dot gnu.org

--- Comment #5 from alalaw01 at gcc dot gnu.org ---
(In reply to Marc Glisse from comment #2)
>
> Or do we want to do the transformation always, and maybe have something
> later (in RTL?) to undo it if it didn't help?
> 
> Note that in some experiments with more meat in the loop, having
> i some optimizations.

Can you give an example where it not only doesn't help, but actually hurts? Are
they all just because of not seeing analysis properties, i.e. we could get
there by realizing a<=min(a,...) and looking far enough to see a

[Bug target/65952] [AArch64] Will not vectorize storing induction of pointer addresses for LP64

2015-06-17 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65952

--- Comment #8 from alalaw01 at gcc dot gnu.org ---
(In reply to alalaw01 from comment #7)
> (In reply to Richard Biener from comment #6)
> > So aarch64 has no DImode vectors?  Or just no DImode multiply (but it has a
> > DImode vector shift?).
> 
> Yes, the latter.

Sorry, aarch64 has a DImode multiply, but no V2DImode multiply; and it has
V2DImode shifts.
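
A minimal sketch of why the available shift matters, using GCC's generic vector extension rather than any AArch64-specific intrinsic: scaling a two-element 64-bit vector by a power of two can be written either as a multiply or as a shift. The typedef and function names are hypothetical, and the factor 16 is chosen only because it is a power of two.

```c
/* Two 64-bit lanes, i.e. the shape of a V2DImode vector.  */
typedef unsigned long long v2du __attribute__ ((vector_size (16)));

/* Scale each lane with a multiply.  */
v2du scale_mul (v2du x)
{
  return x * 16ULL;
}

/* Scale each lane with a left shift by log2(16) = 4.  */
v2du scale_shift (v2du x)
{
  return x << 4;
}
```

The two functions return identical lanes for any input, so a multiply by a power of two does not in itself require a vector multiply instruction.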


[Bug target/65952] [AArch64] Will not vectorize storing induction of pointer addresses for LP64

2015-06-17 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65952

--- Comment #7 from alalaw01 at gcc dot gnu.org ---
(In reply to Richard Biener from comment #6)
> So aarch64 has no DImode vectors?  Or just no DImode multiply (but it has a
> DImode vector shift?).

Yes, the latter.


[Bug target/65952] [AArch64] Will not vectorize storing induction of pointer addresses for LP64

2015-06-17 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65952

--- Comment #5 from alalaw01 at gcc dot gnu.org ---
So the above example tends to get fully unrolled, but even on an example with
32 ptrs rather than 4, yes, the vectorizer fails because of the multiplication.
However, the multiplication is gone by the final tree stage, as it's strength
reduced down to an add; I believe this -fdump-tree-optimized output would be
perfectly vectorizable:

loop ()
{
  unsigned long ivtmp.12;
  unsigned long ivtmp.10;
  void * _4;
  struct my_struct * _7;
  struct my_struct * pretmp_11;
  unsigned long _20;

  :
  pretmp_11 = array;
  ivtmp.10_16 = (unsigned long) pretmp_11;
  ivtmp.12_2 = (unsigned long) &ptrs;
  _20 = (unsigned long) &MEM[(void *)&ptrs + 256B];

  :
  # ivtmp.10_10 = PHI 
  # ivtmp.12_15 = PHI 
  _7 = (struct my_struct *) ivtmp.10_10;
  _4 = (void *) ivtmp.12_15;
  MEM[base: _4, offset: 0B] = _7;
  ivtmp.10_1 = ivtmp.10_10 + 16;
  ivtmp.12_14 = ivtmp.12_15 + 8;
  if (ivtmp.12_14 != _20)
goto ;
  else
goto ;

  :
  return;

}
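
The strength reduction visible in the dump can be sketched at source level as follows. This is an illustrative reconstruction, not the testcase from the PR: the layout of struct my_struct is assumed (the dump's step of 16 suggests a 16-byte element), and the function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed 16-byte layout, matching the dump's step of 16.  */
struct my_struct { uint64_t a, b; };

#define N 32  /* 32 pointers of 8 bytes each gives the 256B bound on LP64 */

/* Indexed form: the address of array[i] is base + i * sizeof (struct).  */
void fill_indexed (struct my_struct *array, void **ptrs)
{
  for (size_t i = 0; i < N; i++)
    ptrs[i] = &array[i];
}

/* Strength-reduced form, mirroring the dump: two induction variables
   advanced by additions, with no multiply left in the loop.  */
void fill_reduced (struct my_struct *array, void **ptrs)
{
  uintptr_t src = (uintptr_t) array;
  uintptr_t dst = (uintptr_t) ptrs;
  uintptr_t end = (uintptr_t) (ptrs + N);
  do
    {
      *(void **) dst = (void *) src;
      src += sizeof (struct my_struct);
      dst += sizeof (void *);
    }
  while (dst != end);
}
```

Both functions store the same addresses; the second simply makes the additions explicit.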


[Bug tree-optimization/61171] vectorization fails for a reduction in presence of subtraction

2015-06-17 Thread alalaw01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61171

alalaw01 at gcc dot gnu.org changed:

   What|Removed |Added

 CC||alalaw01 at gcc dot gnu.org

--- Comment #2 from alalaw01 at gcc dot gnu.org ---
This vectorizes fine, if vv is made a local variable:

float isOk() {
  float vv = 0;
  for (int j=0U; j
