[Bug fortran/69368] [6 Regression] spec2006 test case 416.gamess fails with the g++ 6.0 compiler starting with r232508
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69368 alalaw01 at gcc dot gnu.org changed: What|Removed |Added Assignee|alalaw01 at gcc dot gnu.org|unassigned at gcc dot gnu.org --- Comment #88 from alalaw01 at gcc dot gnu.org --- Can this now be closed, or should I leave open for possible Fortran FE warnings?
[Bug target/63679] [5/6 Regression][AArch64] Failure to constant fold.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63679 alalaw01 at gcc dot gnu.org changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED --- Comment #43 from alalaw01 at gcc dot gnu.org --- I think this can be closed now? I've raised PR/70189 for the followup enhancement.
[Bug middle-end/70189] New: Combine constant-pool logic from gimplify + SRA
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70189

            Bug ID: 70189
           Summary: Combine constant-pool logic from gimplify + SRA
           Product: gcc
           Version: 6.0
            Status: UNCONFIRMED
          Severity: enhancement
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: alalaw01 at gcc dot gnu.org
  Target Milestone: ---

Following PR/63679 (r232506), gimplify.c (gimplify_init_constructor) uses lots of heuristics to choose between pushing initializers out to the constant pool (by calling tree_output_constant_def) or outputting many elementwise statements.

Then, in tree-sra.c (analyze_all_variable_accesses), we use more heuristics to decide which constant-pool loads to completely_scalarize, turning those back into elementwise statements. (These get pulled back in from the constant pool and the constant-pool entry deleted.)

Both of these sets of heuristics are platform dependent (gimplify.c uses can_move_by_pieces, CLEAR_RATIO; tree-sra.c uses get_move_ratio).

Instead we should put all this logic in one place; this would make it clearer, and we'd probably get better overall decisions. The suggestion is for gimplify.c to always push out to the constant pool, as this makes the initial tree the same on all platforms, and for all the logic/heuristics to go into SRA (as, being later, we then have more information available to maybe make better decisions in the future).
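For readers less familiar with the two strategies being weighed here, the following is a minimal source-level sketch, not GCC internals. It assumes that a constant-pool entry behaves like a read-only static object; the names pool_entry, sum_via_block_copy, sum_via_elementwise_stores, sum_as_written and check_init_strategies are all made up for illustration. Which form a given compiler actually emits for sum_as_written depends on the target-specific heuristics discussed above.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for a compiler-generated constant-pool entry
   such as *.LC0. */
static const int pool_entry[4] = { 10, 20, 30, 40 };

/* Strategy 1: initialize the local with one block copy from read-only data. */
int sum_via_block_copy(void)
{
    int d[4];
    memcpy(d, pool_entry, sizeof d);
    return d[0] + d[1] + d[2] + d[3];
}

/* Strategy 2: initialize the local with one store per element. */
int sum_via_elementwise_stores(void)
{
    int d[4];
    d[0] = 10;
    d[1] = 20;
    d[2] = 30;
    d[3] = 40;
    return d[0] + d[1] + d[2] + d[3];
}

/* What the programmer wrote; the compiler is free to pick either strategy. */
int sum_as_written(void)
{
    int d[4] = { 10, 20, 30, 40 };
    return d[0] + d[1] + d[2] + d[3];
}

/* Hypothetical self-check: all three forms must agree. */
void check_init_strategies(void)
{
    assert(sum_as_written() == 100);
    assert(sum_via_block_copy() == sum_as_written());
    assert(sum_via_elementwise_stores() == sum_as_written());
}
```

Both hand-written forms are semantically identical, which is why the choice can in principle be deferred to a single later pass.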
[Bug tree-optimization/70013] [6 Regression] packed structure tree-sra loses initialization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70013

--- Comment #13 from alalaw01 at gcc dot gnu.org ---
Author: alalaw01
Date: Fri Mar 11 12:08:01 2016
New Revision: 234138

URL: https://gcc.gnu.org/viewcvs?rev=234138&root=gcc&view=rev
Log:
Fix PR/70013

gcc:
    PR tree-optimization/70013
    * tree-sra.c (analyze_access_subtree): Also set grp_unscalarized_data
    for constant-pool entries.

gcc/testsuite:
    * gcc.dg/tree-ssa/sra-20.c: New.

Added:
    trunk/gcc/testsuite/gcc.dg/tree-ssa/sra-20.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-sra.c
[Bug tree-optimization/70013] [6 Regression] packed structure tree-sra loses initialization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70013 --- Comment #12 from alalaw01 at gcc dot gnu.org --- Thanks, Martin - yes, I see. Patch posted at https://gcc.gnu.org/ml/gcc-patches/2016-03/msg00680.html after full regtest.
[Bug tree-optimization/67681] Missed vectorization: induction variable used after loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67681 --- Comment #8 from alalaw01 at gcc dot gnu.org --- Indeed, the -DFOO=1 case vectorizes with -fno-tree-dominator-opts.
[Bug tree-optimization/67681] Missed vectorization: induction variable used after loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67681 --- Comment #7 from alalaw01 at gcc dot gnu.org --- Looking at where the peeling happens. In both -DFOO=0 and -DFOO=1 cases, 107.ch2 peels the inner loop header, so there is an i<=max test in the outer loop before the inner loop. However, in the -DFOO=1 case, this is dominated by the extra i>max test (that breaks out of the outer loop), so 110.dom2 removes the peeled i<=max. Thus, just before sccp, in the -DFOO=0 case, we have: : # i_25 = PHI # j_26 = PHI max_7 = 1 << j_26; if (max_7 >= i_25) goto ; else goto ; //skip inner loop : //inner loop header # i_2 = PHI _8 = (long unsigned int) i_2; _9 = _8 * 4; _11 = data_10(D) + _9; _12 = *_11; _13 = _12 + j_26; *_11 = _13; i_15 = i_2 + 1; if (max_7 >= i_15) goto ; //cleaned, actually via latch else goto ; note the inner loop exits if !(max_7 >= i_15), and when we hit the inner loop, we know that (max_7 >= i_25). Whereas in the -DFOO=1 case: : goto ; : //in outer loop max_7 = 1 << j_17; if (max_7 < i_32) goto ; else goto ; : //outer loop header # max_24 = PHI # i_22 = PHI # j_23 = PHI : //inner loop header # i_27 = PHI _8 = (long unsigned int) i_27; _9 = _8 * 4; _11 = data_10(D) + _9; _13 = *_11; _14 = _13 + j_23; *_11 = _14; i_16 = i_27 + 1; if (i_16 <= max_24) goto ; //cleaned, actually via latch else goto ; the inner loop exits if !(max_24 >= i_16), but max_24 is defined as PHI, and we only have that max_7max) break" out of the loop, such that the outer loop now executes "if (i>max) break" after the inner loop (rather than testing "if (i>max) break" before the inner loop, as it still did following 107.ch2). So as an alternative, possibly tweaking the jump-threading/loop-peeling heuristics might help (?).
[Bug tree-optimization/67681] Missed vectorization: induction variable used after loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67681

--- Comment #5 from alalaw01 at gcc dot gnu.org ---
In the -DFOO=0 case, we have peeled an extra copy of the inner loop condition, i <= max_7, above the loop. Scalar evolution (final_value_replacement_loop) works, because it sees the inner loop goes round

    niter = (unsigned int) max_7 - (unsigned int) i_25

iterations, and compute_overall_effect_of_inner_loop gives us

    (int) (((unsigned int) i_25 + ((unsigned int) max_7 - (unsigned int) i_25)) + 1)

which is not expression_expensive_p, so we do it. Hence the add/subtract above.

When -DFOO=1, we have not done that peeling, so

    niter = i_22 <= max_24 ? (unsigned int) max_24 - (unsigned int) i_22 : 0

and compute_overall_effect_of_inner_loop gives us

    (i_22 + 1) + (i_22 <= max_24 ? (int) ((unsigned int) max_24 - (unsigned int) i_22) : 0)

which is expression_expensive_p, so we don't do the final value replacement.
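The two closed forms in this comment can be checked against the loop they replace with a small C sketch. This is an illustration only, assuming the inner loop has the bottom-tested shape seen in the dumps (increment, then test i <= max); the function names are hypothetical and the check is restricted to small non-negative values to stay clear of conversion corner cases.

```c
#include <assert.h>

/* The inner loop with its exit test at the bottom: the body runs at least
   once.  Returns the value of i on exit. */
int final_i_by_loop(int i, int max)
{
    do {
        i = i + 1;
    } while (i <= max);
    return i;
}

/* Closed form for the -DFOO=0 shape, where a guard has already established
   i <= max before the loop: no conditional is needed. */
int final_i_guarded(int i, int max)
{
    unsigned niter = (unsigned) max - (unsigned) i;
    return (int) (((unsigned) i + niter) + 1u);
}

/* Closed form for the -DFOO=1 shape, without that guard: the iteration
   count itself needs a conditional. */
int final_i_unguarded(int i, int max)
{
    int extra = (i <= max) ? (int) ((unsigned) max - (unsigned) i) : 0;
    return (i + 1) + extra;
}

/* Hypothetical self-check over a small range of non-negative inputs. */
void check_final_values(void)
{
    for (int max = 0; max <= 6; max++) {
        for (int i = 0; i <= 6; i++) {
            int expected = final_i_by_loop(i, max);
            assert(final_i_unguarded(i, max) == expected);
            if (i <= max)
                assert(final_i_guarded(i, max) == expected);
        }
    }
}
```

The guarded form is straight-line arithmetic, while the unguarded form carries a conditional, which matches the cost distinction described in the comment.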
[Bug tree-optimization/70013] [6 Regression] packed structure tree-sra loses initialization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70013

--- Comment #10 from alalaw01 at gcc dot gnu.org ---
Hmmm, so this fixes the ICE, generating:

  SR.5_12 = MEM[(struct S0[2] *)&*.LC0].f0;
  MEM[(struct S0[2] *)&*.LC0].f0 = SR.5_12;
  d = *.LC0;
  d$3$f0_14 = MEM[(struct S0[2] *)&*.LC0 + 3B].f0;
  d$0$f0_7 = SR.5_12;
  e$f0_9 = d$3$f0_14;
  _3 = (int) d$0$f0_7;
  c = _3;
  _5 = (int) e$f0_9;
  __builtin_printf ("%x\n", _5);
  d ={v} {CLOBBER};
  return 0;

which in -fdump-tree-optimized (at -O1) looks like:

  SR.5_12 = MEM[(struct S0[2] *)&*.LC0].f0;
  d$3$f0_14 = MEM[(struct S0[2] *)&*.LC0 + 3B].f0;
  _3 = (int) SR.5_12;
  c = _3;
  _5 = (int) d$3$f0_14;
  __builtin_printf ("%x\n", _5);
  return 0;

which is much saner. But I don't really understand why the PARM_DECL case that I'm adding to here is that way (since r147980 "New implementation of SRA" in 2009, https://gcc.gnu.org/ml/gcc-patches/2009-04/msg02218.html)...

Bootstrapped+regtest on AArch64 (c,c++) and ARM (c,c++,ada), no regressions. (Constants don't get pushed into the pool on x86.)

diff --git a/gcc/tree-sra.c b/gcc/tree-sra.c
index 72157edd02e3235e57b786bbf460c94b0c52b2c5..24eac6ae7c4dcd41358b1a020047076afe1a8106 100644
--- a/gcc/tree-sra.c
+++ b/gcc/tree-sra.c
@@ -2427,7 +2427,8 @@ analyze_access_subtree (struct access *root, struct access *parent,

   if (!hole || root->grp_total_scalarization)
     root->grp_covered = 1;
-  else if (root->grp_write || TREE_CODE (root->base) == PARM_DECL)
+  else if (root->grp_write || TREE_CODE (root->base) == PARM_DECL
+           || constant_decl_p (root->base))
     root->grp_unscalarized_data = 1; /* not covered and written to */
   return sth_created;
 }
[Bug tree-optimization/67681] Missed vectorization: induction variable used after loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67681 --- Comment #4 from alalaw01 at gcc dot gnu.org --- loopinit introduces the exit phi in much the same way for both -DFOO=0 and -DFOO=1, so the difference is in sccp. In the -DFOO=0 case, sccp does this (removing TODO_cleanup_cfg from pass_data_scev_cprop to make the diff easier, still vectorizes): ;; Function addlog2 (addlog2, funcdef_no=0, decl_uid=2749, cgraph_uid=0, symbol_order=0) + +final value replacement: + i_21 = PHI + with + i_21 = (int) _3; + ...[snip]... : - # i_21 = PHI + _19 = (unsigned int) i_25; + _18 = (unsigned int) max_7; + _17 = (unsigned int) i_25; + _5 = _18 - _17; + _4 = _5 + _19; + _3 = _4 + 1; + i_21 = (int) _3; In the -DFOO=1 case, sccp doesn't do anything; and adding -fno-tree-scev-cprop prevents vectorization of the -DFOO=0 case.
[Bug tree-optimization/67681] Missed vectorization: induction variable used after loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67681 --- Comment #3 from alalaw01 at gcc dot gnu.org --- So in the not-vectorized case (-DFOO=1), we get for the inner loop: : # i_27 = PHI _8 = (long unsigned int) i_27; _9 = _8 * 4; _11 = data_10(D) + _9; _13 = *_11; _14 = _13 + j_23; *_11 = _14; i_16 = i_27 + 1; if (i_16 <= max_24) goto ; else goto ; : goto ; : # i_32 = PHI the loop exit phi, i_32=PHI, makes i_16=i_27+1 relevant (vec_stmt_relevant_p: used out of loop.), so we go through that on the worklist and then i_27=PHI, marking the phi as STMT_VINFO_LIVE_P, and hence "not vectorized: value used after loop". Kind of as expected, FORNOW. In the -DFOO=0 case, a bunch of loop peeling, header-copying, and other transforms, end up with this input to vectorization: : //header of inner loop # i_2 = PHI _8 = (long unsigned int) i_2; _9 = _8 * 4; _11 = data_10(D) + _9; _12 = *_11; _13 = _12 + j_26; *_11 = _13; i_15 = i_2 + 1; if (max_7 >= i_15) goto ; else goto ; : goto ; : //bb 5 is only predecessor _19 = (unsigned int) i_25; _18 = (unsigned int) max_7; _17 = (unsigned int) i_25; _5 = _18 - _17; _4 = _5 + _19; _3 = _4 + 1; i_21 = (int) _3; : # i_23 = PHI //tests outer loop note bb7 use i_25, not i_2; so neither i_15 nor i_2 escape the loop, and we don't have the problem from above. (Yes bb7 is taking i_25 away from max_7 and then adding it back on again, before adding 1, to give the value of i after the inner loop.) This arrangement of multiple i's live at the same time, is not present in 107t.ch2. 130t.loopinit introduces i_21, computed by an exit phi on leaving the inner loop. 135t.sccp then changes this to the max_7-i_25+i_25 sequence which removes the dependency on i_15 and allows vectorization.
[Bug tree-optimization/70013] [6 Regression] packed structure tree-sra loses initialization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70013 --- Comment #9 from alalaw01 at gcc dot gnu.org --- In analyze_access_subtree (since r147980, "New implementation of SRA", 2009): else if (root->grp_write || TREE_CODE (root->base) == PARM_DECL) root->grp_unscalarized_data = 1; /* not covered and written to */ adding a case for constant_decl_p alongside the PARM_DECL case, fixes the ICE; AArch64 bootstrap in progress.
[Bug tree-optimization/70013] [6 Regression] packed structure tree-sra loses initialization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70013 --- Comment #7 from alalaw01 at gcc dot gnu.org --- *second* half, sorry. grp_to_be_replaced is here true, but grp_unscalarized_data is false, so handle_unscalarized_data_in_subtree sets sad->refreshed=UDH_LEFT and we build the access to the LHS. (Then, load_assign_lhs_subreplacements exits, and the caller sees UDH_LEFT and removes the original block move statement.) In contrast, on a similar testcase using a parameter rather than *.LC0, grp_unscalarized_data is true, handle_unscalarized_data_in_subtree sets sad->refreshed=UDH_RIGHT and we build an access to the RHS, which is OK; and leave the block move statement in place, hence correctness.
[Bug tree-optimization/70013] [6 Regression] packed structure tree-sra loses initialization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70013 --- Comment #6 from alalaw01 at gcc dot gnu.org --- Ugh, initializing the scalar replacement for the first half of d, with a value read from the first half of d (should be from the first half of *.LC0).
[Bug tree-optimization/70013] [6 Regression] packed structure tree-sra loses initialization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70013

--- Comment #5 from alalaw01 at gcc dot gnu.org ---
Prior to SRA, we have

  d = *.LC0;
  d$0$f0_7 = MEM[(struct S0[2] *)&*.LC0].f0;
  e$f0_9 = MEM[(struct S0[2] *)&d + 3B].f0;
  _3 = (int) d$0$f0_7;
  c = _3;
  _5 = (int) e$f0_9;
  __builtin_printf ("%x\n", _5);

sra_modify_assign for d=*.LC0 ends up in load_assign_lhs_subreplacements, where d has two children; the second is grp_to_be_replaced, but because we did not completely_scalarize LC0, there is an access to only the first half of *.LC0, and no corresponding RHS for the second half of d ('racc = find_access_in_subtree (sad->top_racc, offset, lacc->size)' returns null). So we generate the bad

  d$3$f0_14 = MEM[(struct S0[2] *)&d + 3B].f0;

that is, initializing the scalar replacement for the second half of d, with a value read from the first half of d.
[Bug tree-optimization/70013] [6 Regression] packed structure tree-sra loses initialization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70013 alalaw01 at gcc dot gnu.org changed: What|Removed |Added CC||alalaw01 at gcc dot gnu.org --- Comment #4 from alalaw01 at gcc dot gnu.org --- Hmmm. First thing I notice is that the type of d (struct S0[2]) is not scalarizable_type_p, but passes type_internals_preclude_sra_p. Changing the latter to bail out on DECL_BIT_FIELD (as the former does) fixes the ICE, but I'm not yet sure we want to do that.
[Bug fortran/69368] [6 Regression] spec2006 test case 416.gamess fails with the g++ 6.0 compiler starting with r232508
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69368 --- Comment #87 from alalaw01 at gcc dot gnu.org --- Great, many thanks for the tests, I was worried if we had hit another distinct issue. (Of course this would be better on gcc-patches!)
[Bug fortran/69368] [6 Regression] spec2006 test case 416.gamess fails with the g++ 6.0 compiler starting with r232508
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69368 --- Comment #84 from alalaw01 at gcc dot gnu.org --- Bah. Do you normally use -fno-aggressive-loop-optimizations? With -funknown-commons, did you try with/out aggressive loop opts? Powerpc{,64}{be,le} ? The unknown-commons testcase I included in that patch looks to pass on powerpc64le-unknown-linux-gnu. Does HJ Lu's spec source-patching work on powerpc following r232559? I am not a lawyer...but I don't think the SPEC2006 license allows me to upload onto the GCC Compile Farm and runspec. So if you could narrow down to an object file that's broken with a recent compiler and -funknown-commons, with the rest compiled with a gcc prior to r232508, that'd be very helpful - then I could see what assembly I'm changing (and what expressions equal_mem_array_ref is falsely declaring equivalent)...?
[Bug bootstrap/60632] ICE in regcprop.c (copyprop_hardreg_forward_1)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60632 alalaw01 at gcc dot gnu.org changed: What|Removed |Added Status|WAITING |RESOLVED CC||alalaw01 at gcc dot gnu.org Resolution|--- |WORKSFORME --- Comment #2 from alalaw01 at gcc dot gnu.org --- Sorry, no idea...
[Bug fortran/69368] [6 Regression] spec2006 test case 416.gamess fails with the g++ 6.0 compiler starting with r232508
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69368 --- Comment #82 from alalaw01 at gcc dot gnu.org --- For those who haven't seen it, I've put forward this patch on the mailing list: https://gcc.gnu.org/ml/gcc-patches/2016-02/msg01746.html based on a suggestion from Jakub. (Unlike Richi's comment72 patch, this fixes 416.gamess on AArch64.)
[Bug tree-optimization/65963] Missed vectorization of loads strided with << when equivalent * succeeds
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65963 alalaw01 at gcc dot gnu.org changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED --- Comment #5 from alalaw01 at gcc dot gnu.org --- Can I class this as fixed?
[Bug middle-end/66877] [6 Regression] FAIL: gcc.dg/vect/vect-over-widen-3-big-array.c -flto -ffat-lto-objects scan-tree-dump-times vect "vect_recog_over_widening_pattern: detected" 2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66877 alalaw01 at gcc dot gnu.org changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #8 from alalaw01 at gcc dot gnu.org --- Fix committed r232720.
[Bug fortran/69368] [6 Regression] spec2006 test case 416.gamess fails with the g++ 6.0 compiler starting with r232508
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69368 --- Comment #79 from alalaw01 at gcc dot gnu.org --- (In reply to rguent...@suse.de from comment #78) > > That would pessimize it too much IMHO. I'm not sure how to evaluate the pessimization, given it's thought to be a widespread pseudo-FORTRAN construct; so I probably have to defer to your judgement here. However... Given maxsize of an array as two elements, say, would the compiler not be entitled to optimize an index selection down to, say, computing only the LSBit of the actual index? Whereas 'unknown' means, well, exactly what is the case. So I fear this is storing problems up for the future. Is the concern that we can't hide this behind an option, as that would "drive people away from gfortran" ? If that's the case, can we hide it behind an option that defaults to pessimization (?? at least for fortran)??
[Bug fortran/69368] [6 Regression] spec2006 test case 416.gamess fails with the g++ 6.0 compiler starting with r232508
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69368 --- Comment #77 from alalaw01 at gcc dot gnu.org --- (In reply to rguent...@suse.de from comment #72) > > Patch as posted passed bootstrap & regtest. Adjusted according to > comments but not tested otherwise - please somebody throw at > unpatched 416.gamess. Still miscompares on aarch64, I'm afraid. (Both with and without -fno-aggressive-loop-optimizations.) Also where Jakub wrote: > If you want to go this way, I'd at least key it off DECL_COMMON on the decl. > And instead of multiplying max_size by 2 perhaps just add BITS_PER_UNIT? I wonder why you prefer setting such an arbitrary guess at max_size rather than going with -1 which is defined as "unknown" ?
[Bug fortran/69368] [6 Regression] spec2006 test case 416.gamess fails with the g++ 6.0 compiler starting with r232508
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69368 --- Comment #53 from alalaw01 at gcc dot gnu.org --- (In reply to Thomas Koenig from comment #44) > I don't have access to SPEC, so I can only guess... Is there maybe an > equivalence involved, something like Turns out the COMMON is accessed via a MEM_REF in a loop, or as a VAR_DECL inside. Go figure! :) (In reply to Dominique d'Humieres from comment #49) > I don't see the point to add yet another option just because "SPEC does not > want to change the invalid Fortran". I think SPEC should be run with the > option(s) causing the problem disabled. Anecdotally I hear from Fortran-using colleagues this may occur in other places too. Moreover, the list of phases using get_ref_base_and_extent, is long; we could end up compiling with an ever-growing -fno-this -fno-that as more and more phases make use of the "bad" analysis results (that is correct by the language spec after all). In this case, there are a few other equivalences found due to the tree-ssa-scopedtables.c changes, that we'd lose with -fno-tree-dominator-opts, too. (In reply to H.J. Lu from comment #52) > >So, there is nothing to fix in GCC? Why isn't this bug closed as invalid? Not everyone wants to patch SPEC sources.
[Bug fortran/69368] [6 Regression] spec2006 test case 416.gamess fails with the g++ 6.0 compiler starting with r232508
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69368

--- Comment #43 from alalaw01 at gcc dot gnu.org ---
Yeah, I plan to add a fortran-specific option for this, it's easy enough, but I can't run the gfortran testsuite with that, because there are lots of C files in there too, for which the compiler doesn't accept the option...

I'm having trouble writing a testcase though. My subroutine with

      IMPLICIT DOUBLE PRECISION (X)
      COMMON /MYCOMMON / X(1)

produces "mycommon.x" a COMPONENT_REF, but with "mycommon" being a MEM_REF, which requires only the hunk to tree-dfa.c to handle correctly; whereas in SPEC2006, what looks to me to be equivalent FORTRAN, ends up with "mycommon" being a VAR_DECL, which requires the much-bigger patch to the fortran FE...

I've very little fortran experience here, any tips?

Thanks, Alan
[Bug fortran/69368] [6 Regression] spec2006 test case 416.gamess fails with the g++ 6.0 compiler starting with r232508
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69368

--- Comment #39 from alalaw01 at gcc dot gnu.org ---
Created attachment 37726
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=37726&action=edit
Proposed patch (without flag).

Here's a prototype patch that sets TYPE_SIZE to NULL_TREE but leaves DECL_SIZE intact. For the moment I'm applying this universally, rather than gating it under a flag, to ease testing with check-fortran. Only gfortran.dg/gomp/appendix-a/a.24.1.f90 fails; in practice I think it's OK just to not use the new code in conjunction with -fopenmp.

On AArch64, it fixes the 416.gamess issue, and allows compiling 416.gamess without the -fno-aggressive-loop-optimizations previously required. It also bootstraps and passes check-gcc, check-fortran and check-g++ on aarch64 and x86_64, except as noted above.

I expect to add a Fortran-only flag to gate the trans-common.c changes before taking this to gcc-patches@.

The worry is that while many cases in the mid-end were happy with a null TYPE_SIZE, I still had to patch up a couple, so I might not have got them all. (Indeed, omp-low.c had too many!) I'm not sure this is any worse than adding a new flag to the decl (indicating that the DECL_SIZE is not to be trusted) and then trying to find all the cases where the DECL_SIZE is wrongly relied upon: with the latter approach, the compiler would generate invalid code, rather than "failing fast".

Thoughts welcome!
[Bug fortran/69368] [6 Regression] spec2006 test case 416.gamess fails with the g++ 6.0 compiler starting with r232508
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69368

--- Comment #37 from alalaw01 at gcc dot gnu.org ---
(In reply to Jakub Jelinek from comment #36)
> As Richard said, you can do similar (invalid too) stuff in C too, say:
> struct S { int a[1]; } s;
> in one TU and
> struct S { int a[1]; } s;
>
> int
> foo (int x)
> {
>   return s.a[x];
> }
>
> int
> bar (int x)
> {
>   return s.a[1 + x] + s.a[0] + s.a[x];
> }
>
> GCC 5 would compile it to what the author might have meant, while GCC 6 will
> optimize bar into s.a[0] * 3;

Yes, this was what I meant in comment #33. The question is, do we care? (Or, do we only care in the FORTRAN case?) If so, then we presumably want a -fbroken-common-blocks (or something!) that is not FE-specific.
[Bug fortran/69368] [6 Regression] spec2006 test case 416.gamess fails with the g++ 6.0 compiler starting with r232508
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69368

alalaw01 at gcc dot gnu.org changed:

           What    |Removed                     |Added
             Status|RESOLVED                    |REOPENED
         Resolution|FIXED                       |---

--- Comment #33 from alalaw01 at gcc dot gnu.org ---
(In reply to rguent...@suse.de from comment #31)
>
> Thus a "fix" for the case where treating a[i] as a[0] is the issue
> would be
>
> Index: gcc/tree-dfa.c
> ===
> --- gcc/tree-dfa.c (revision 233172)
> +++ gcc/tree-dfa.c (working copy)
> @@ -617,7 +617,11 @@ get_ref_base_and_extent (tree exp, HOST_
>        if (maxsize == -1
>            && DECL_SIZE (exp)
>            && TREE_CODE (DECL_SIZE (exp)) == INTEGER_CST)
> -        maxsize = wi::to_offset (DECL_SIZE (exp)) - bit_offset;
> +        {
> +          maxsize = wi::to_offset (DECL_SIZE (exp)) - bit_offset;
> +          if (maxsize == size)
> +            maxsize = -1;
> +        }
>      }
>    else if (CONSTANT_CLASS_P (exp))
>      {

So is there a case where we want this for C? If I declare a struct with a VLA, and access it through a pointer - GCC recognizes the VLA idiom and keeps the accesses. If I access it from a decl, yes we optimize away the out-of-bounds accesses (in FRE, long before we reach the tree-ssa-scopedtables changes). So OK, if I access it from an extern or __attribute__((weak)) decl, which I then get the linker to replace with a bigger decl, then I get "wrong" code (it ignores the extra elements in the bigger decl) - but I'd say that was invalid code.

So if this is Fortran-only, we probably have to hook off --std=legacy, right?
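As a point of comparison, here is a minimal sketch of the pointer-based idiom, assuming that "a struct with a VLA" above refers to a struct whose trailing array is sized at allocation time (written here as a C99 flexible array member). The type and function names are hypothetical, and the sketch only shows the valid form; it makes no claim about how any particular GCC version optimizes it.

```c
#include <assert.h>
#include <stdlib.h>

/* Trailing array whose real length is only known at allocation time. */
struct block {
    size_t n;
    double x[];     /* C99 flexible array member */
};

/* Allocate a block with room for n elements, all zeroed. */
struct block *block_new(size_t n)
{
    struct block *b = malloc(sizeof *b + n * sizeof b->x[0]);
    if (b == NULL)
        return NULL;
    b->n = n;
    for (size_t i = 0; i < n; i++)
        b->x[i] = 0.0;
    return b;
}

/* Every access goes through the pointer and stays within the n elements
   that were actually allocated. */
double block_sum(const struct block *b)
{
    double s = 0.0;
    for (size_t i = 0; i < b->n; i++)
        s += b->x[i];
    return s;
}

/* Hypothetical self-check. */
void check_block(void)
{
    struct block *b = block_new(4);
    assert(b != NULL);
    b->x[0] = 1.0;
    b->x[3] = 2.5;
    assert(block_sum(b) == 3.5);
    free(b);
}
```

The contrast with the COMMON-block case is that here the declared type itself says the array bound is open, so nothing has to be replaced at link time.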
[Bug fortran/69368] [6 Regression] spec2006 test case 416.gamess fails with the g++ 6.0 compiler starting with r232508
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69368

--- Comment #32 from alalaw01 at gcc dot gnu.org ---
(In reply to rguent...@suse.de from comment #31)
>
> Thus a "fix" for the case where treating a[i] as a[0] is the issue
> would be
>
> Index: gcc/tree-dfa.c
> ===
> --- gcc/tree-dfa.c (revision 233172)
> +++ gcc/tree-dfa.c (working copy)
> @@ -617,7 +617,11 @@ get_ref_base_and_extent (tree exp, HOST_
>        if (maxsize == -1
>            && DECL_SIZE (exp)
>            && TREE_CODE (DECL_SIZE (exp)) == INTEGER_CST)
> -        maxsize = wi::to_offset (DECL_SIZE (exp)) - bit_offset;
> +        {
> +          maxsize = wi::to_offset (DECL_SIZE (exp)) - bit_offset;
> +          if (maxsize == size)
> +            maxsize = -1;
> +        }
>      }
>    else if (CONSTANT_CLASS_P (exp))
>      {

Maybe if we only did that for DECL_COMMONs if -std=legacy was in force? Tho as you say:

> but that wouldn't fix the aggressive-loop optimization issue as that is
> _not_ looking at DECL_SIZE but at the array types domain.

I wonder if we can't get both places looking at the same thing (DECL_SIZE or array type domain), but I haven't looked into that at all.
[Bug fortran/69368] [6 Regression] spec2006 test case 416.gamess fails with the g++ 6.0 compiler starting with r232508
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69368 alalaw01 at gcc dot gnu.org changed: What|Removed |Added Resolution|INVALID |FIXED --- Comment #27 from alalaw01 at gcc dot gnu.org --- (In reply to Richard Biener from comment #25) > (In reply to alalaw01 from comment #23) > > Well, this one is not fixed by -fno-aggressive-loop-optimizations. > > No, that just disabled one symptom of the issue at that point in time. > Fixing the issue also fixes this occurance (well, I hope so ;)) So by "fixing the issue" - we mean, making --std=legacy prevent this (as although against the SPEC, colleagues with more FORTRAN knowledge than I suggest this is common)? SPEC seem to be saying they will not change the source: https://www.spec.org/cpu2006/Docs/faq.html#Run.05 As Jakub suggested in comment #13: > So, perhaps we want some flag on the Fortran COMMON decls that would be set > on > COMMON that ends with an array and would tell get_ref_base_and_extent > (and > other spots?) that accesses can be beyond end of the decl? but only if --std=legacy ? ? ? Should I raise a new bug for this, as both this and 53068 are CLOSED?
[Bug fortran/69368] [6 Regression] spec2006 test case 416.gamess fails with the g++ 6.0 compiler starting with r232508
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69368 alalaw01 at gcc dot gnu.org changed: What|Removed |Added Resolution|DUPLICATE |FIXED --- Comment #23 from alalaw01 at gcc dot gnu.org --- Well, this one is not fixed by -fno-aggressive-loop-optimizations.
[Bug fortran/69368] [6 Regression] spec2006 test case 416.gamess fails with the g++ 6.0 compiler starting with r232508
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69368 --- Comment #20 from alalaw01 at gcc dot gnu.org --- Hmmm, hang on. In unport.fppized.f, shouldn't we be using the 'F2C/GCC COMPILER ON PC RUNNING UNIX (LINUX,BSD386,ETC)' version? In which case X has size (1) everywhere?
[Bug tree-optimization/69368] [6 Regression] spec2006 test case 416.gamess fails with the g++ 6.0 compiler starting with r232508
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69368 --- Comment #10 from alalaw01 at gcc dot gnu.org --- The stores are getting optimized out because equal_mem_array_ref_p considers equal pairs of MEM_REFS like fmcom.x[_168] and fmcom.x[_208]. That is, an ARRAY_REF whose first operand is a COMPONENT_REF fmcom.x (of a VAR_DECL and a FIELD_DECL), and whose second operand is an SSA_NAME, _168 or _208 (I don't see anything obvious to suggest that they should be equal). get_ref_base_and_extent then returns base=fmcom, size=64, max_size=64 (so not a variable-sized access), and offset 0 :-(.
[Bug middle-end/66877] [6 Regression] FAIL: gcc.dg/vect/vect-over-widen-3-big-array.c -flto -ffat-lto-objects scan-tree-dump-times vect "vect_recog_over_widening_pattern: detected" 2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66877 alalaw01 at gcc dot gnu.org changed: What|Removed |Added Status|WAITING |ASSIGNED CC||alalaw01 at gcc dot gnu.org Assignee|rguenth at gcc dot gnu.org |alalaw01 at gcc dot gnu.org --- Comment #7 from alalaw01 at gcc dot gnu.org --- I can test on ARM ;), so taken - https://gcc.gnu.org/ml/gcc-patches/2016-01/msg01727.html.
[Bug testsuite/69380] [6 Regression] FAIL: g++.dg/tree-ssa/pr69336.C scan-tree-dump-not optimized "cmap"
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69380 alalaw01 at gcc dot gnu.org changed: What|Removed |Added Target|arm-none-eabi powerpc*-*-* |arm-none-eabi powerpc*-*-* ||aarch64*-*-* CC||alalaw01 at gcc dot gnu.org --- Comment #2 from alalaw01 at gcc dot gnu.org --- Adding "--param sra-max-scalarization-size-Ospeed=72" makes the testcase pass; or we can XFAIL on arm, AArch64, powerpc (presumably also hppa, alpha, and others). 72 is quite large; thoughts? (This suggests that when we move the logic in gimplify_init_constructor, for pushing stuff out to the constant pool, into tree-sra.c, we may also want to refine it a bit. But in the meantime...)
[Bug tree-optimization/69352] [6 Regression] profiledbootstrap failure with --with-build-config=bootstrap-lto
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69352

--- Comment #9 from alalaw01 at gcc dot gnu.org ---
(In reply to Jakub Jelinek from comment #7)
> There are various bugs in the r232508 change.
> The
>   gcc_assert (sz0 == sz1);
>   gcc_assert (max0 == max1);
>   gcc_assert (rev0 == rev1);
> asserts are clearly bogus, while for compatible type I bet size will be
> always the same, maximum size can be arbitrary (it will be either equal to
> size, then it is a fixed access, or it will be larger, then it is a variable
> access), and the reverse stuff looks weird (e.g. I think the lack of
> REF_REVERSE_STORAGE_ORDER testing in operand_equal_p is a bug). For a
> variable access, even if you remove the above max{0,1} assert, I think you
> would happily equate say a[i] and a[j] ARRAY_REFs, because they have the
> same off (likely 0) and max (likely size of array in bits). Another problem
> I see in the
>   return equal_mem_array_ref_p (expr0->ops.single.rhs,
>                                 expr1->ops.single.rhs)
>          || operand_equal_p (expr0->ops.single.rhs,
>                              expr1->ops.single.rhs, 0);
> case; under some conditions you decide to hash the MEM_REF/ARRAY_REFs as
> MEM_REF , hash of base, offset and size, so you should use the same
> conditions to decide if you use equal_mem_array_ref_p or operand_equal_p,

Agreed, yes. That would fix the bogus asserts, right - we would then only use equal_mem_array_ref_p if size==max_size.

> Plus, I'm not sure in what places this hashing
> is used, I'm worried you might hash MEM_REFs with different alias types for
> the accesses as equal, which for some uses might be fine, if you are not
> trying to replace one with another etc., but for other cases it might lead
> to wrong-code.

I think it should be OK to ignore differences in alias type for DOM optimizations etc., indeed, which is where this was intended.
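The concern about equating a[i] with a[j], and the proposed size==max_size restriction, can be illustrated with a toy model. This is only a sketch and not GCC code: struct ref_extent and every function below are hypothetical stand-ins for the (base, offset, size, max_size) summary that get_ref_base_and_extent reports, with sizes in bits and -1 meaning unknown. The numbers assume a 4-element array of 32-bit ints.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical summary of a memory reference: base object, bit offset,
   access size in bits, and maximum extent in bits (-1 if unknown). */
struct ref_extent {
    const void *base;
    long offset;
    long size;
    long max_size;
};

/* Comparing summaries alone cannot tell a[i] from a[j]: both report the
   same base, offset, size and max_size. */
bool naive_equal(struct ref_extent r0, struct ref_extent r1)
{
    return r0.base == r1.base && r0.offset == r1.offset
           && r0.size == r1.size && r0.max_size == r1.max_size;
}

/* A fixed access is one whose extent is known and equal to its size. */
bool is_fixed_access(struct ref_extent r)
{
    return r.max_size != -1 && r.size == r.max_size;
}

/* Only trust the summary comparison when both accesses are fixed. */
bool guarded_equal(struct ref_extent r0, struct ref_extent r1)
{
    return is_fixed_access(r0) && is_fixed_access(r1) && naive_equal(r0, r1);
}

/* Hypothetical self-check. */
void check_extents(void)
{
    static int a[4];
    /* a[i] and a[j] for unknown i, j: offset 0, size 32, extent 128. */
    struct ref_extent a_i = { a, 0, 32, 128 };
    struct ref_extent a_j = { a, 0, 32, 128 };
    /* a[1] reached by two different spellings: a fixed access. */
    struct ref_extent a_1 = { a, 32, 32, 32 };
    struct ref_extent a_1_again = { a, 32, 32, 32 };

    assert(naive_equal(a_i, a_j));          /* false positive */
    assert(!guarded_equal(a_i, a_j));       /* guard rejects it */
    assert(guarded_equal(a_1, a_1_again));  /* fixed accesses still match */
}
```

Such a guard would not by itself cover the case reported in PR 69368, where a variable index into a one-element array came back with size=64 and max_size=64.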
[Bug tree-optimization/69336] Constant value not detected
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69336 alalaw01 at gcc dot gnu.org changed: What|Removed |Added CC||alalaw01 at gcc dot gnu.org --- Comment #4 from alalaw01 at gcc dot gnu.org --- That looks reasonable, AFAICT get_ref_base_and_extent will deal with anything that is handled_component_p. The same patch enables the optimization on aarch64, with appropriate --param sra-max-scalarization-size-Ospeed to pull the constant-pool entry in.
[Bug target/63679] [5/6 Regression][AArch64] Failure to constant fold.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63679 --- Comment #40 from alalaw01 at gcc dot gnu.org --- Author: alalaw01 Date: Mon Jan 18 12:40:43 2016 New Revision: 232508 URL: https://gcc.gnu.org/viewcvs?rev=232508&root=gcc&view=rev Log: Equate MEM_REFs and ARRAY_REFs in tree-ssa-scopedtables.c PR target/63679 gcc/: * tree-ssa-scopedtables.c (avail_expr_hash): Hash MEM_REF and ARRAY_REF using get_ref_base_and_extent. (equal_mem_array_ref_p): New. (hashable_expr_equal_p): Add call to previous. gcc/testsuite/: * gcc.dg/tree-ssa/ssa-dom-cse-5.c: New. * gcc.dg/tree-ssa/ssa-dom-cse-6.c: New. * gcc.dg/tree-ssa/ssa-dom-cse-7.c: New. Added: trunk/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-cse-5.c trunk/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-cse-6.c trunk/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-cse-7.c Modified: trunk/gcc/ChangeLog trunk/gcc/testsuite/ChangeLog trunk/gcc/tree-ssa-scopedtables.c
[Bug target/63679] [5/6 Regression][AArch64] Failure to constant fold.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63679 --- Comment #39 from alalaw01 at gcc dot gnu.org --- Author: alalaw01 Date: Mon Jan 18 12:29:02 2016 New Revision: 232506 URL: https://gcc.gnu.org/viewcvs?rev=232506&root=gcc&view=rev Log: Make SRA scalarize constant-pool loads PR target/63679 gcc/ChangeLog: * tree-sra.c (disqualified_constants, constant_decl_p): New. (sra_initialize): Allocate disqualified_constants. (sra_deinitialize): Free disqualified_constants. (disqualify_candidate): Update disqualified_constants when appropriate. (create_access): Scan for constant-pool entries as we go along. (scalarizable_type_p): Add check against type_contains_placeholder_p. (maybe_add_sra_candidate): Allow constant-pool entries. (load_assign_lhs_subreplacements): Bind debug for constant pool vars. (initialize_constant_pool_replacements): New. (sra_modify_assign): Avoid mangling assignments created by previous, and don't generate writes into constant pool. (sra_modify_function_body): Call initialize_constant_pool_replacements. gcc/testsuite/: * gcc.dg/tree-ssa/sra-17.c: New. * gcc.dg/tree-ssa/sra-18.c: New. Added: trunk/gcc/testsuite/gcc.dg/tree-ssa/sra-17.c trunk/gcc/testsuite/gcc.dg/tree-ssa/sra-18.c Modified: trunk/gcc/ChangeLog trunk/gcc/testsuite/ChangeLog trunk/gcc/tree-sra.c
[Bug middle-end/68112] [6 Regression] FAIL: gcc.target/i386/avx512ifma-vpmaddhuq-2.c (test for excess errors)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68112 alalaw01 at gcc dot gnu.org changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED --- Comment #5 from alalaw01 at gcc dot gnu.org --- Sorry - I believe this was fixed by r229660 (a reversion of the originating r229437), and should still be fixed following the alternative r229825. Can you (HJ?) please reopen if that is not the case.
[Bug target/69053] [6 Regression] ICE in build_vector_from_val
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69053 --- Comment #9 from alalaw01 at gcc dot gnu.org --- I can confirm that both Richi's patch in comment 6 and my patchlet in comment 3, pass bootstrap + check-gcc on ARM and AArch64, and fix the ICE observed on ARM. (ICE never observed on AArch64.)
[Bug tree-optimization/67682] Missed vectorization: (another) straight-line memcpy/memset not vectorized when equivalent loop is
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67682 alalaw01 at gcc dot gnu.org changed: What|Removed |Added Status|WAITING |RESOLVED Resolution|--- |FIXED --- Comment #3 from alalaw01 at gcc dot gnu.org --- Yes, r230330.
[Bug tree-optimization/69166] [6 Regression] ICE in get_initial_def_for_reduction, at tree-vect-loop.c:4188
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69166 alalaw01 at gcc dot gnu.org changed: What|Removed |Added Status|RESOLVED|REOPENED Last reconfirmed||2016-01-08 CC||alalaw01 at gcc dot gnu.org Resolution|DUPLICATE |--- Ever confirmed|0 |1 --- Comment #2 from alalaw01 at gcc dot gnu.org --- No, not a dup - 69053 results from a type mismatch/missing conversion building the initial value for a COND_EXPR; this PR is because the 'reduction' is an RDIV_EXPR, which get_initial_def_for_reduction doesn't handle. The testcase invokes undefined behaviour, too (e is not initialized). Moving 'double *e' to be a parameter to fn2 avoids the ICE.
[Bug target/69053] [6 Regression] ICE in build_vector_from_val
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69053 --- Comment #3 from alalaw01 at gcc dot gnu.org --- Well, this fixes it, but I'm not sure it fixes it in the right place... diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c index ee32166..bd66aa5 100644 --- a/gcc/tree-vect-loop.c +++ b/gcc/tree-vect-loop.c @@ -4178,7 +4178,9 @@ get_initial_def_for_reduction (gimple *stmt, tree init_val break; } } - init_def = build_vector_from_val (vectype, init_value); + init_def = build_vector_from_val (vectype, + fold_convert (TREE_TYPE (vectype), + init_value)); break; default:
[Bug target/69053] [6 Regression] ICE in build_vector_from_val
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69053 --- Comment #2 from alalaw01 at gcc dot gnu.org --- build_vector_from_val then gets called to build a vector (4) unsigned long, from an int* (which is the right signedness and size, but being a pointer it is not types_compatible_p).
[Bug target/69053] [6 Regression] ICE in build_vector_from_val
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69053 alalaw01 at gcc dot gnu.org changed: What|Removed |Added Status|UNCONFIRMED |NEW Last reconfirmed||2016-01-05 CC||alahay01 at gcc dot gnu.org Ever confirmed|0 |1 --- Comment #1 from alalaw01 at gcc dot gnu.org --- Yes. The r230423 change means x86 now has a reduc_umax_scal_optab for V4DI, causing the loop to be vectorized as a COND_REDUCTION. (It is not vectorized on e.g. AArch64, as that platform has reduc_umax_scal_optabs only for vector modes with smaller elements, not V2DI).
[Bug tree-optimization/68707] [6 Regression] testcase gcc.dg/vect/O3-pr36098.c vectorized using VEC_PERM_EXPR rather than VEC_LOAD_LANES
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68707 --- Comment #23 from alalaw01 at gcc dot gnu.org --- Yes, difficult. I'm conscious that this is stage 3, and worried about adding too much complexity, especially if we're writing code that we'd eventually drop in favour of a more complete framework later (i.e. in gcc7). I'm inclined against > (I wondered > if load-lanes would require more unrolling we should prefer SLP anyway?). As we've seen cases where load-lanes requires more unrolling but the code is still much better. Likewise your argument against > to query whether _all_ loads need to be permuted with SLP ... > thus if there is a load node which is not permuted then retain the SLP. seems convincing. I think the heuristic in comment 16 handles permutation well enough, and beyond that, sharing (rather than the permutation) then appears to be the critical factor. Unfortunately as you say SLP doesn't really handle sharing yet...so > I fear that to get a better heuristic > than what is proposed we need to push this for example to > vect_make_slp_decision where all instances are built Might be reasonable, but I fear it'd be of dubious benefit without: > and we'd need to gather some sharing data therein. I guess if that were a useful step towards > But then there is only a small step to the point where we could actually > compare SLP vs. non-SLP costs. then there is some justification, but the former feels like too much complexity at this stage - especially to do it well; how much do we really want to gather data on the sharing that exists at present, rather than looking at removing that sharing entirely? I'm thinking of e.g. SLP nodes that are performing the same computations but with different permutations too - shouldn't we be aiming at making permutations into first class citizens/operations, and making SLP trees into DAGs? Longer-term goals, sure... So my instinct is to go with the comment 16 patch, and accept that we take the hit in that last testcase (i.e. the one with the sharing).
[Bug tree-optimization/68707] [6 Regression] testcase gcc.dg/vect/O3-pr36098.c vectorized using VEC_PERM_EXPR rather than VEC_LOAD_LANES
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68707 --- Comment #21 from alalaw01 at gcc dot gnu.org --- Here's the smallest testcase I could come up with (where SLP gets cancelled, but we end up with fewer st2's than before)...the key seems to be things being used in multiple places. #define N 1024 int in1[N], in2[N]; int out1[N], out2[N]; int w[N]; void foo() { for (int i = 0; i < N; i+=2) { int a = in1[i] & in2[i]; int b = in1[i+1] & in2[i+1]; out1[i] = a; out1[i+1] = b; out2[i] = (a + w[i]) ^ (b+w[i+1]); out2[i+1] = (b + w[i]) ^ (a+w[i+1]); } }
[Bug tree-optimization/68707] [6 Regression] testcase gcc.dg/vect/O3-pr36098.c vectorized using VEC_PERM_EXPR rather than VEC_LOAD_LANES
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68707 --- Comment #20 from alalaw01 at gcc dot gnu.org --- > Would be nice to have a reduced testcase for this one. Working on it. Sadly it's fortran :( The SLP tree that gets cancelled, is quite big (and quite untreelike, if we could see that - a large portion, 7 nodes, is repeated but with the 2 stmts in each SLP node reversed). "Decided to SLP 2 instances" indeed becomes "Decided to SLP 1 instances", with Unrolling factor 2 both times. In the case where the SLP gets cancelled, several more stmts that would have featured in that tree are marked hybrid. The 'vector inside of loop cost' increases from 180 (with SLP) to 308 (if cancelled), but minimum iters for profitability stays at 3. However, the SLP-cancelled case, outputs a whole extra section note: === scheduling SLP instances === ... note: -->vectorizing SLP node starting from: (one of the loads in the cancelled tree) * 4 ... note: vectorizing stmts using SLP. (Tho I suspect that's a red herring.) Whereas later the non-cancelled case, clearly has an extra 'note: add new stmt: MEM[...] = STORE_LANES'...sounding as if perhaps the SLP finds it can use ST2 opportunistically (??).
[Bug tree-optimization/68707] [6 Regression] testcase gcc.dg/vect/O3-pr36098.c vectorized using VEC_PERM_EXPR rather than VEC_LOAD_LANES
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68707 --- Comment #18 from alalaw01 at gcc dot gnu.org --- Well, we've seen this patch fix some of the vectorizer performance regressions we've had on some benchmarks. On SPEC...the "SLP cancelled" case triggers all over the place, but in most of those cases, doesn't lead to any codegen difference. (Presumably SLP would have failed anyway for some other reason, e.g. costs, and either we generate load/store-lanes either way, or we still *can't* generate load/store-lanes...). The only sub-benchmark where codegen changes is facerec, where we seem to *lose* st2 rather than gain; this needs more analysis.
[Bug tree-optimization/68707] [6 Regression] testcase gcc.dg/vect/O3-pr36098.c vectorized using VEC_PERM_EXPR rather than VEC_LOAD_LANES
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68707 --- Comment #13 from alalaw01 at gcc dot gnu.org --- Hmmm, I realize a "definite" codegen improvement was maybe a bad choice of wording. A "substantial" (albeit uncertain!) improvement may have been more accurate... However, yes it looks like we want that patch (indeed, it still helps even when we up the cost of permute operations and drop the -fno-vect-cost-model) - so thanks, Richard. We'll clean up the testisms in due course. In the longer term, the issue here is that we aren't comparing costs of SLP vs load-lanes, right? We merely compare the cost of whichever of those vectorization strategies we favour, permutes et al, vs leaving it in scalar code?
[Bug tree-optimization/68707] [6 Regression] testcase gcc.dg/vect/O3-pr36098.c vectorized using VEC_PERM_EXPR rather than VEC_LOAD_LANES
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68707 --- Comment #10 from alalaw01 at gcc dot gnu.org --- This causes the scan-tree-dump-times 'vectorizing stmts using SLP' check to FAIL in slp-perm-{1,2,3,5,6,7,8,11}.c. Looking at the assembler before and after... slp-perm-1.c: this looks a big win; several st3's are generated instead of many stp's, we lose all the tbl's, and many constant-pool entries consisting of 'byte's are removed, with the corresponding ADRP's. The loop is fully unrolled in both cases, and the new code is much shorter (48 instructions rather than 95). slp-perm-2.c: less clear, but looks like an overall win. Loop gets unrolled by factor of 2; each "half" loses a TRN1 and a TRN2 but gains an ORR (move). slp-perm-3.c: Again we lose a load of constants and ADRPs (outside the 4-iteration loop), gaining some MOVIs. With the patchlet, the loop gets fully unrolled, and loses 4*tbl per iteration (!). Still executing 8*mul, 8*mla, 4*add, but dropping the TBLs again makes for a win. slp-perm-5.c: less clear, but again looks like an overall win. Both loops have been fully unrolled, and the combining of stores doesn't help much (we seem to gain as many moves as we lose stores!), but with the patch, we lose several TBLs and TRNs. Also an MLA becomes a MUL. A side comment would be that if we could 'fix' the register allocation here, to put things into the right place ready for the stN rather than moving it there later, we'd have quite a big win...but that's another issue. Also a recurring theme is that the vec_(load/store)_lanes approach seems to make much better use of movi, rather than pushing things into the constant pool. I haven't really looked into this, it may be fundamental, or just a limitation of our current code for loading immediates. slp-perm-6.c: some wins from constants, and dropping 8 tbls. slp-perm-7.c: Similarly. slp-perm-8.c: Loop here iterates 4 times, and the ld3/st3 manages to lose us 4*move and 9*tbl per iteration (!); huge improvement. 
slp-perm-11.c: a 16-iteration loop gets unrolled *2, and now uses an st2, but no load_lanes, just a bunch of ldr's: 10 rather than the original 3(*2). 3 strs become 4 stp's (+st2). Doesn't look like an improvement! However, 7 out of 8 cases look better, in some cases much better. So I'd say that was a definite codegen improvement :).
[Bug tree-optimization/68707] [6 Regression] testcase gcc.dg/vect/O3-pr36098.c vectorized using VEC_PERM_EXPR rather than VEC_LOAD_LANES
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68707 --- Comment #8 from alalaw01 at gcc dot gnu.org --- Adding a check against BB SLP avoids some regressions caused by bailing out of BB SLP when we can't then do a load/store-lanes.
[Bug tree-optimization/68707] [6 Regression] testcase gcc.dg/vect/O3-pr36098.c vectorized using VEC_PERM_EXPR rather than VEC_LOAD_LANES
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68707 --- Comment #6 from alalaw01 at gcc dot gnu.org --- Well, I can confirm that the patch generates load-lanes/store-lanes instead of SLP, all over the (vect) testsuite. All execution tests are passing :) so it *may* just be a case of updating a lot of scan-tree-dump tests but we'll need to do at least some performance evaluation, watch this space.
[Bug tree-optimization/68707] testcase gcc.dg/vect/O3-pr36098.c vectorized using VEC_PERM_EXPR rather than VEC_LOAD_LANES
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68707 --- Comment #1 from alalaw01 at gcc dot gnu.org --- Created attachment 36929 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36929&action=edit tree-vect-details dump (after patch, with SLP)
[Bug tree-optimization/68707] New: testcase gcc.dg/vect/O3-pr36098.c vectorized using VEC_PERM_EXPR rather than VEC_LOAD_LANES
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68707 Bug ID: 68707 Summary: testcase gcc.dg/vect/O3-pr36098.c vectorized using VEC_PERM_EXPR rather than VEC_LOAD_LANES Product: gcc Version: 6.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: alalaw01 at gcc dot gnu.org Target Milestone: --- Target: aarch64, arm Created attachment 36928 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36928&action=edit tree-vect-details dump (before patch, with LOAD_LANES) Prior to r230993, O3-pr36098.c (at -O3) was vectorized using a LOAD_LANES / STORE_LANES, resulting in: .L5: ld4 {v4.4s - v7.4s}, [x7], 64 add w4, w4, 1 cmp w3, w4 orr v1.16b, v4.16b, v4.16b orr v2.16b, v5.16b, v5.16b orr v3.16b, v6.16b, v6.16b st3 {v1.4s - v3.4s}, [x6], 48 bhi .L5 each iteration of the outer loop processes a struct of 4 ints, of which the first 3 are copied to a destination. The ld4 nicely gets us four structs with all the elements we want in three registers row-wise (and the elements we don't want in a fourth): struct1 struct2 struct3 struct4 v4.s[0] v4.s[1] v4.s[2] v4.s[3] v5.s[0] v5.s[1] v5.s[2] v5.s[3] v6.s[0] v6.s[1] v6.s[2] v6.s[3] v7.s[0] v7.s[1] v7.s[2] v7.s[3] and st3 stores the desired rows (only) to the right locations. 
Following r230993, instead the loop gets unrolled four times, four vectors are loaded sequentially, and then permuted by SLP: .L5: ldr q0, [x5, 16] add x4, x4, 48 ldr q1, [x5, 32] add w6, w6, 1 ldr q4, [x5, 48] cmp w3, w6 ldr q2, [x5], 64 orr v3.16b, v0.16b, v0.16b orr v5.16b, v4.16b, v4.16b orr v4.16b, v1.16b, v1.16b tbl v0.16b, {v0.16b - v1.16b}, v6.16b tbl v2.16b, {v2.16b - v3.16b}, v7.16b tbl v4.16b, {v4.16b - v5.16b}, v16.16b str q0, [x4, -32] str q2, [x4, -48] str q4, [x4, -16] bhi .L5 that is, we load struct1 struct2 struct3 struct4 v2.s[0] v0.s[0] v1.s[0] v4.s[0] v2.s[1] v0.s[1] v1.s[1] v4.s[1] v2.s[2] v0.s[2] v1.s[2] v4.s[2] v2.s[3] v0.s[3] v1.s[3] v4.s[3] and then permute struct1 struct2 struct3 struct4 v2.s[0] v2.s[3] v0.s[2] v4.s[1] v2.s[1] v0.s[0] v0.s[3] v4.s[2] v2.s[2] v0.s[1] v4.s[0] v4.s[3] so we then have the data 'columnwise' and store each sequentially.
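For readers less familiar with the structure load/store instructions, here is a scalar C sketch of what the ld4/st3 pair in the pre-r230993 loop does to the data. It only emulates the de-interleave and re-interleave semantics described above; the function names, the LANES constant and the regs array are all made up for illustration and say nothing about how the vectorizer represents this internally.

```c
#include <stddef.h>
#include <stdint.h>

#define LANES 4   /* 32-bit lanes in one 128-bit vector register */

/* Emulate "ld4 {v4.4s - v7.4s}, [src]": read LANES structs of four
   ints and de-interleave, so regs[f][s] holds field f of struct s.  */
void emulate_ld4 (int32_t regs[4][LANES], const int32_t *src)
{
  for (size_t s = 0; s < LANES; s++)
    for (size_t f = 0; f < 4; f++)
      regs[f][s] = src[4 * s + f];
}

/* Emulate "st3 {v1.4s - v3.4s}, [dst]": re-interleave the first three
   registers only, writing LANES structs of three ints.  The fourth
   register (the unwanted field) is simply never stored.  */
void emulate_st3 (int32_t *dst, int32_t regs[4][LANES])
{
  for (size_t s = 0; s < LANES; s++)
    for (size_t f = 0; f < 3; f++)
      dst[3 * s + f] = regs[f][s];
}

/* Copy the first three fields of each 4-int struct.  Assumes that
   n_structs is a multiple of LANES (no epilogue handling here).  */
void copy_first_three (int32_t *dst, const int32_t *src, size_t n_structs)
{
  int32_t regs[4][LANES];
  for (size_t i = 0; i < n_structs; i += LANES)
    {
      emulate_ld4 (regs, src + 4 * i);
      emulate_st3 (dst + 3 * i, regs);
    }
}
```

The SLP version computes the same result, but has to build each output vector with a tbl permute of two loaded vectors, since plain ldr/str keep the data in struct order.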
[Bug tree-optimization/68681] New: testcase gcc.dg/vect/pr45752.c fails on AArch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68681 Bug ID: 68681 Summary: testcase gcc.dg/vect/pr45752.c fails on AArch64 Product: gcc Version: 6.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: alalaw01 at gcc dot gnu.org Target Milestone: --- Target: aarch64 Created attachment 36900 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36900&action=edit tree-vect-details dump Since r231015 (https://gcc.gnu.org/ml/gcc-patches/2015-11/msg03371.html), on AArch64 we have FAIL: gcc.dg/vect/pr45752.c scan-tree-dump-times vect "gaps requires scalar epilogue loop" 0 FAIL: gcc.dg/vect/pr45752.c -flto -ffat-lto-objects scan-tree-dump-times vect "gaps requires scalar epilogue loop" 0 I attach -fdump-tree-vect-details from the non-lto case (line 5379: gcc/testsuite/gcc.dg/vect/pr45752.c:45:3: note: Data access with gaps requires scalar epilogue loop)
[Bug tree-optimization/68549] [6 Regression] ICE: in verify_loop_structure, at cfgloop.c:1669
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68549 alalaw01 at gcc dot gnu.org changed: What|Removed |Added CC||alalaw01 at gcc dot gnu.org --- Comment #8 from alalaw01 at gcc dot gnu.org --- Here's another testcase, reduced from value.c in gdb - ICEs at -O2 on (at least) x86_64 and AArch64: typedef long unsigned int size_t; extern void *xmalloc (size_t) __attribute__ ((__malloc__)) __attribute__ ((__returns_nonnull__)); struct __jmp_buf_tag { }; extern int __sigsetjmp (struct __jmp_buf_tag __env[1], int __savemask) __attribute__ ((__nothrow__)); typedef struct __jmp_buf_tag sigjmp_buf[1]; extern sigjmp_buf *exceptions_state_mc_init (void); extern int exceptions_state_mc_action_iter (void); extern void printf_unfiltered (const char *, ...) ; extern struct gdbarch *get_current_arch (void); struct internalvar { struct internalvar *next; }; static struct internalvar *internalvars; struct internalvar * create_internalvar (const char *name) { struct internalvar *var = ((struct internalvar *) xmalloc (sizeof (struct internalvar))); internalvars = var; } void show_convenience () { struct gdbarch *gdbarch = get_current_arch (); int varseen = 0; for (struct internalvar *var = internalvars; var; var = var->next) { if (!varseen) varseen = 1; sigjmp_buf *buf = exceptions_state_mc_init (); __sigsetjmp ( (*buf), 1); while (exceptions_state_mc_action_iter ()) while (exceptions_state_mc_action_iter ()) ; } if (!varseen) printf_unfiltered ( "" ); }
[Bug c/68385] New: ICE building libstdc++ on arm-none-eabi
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68385 Bug ID: 68385 Summary: ICE building libstdc++ on arm-none-eabi Product: gcc Version: 6.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c Assignee: unassigned at gcc dot gnu.org Reporter: alalaw01 at gcc dot gnu.org Target Milestone: --- Target: arm-none-eabi Created attachment 36738 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36738&action=edit Reduced testcase Starting with r230365, building gcc for arm-none-eabi falls over in libstdc++ with: /work/alalaw01/build-arm-none-eabi/obj/gcc2/./gcc/xgcc -shared-libgcc -B/work/alalaw01/build-arm-none-eabi/obj/gcc2/./gcc -nostdinc++ -L/work/alalaw01/build-arm-none-eabi/obj/gcc2/arm-none-eabi/libstdc++-v3/src -L/work/alalaw01/build-arm-none-eabi/obj/gcc2/arm-none-eabi/libstdc++-v3/src/.libs -L/work/alalaw01/build-arm-none-eabi/obj/gcc2/arm-none-eabi/libstdc++-v3/libsupc++/.libs -B/work/alalaw01/build-arm-none-eabi/install/arm-none-eabi/bin/ -B/work/alalaw01/build-arm-none-eabi/install/arm-none-eabi/lib/ -isystem /work/alalaw01/build-arm-none-eabi/install/arm-none-eabi/include -isystem /work/alalaw01/build-arm-none-eabi/install/arm-none-eabi/sys-include -I/work/alalaw01/src/gcc/libstdc++-v3/../libgcc -I/work/alalaw01/build-arm-none-eabi/obj/gcc2/arm-none-eabi/libstdc++-v3/include/arm-none-eabi -I/work/alalaw01/build-arm-none-eabi/obj/gcc2/arm-none-eabi/libstdc++-v3/include -I/work/alalaw01/src/gcc/libstdc++-v3/libsupc++ -fno-implicit-templates -Wall -Wextra -Wwrite-strings -Wcast-qual -Wabi -fdiagnostics-show-location=once -ffunction-sections -fdata-sections -frandom-seed=eh_personality.lo -O2 -g -c /work/alalaw01/src/gcc/libstdc++-v3/libsupc++/eh_personality.cc -o eh_personality.o /work/alalaw01/src/gcc/libstdc++-v3/libsupc++/eh_personality.cc: In function '_Unwind_Reason_Code __cxxabiv1::__gxx_personality_v0(_Unwind_State, _Unwind_Control_Block*, _Unwind_Context*)': /work/alalaw01/src/gcc/libstdc++-v3/libsupc++/eh_personality.cc:394:26: internal 
compiler error: tree check: expected integer_cst, have nop_expr in decompose, at tree.h:5123 UNWIND_STACK_REG)) ^ 0xf8d589 tree_check_failed(tree_node const*, char const*, int, char const*, ...) /work/alalaw01/src/gcc/gcc/tree.c:9587 0x10df3fd tree_check /work/alalaw01/src/gcc/gcc/tree.h:3212 0x10df3fd wi::int_traits::decompose(long*, unsigned int, tree_node const*) /work/alalaw01/src/gcc/gcc/tree.h:5123 0x10df3fd wide_int_ref_storage /work/alalaw01/src/gcc/gcc/wide-int.h:936 0x10df3fd generic_wide_int /work/alalaw01/src/gcc/gcc/wide-int.h:714 0x10df3fd generic_simplify_172 /work/alalaw01/build-arm-none-eabi/obj/gcc2/gcc/generic-match.c:6142 0x1113507 generic_simplify_EQ_EXPR /work/alalaw01/build-arm-none-eabi/obj/gcc2/gcc/generic-match.c:22841 0x111d719 generic_simplify(unsigned int, tree_code, tree_node*, tree_node*, tree_node*) /work/alalaw01/build-arm-none-eabi/obj/gcc2/gcc/generic-match.c:25312 0xa182c8 fold_binary_loc(unsigned int, tree_code, tree_node*, tree_node*, tree_node*) /work/alalaw01/src/gcc/gcc/fold-const.c:9138 0xa227b2 fold_build2_stat_loc(unsigned int, tree_code, tree_node*, tree_node*, tree_node*) /work/alalaw01/src/gcc/gcc/fold-const.c:12333 0x10e00cd generic_simplify_46 /work/alalaw01/build-arm-none-eabi/obj/gcc2/gcc/generic-match.c:2014 0x1112b27 generic_simplify_EQ_EXPR /work/alalaw01/build-arm-none-eabi/obj/gcc2/gcc/generic-match.c:22441 0x111d719 generic_simplify(unsigned int, tree_code, tree_node*, tree_node*, tree_node*) /work/alalaw01/build-arm-none-eabi/obj/gcc2/gcc/generic-match.c:25312 0xa182c8 fold_binary_loc(unsigned int, tree_code, tree_node*, tree_node*, tree_node*) /work/alalaw01/src/gcc/gcc/fold-const.c:9138 0xa3ec75 fold(tree_node*) /work/alalaw01/src/gcc/gcc/fold-const.c:11973 0x5bdff3 build_new_op_1 /work/alalaw01/src/gcc/gcc/cp/call.c:5730 0x5be299 build_new_op(unsigned int, tree_code, int, tree_node*, tree_node*, tree_node*, tree_node**, int) /work/alalaw01/src/gcc/gcc/cp/call.c:5803 0x70f42f build_x_binary_op(unsigned 
int, tree_code, tree_node*, tree_code, tree_node*, tree_code, tree_node**, int) /work/alalaw01/src/gcc/gcc/cp/typeck.c:3828 0x6e3b39 cp_parser_binary_expression /work/alalaw01/src/gcc/gcc/cp/parser.c:8621 0x6e3cdc cp_parser_assignment_expression /work/alalaw01/src/gcc/gcc/cp/parser.c:8742 Please submit a full bug report, with preprocessed source if appropriate. Please include the complete backtrace with any bug report. See <http://gcc.gnu.org/bugs.html> for instructions. Reduced testcase attached: $ arm-none-eabi-gcc -c reduced.cc reduced.cc: In function 'bool __gxx_personality_v0(_Unwind_State, _Unwind_Control_Block*, _Unwind_Context*)': re
[Bug tree-optimization/65963] Missed vectorization of loads strided with << when equivalent * succeeds
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65963 --- Comment #4 from alalaw01 at gcc dot gnu.org --- I confirm the testcase fails execution on armeb-none-eabi (also at -O0), but it does so both with and without the patch to tree-scalar-evolution.c, which did not change codegen (at -O2 -ftree-vectorize; the loop was not vectorized). So this looks to be exposing a different, pre-existing, bug.
[Bug tree-optimization/65963] Missed vectorization of loads strided with << when equivalent * succeeds
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65963 --- Comment #2 from alalaw01 at gcc dot gnu.org --- Author: alalaw01 Date: Thu Nov 5 18:39:38 2015 New Revision: 229825 URL: https://gcc.gnu.org/viewcvs?rev=229825&root=gcc&view=rev Log: [PATCH] tree-scalar-evolution.c: Handle LSHIFT by constant gcc/: PR tree-optimization/65963 * tree-scalar-evolution.c (interpret_rhs_expr): Try to handle LSHIFT_EXPRs as equivalent unsigned MULT_EXPRs. gcc/testsuite/: * gcc.dg/pr68112.c: New. * gcc.dg/vect/vect-strided-shift-1.c: New. Added: trunk/gcc/testsuite/gcc.dg/pr68112.c trunk/gcc/testsuite/gcc.dg/vect/vect-strided-shift-1.c Modified: trunk/gcc/ChangeLog trunk/gcc/testsuite/ChangeLog trunk/gcc/tree-scalar-evolution.c
[Bug rtl-optimization/68182] ICE in reorder_basic_blocks_simple building libitm/beginend.cc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68182 --- Comment #1 from alalaw01 at gcc dot gnu.org --- Created attachment 36636 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36636&action=edit Preprocessed source (compressed)
[Bug rtl-optimization/68182] New: ICE in reorder_basic_blocks_simple building libitm/beginend.cc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68182 Bug ID: 68182 Summary: ICE in reorder_basic_blocks_simple building libitm/beginend.cc Product: gcc Version: 6.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: rtl-optimization Assignee: unassigned at gcc dot gnu.org Reporter: alalaw01 at gcc dot gnu.org Target Milestone: --- Host: x86_64 Target: x86_64 Preprocessed source attached; command-line $ /work/alalaw01/build/./gcc/xg++ -B/work/alalaw01/build/./gcc/ -mrtm -O1 -g -m32 -c temp.ii /work/alalaw01/src/gcc/libitm/beginend.cc: In static member function ‘static uint32_t GTM::gtm_thread::begin_transaction(uint32_t, const gtm_jmpbuf*)’: /work/alalaw01/src/gcc/libitm/beginend.cc:400:1: internal compiler error: in operator[], at vec.h:714 } ^ 0x1310783 vec::operator[](unsigned int) /work/alalaw01/src/gcc/gcc/vec.h:714 0x1310783 reorder_basic_blocks_simple /work/alalaw01/src/gcc/gcc/bb-reorder.c:2322 0x1310783 reorder_basic_blocks /work/alalaw01/src/gcc/gcc/bb-reorder.c:2450 0x1310783 execute /work/alalaw01/src/gcc/gcc/bb-reorder.c:2551
[Bug tree-optimization/56118] Piecewise vector / complex initialization from constants not combined
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56118 alalaw01 at gcc dot gnu.org changed: What|Removed |Added CC||alalaw01 at gcc dot gnu.org --- Comment #5 from alalaw01 at gcc dot gnu.org --- *** Bug 68165 has been marked as a duplicate of this bug. ***
[Bug tree-optimization/68165] Not constant-folding setting vector element
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68165 alalaw01 at gcc dot gnu.org changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |DUPLICATE --- Comment #3 from alalaw01 at gcc dot gnu.org --- Seems like a duplicate of 56118 to me. *** This bug has been marked as a duplicate of bug 56118 ***
[Bug tree-optimization/68165] New: Not constant-folding setting vector element
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68165 Bug ID: 68165 Summary: Not constant-folding setting vector element Product: gcc Version: 6.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: alalaw01 at gcc dot gnu.org Target Milestone: --- I believe these two C functions are equivalent: typedef float __attribute__((__vector_size__ (2 * sizeof(float)))) float32x2_t; float32x2_t test_cprop () { float32x2_t vec = {0.0, 0.0}; vec[0] = 3.14f; vec[1] = 2.71f; return vec * ((float32x2_t) { 1.5f, 4.5f }); } float32x2_t test_cprop2 () { float32x2_t vec = {3.14f, 2.71f}; return vec * ((float32x2_t) { 1.5f, 4.5f }); } at -O3 -fdump-tree-optimized, on AArch64: = ;; Function test_cprop (test_cprop, funcdef_no=0, decl_uid=2603, cgraph_uid=0, symbol_order=0) test_cprop () { float32x2_t vec; vector(2) float vec.0_5; float32x2_t _6; : vec = { 0.0, 0.0 }; BIT_FIELD_REF = 3.141049041748046875e+0; BIT_FIELD_REF = 2.7103814697265625e+0; vec.0_5 = vec; _6 = vec.0_5 * { 1.5e+0, 4.5e+0 }; vec ={v} {CLOBBER}; return _6; } ;; Function test_cprop2 (test_cprop2, funcdef_no=1, decl_uid=2607, cgraph_uid=1, symbol_order=1) test_cprop2 () { : return { 4.7103814697265625e+0, 1.219499969482421875e+1 }; } = x86 is identical for test_cprop2, worse in test_cprop: = test_cprop () { float32x2_t vec; vector(2) float vec.0_5; float32x2_t _6; float _8; float _9; float _10; float _11; : vec = { 0.0, 0.0 }; BIT_FIELD_REF = 3.141049041748046875e+0; BIT_FIELD_REF = 2.7103814697265625e+0; vec.0_5 = vec; _8 = BIT_FIELD_REF ; _9 = _8 * 1.5e+0; _10 = BIT_FIELD_REF ; _11 = _10 * 4.5e+0; _6 = {_9, _11}; vec ={v} {CLOBBER}; return _6; } = i.e. we are not understanding the result of assigning to the BIT_FIELD_REF on the whole vector, although we can resolve individual elements: float32x2_t test_cprop3 () { float32x2_t vec = {0.0, 0.0}; vec[0] = 3.14f; vec[1] = 2.71f; return (float32x2_t) {vec[0], vec[1]} * ((float32x2_t) { 1.5f, 4.5f }); } produces = test_cprop3 () { : return { 4.7103814697265625e+0, 1.219499969482421875e+1 }; }
[Bug middle-end/68112] [6 Regression] FAIL: gcc.target/i386/avx512ifma-vpmaddhuq-2.c (test for excess errors)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68112 --- Comment #4 from alalaw01 at gcc dot gnu.org --- Sure, but gcc exploits undefinedness of multiply, so rewriting shift to multiply is not equivalent in the general case :(. One way forward might be to make definedness of overflow a bit finer-grained (either on types, i.e. TYPE_OVERFLOW_DEFINED, or maybe as a property of chrecs?)
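To make the distinction concrete, a small C sketch (my own helper names, nothing from the GCC sources): in unsigned arithmetic a shift by a constant and a multiply by the matching power of two agree for every input, because both wrap modulo 2^N; in signed arithmetic the multiply has undefined behaviour for some inputs, which is exactly the case the rewrite must not introduce.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* a << k expressed as an unsigned multiply.  Unsigned arithmetic wraps
   modulo 2^N, so this matches the shift for every value of a.  */
unsigned shift_as_unsigned_mult (unsigned a, unsigned k)
{
  assert (k < sizeof (unsigned) * CHAR_BIT);
  return a * (1u << k);
}

/* Would the signed product a * (1 << k) overflow?  Where it would, a
   signed MULT has undefined behaviour, so it cannot stand in for the
   shift without telling the optimizers something that is not true.  */
bool signed_mult_would_overflow (int a, unsigned k)
{
  assert (k < sizeof (int) * CHAR_BIT - 1);
  int factor = 1 << k;
  if (a > 0)
    return a > INT_MAX / factor;
  if (a < 0)
    return a < INT_MIN / factor;
  return false;
}
```

This is why the r229825 log talks about "equivalent unsigned MULT_EXPRs": doing the multiply in the unsigned type keeps the wrapping behaviour and avoids the cases flagged by the second helper.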
[Bug middle-end/68112] [6 Regression] FAIL: gcc.target/i386/avx512ifma-vpmaddhuq-2.c (test for excess errors)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68112 --- Comment #2 from alalaw01 at gcc dot gnu.org --- So (a << CONSTANT) is not equivalent to a * (1<
[Bug tree-optimization/67683] Missed vectorization: shifts of an induction variable
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67683 alalaw01 at gcc dot gnu.org changed: What|Removed |Added See Also||https://gcc.gnu.org/bugzill ||a/show_bug.cgi?id=35226 --- Comment #2 from alalaw01 at gcc dot gnu.org --- Is there a way to do this kind of thing other than extending polynomial_chrec's to understand operations other than addition ? Whilst beneficial, that looks to be quite a large task.
[Bug tree-optimization/57558] Loop not vectorized if iteration count could be infinite
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57558 --- Comment #4 from alalaw01 at gcc dot gnu.org --- Here's another example, extracted from another benchmark - it vectorizes if INDEX is defined to 'long' but not if INDEX is 'short': #include <stdlib.h> unsigned char *t_run_test(unsigned char *in, int N) { unsigned char *out = malloc (N); for (unsigned INDEX i = 1; i < (N - 1); i++) out[i] = ((3 * in[i]) - in[i - 1] - in[i + 1]); return out; } However, the -Wunsafe-loop-optimizations doesn't give us anything useful here: (successful case, warning printed) $ aarch64-none-elf-gcc -O3 bmark2.c -DINDEX=long -S -Wunsafe-loop-optimizations -fdump-tree-vect-details=stdout | grep vectorized bmark2.c:7:3: note: === vect_mark_stmts_to_be_vectorized === bmark2.c:7:3: note: loop vectorized bmark2.c:3:16: note: vectorized 1 loops in function. bmark2.c: In function 't_run_test': bmark2.c:3:16: warning: cannot optimize loop, the loop counter may overflow [-Wunsafe-loop-optimizations] unsigned char *t_run_test(unsigned char *in, int N) (unsuccessful case, no warning) $ aarch64-none-elf-gcc -O3 bmark2.c -DINDEX=short -S -Wunsafe-loop-optimizations -fdump-tree-vect-details=stdout | grep vectorized bmark2.c:7:3: note: not vectorized: number of iterations cannot be computed. bmark2.c:3:16: note: vectorized 0 loops in function.
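My reading of why the 'short' case has no computable iteration count (an interpretation, not something the dump states): an unsigned short counter wraps to 0 after USHRT_MAX, so for large enough N the exit test never becomes false. The sketch below spells out that condition; short_index_loop_terminates is a hypothetical helper and assumes unsigned short promotes to int, i.e. USHRT_MAX < INT_MAX.

```c
#include <limits.h>

/* Hypothetical helper: returns 1 if
     for (unsigned short i = 1; i < (n - 1); i++)
   terminates.  i never exceeds USHRT_MAX (it wraps to 0 after that), so
   the exit test can only become false when n - 1 <= USHRT_MAX.
   The arithmetic is done in long long so that n - 1 cannot overflow.  */
int short_index_loop_terminates (int n)
{
  return (long long) n - 1 <= (long long) USHRT_MAX;
}
```

With INDEX=long the counter cannot wrap before reaching any int bound, which would fit the observation that only the 'long' variant vectorizes.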
[Bug tree-optimization/67681] Missed vectorization: induction variable used after loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67681 --- Comment #2 from alalaw01 at gcc dot gnu.org --- Being stupid here, but why does the outer loop having multiple exits matter - it's the inner loop that should be vectorized? FOO was a macro used to selectively make the test i>max disappear (enabling vectorization) - the two commandlines had -DFOO=0 (vectorizes) and -DFOO=1 (doesn't).
[Bug tree-optimization/67683] New: Missed vectorization: shifts of an induction variable
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67683 Bug ID: 67683 Summary: Missed vectorization: shifts of an induction variable Product: gcc Version: 6.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: alalaw01 at gcc dot gnu.org Blocks: 53947 Target Milestone: --- This testcase: void test (unsigned char *data, int max) { unsigned short val = 0xcdef; for(int i = 0; i < max; i++) { data[i] = (unsigned char)(val & 0xff); val >>= 1; } } does not vectorize on AArch64 or x86_64 at -O3. (I haven't yet looked at whether it's a mid-end deficiency or both back-ends are missing patterns.) Referenced Bugs: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947 [Bug 53947] [meta-bug] vectorizer missed-optimizations
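One way to see what the vectorizer would have to recognise: the value stored in iteration i has a closed form in i, so the loop-carried shift is not strictly needed. The sketch below is only an illustration under that observation; test_closed_form and shift_forms_agree are hypothetical names, not part of the report, and this does not claim to be how a fix should be implemented.

```c
#include <string.h>

/* The testcase from the report: val is shifted once per iteration. */
void test (unsigned char *data, int max)
{
  unsigned short val = 0xcdef;
  for (int i = 0; i < max; i++)
    {
      data[i] = (unsigned char) (val & 0xff);
      val >>= 1;
    }
}

/* Hypothetical closed form: iteration i stores (0xcdef >> i) & 0xff.
   From i == 16 onwards every bit has been shifted out, so the result
   is 0; the guard also keeps the shift count in range.  */
void test_closed_form (unsigned char *data, int max)
{
  const unsigned int val = 0xcdef;
  for (int i = 0; i < max; i++)
    data[i] = (unsigned char) (i < 16 ? (val >> i) & 0xff : 0);
}

/* Hypothetical check: returns 1 if both loops write the same bytes. */
int shift_forms_agree (void)
{
  unsigned char a[40], b[40];
  test (a, 40);
  test_closed_form (b, 40);
  return memcmp (a, b, sizeof a) == 0;
}
```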
[Bug tree-optimization/67682] New: Missed vectorization: (another) straight-line memcpy/memset not vectorized when equivalent loop is
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67682 Bug ID: 67682 Summary: Missed vectorization: (another) straight-line memcpy/memset not vectorized when equivalent loop is Product: gcc Version: 6.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: alalaw01 at gcc dot gnu.org Target Milestone: --- Target: aarch64 This code: void test (int*__restrict a, int*__restrict b) { a[0] = b[0]; a[1] = b[1]; a[2] = b[2]; a[3] = b[3]; a[4] = 0; a[5] = 0; a[6] = 0; a[7] = 0; } is not vectorized; -fdump-tree-slp-details reveals test.c:4:13: note: Build SLP failed: different operation in stmt MEM[(int *)a_4(D) + 28B] = 0; test.c:4:13: note: original stmt *a_4(D) = _3; test.c:4:13: note: === vect_slp_analyze_data_ref_dependences === test.c:4:13: note: === vect_slp_analyze_operations === test.c:4:13: note: not vectorized: bad operation in basic block. test.c:4:13: note: * Re-trying analysis with vector size 8 ... test.c:4:13: note: Build SLP failed: different operation in stmt MEM[(int *)a_4(D) + 28B] = 0; test.c:4:13: note: original stmt *a_4(D) = _3; test.c:4:13: note: === vect_slp_analyze_data_ref_dependences === test.c:4:13: note: === vect_slp_analyze_operations === test.c:4:13: note: not vectorized: bad operation in basic block. (the failure with vector size 8 is expected, but vector size 4 should succeed) Output is: test: ldp w4, w3, [x1] ldp w2, w1, [x1, 8] stp w4, w3, [x0] stp w2, w1, [x0, 8] stp wzr, wzr, [x0, 16] stp wzr, wzr, [x0, 24] ret Curiously, a similar code but writing elements a[0..3] and a[5..8] (missing out a[4]) is SLP'd, producing superior: test: ldr q0, [x1] movi v1.4s, 0 str q1, [x0, 20] str q0, [x0] ret And similarly for (equivalent to the first): void test (int*__restrict a, int*__restrict b) { for (int i = 0; i < 4; i++) a[i] = b[i]; for (int i = 4; i < 8; i++) a[i] = 0; } producing: test: movi v0.4s, 0 ldp x2, x3, [x1] stp x2, x3, [x0] str q0, [x0, 16] ret
[Bug tree-optimization/67681] New: Missed vectorization: induction variable used after loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67681 Bug ID: 67681 Summary: Missed vectorization: induction variable used after loop Product: gcc Version: 6.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: alalaw01 at gcc dot gnu.org Target Milestone: --- The inner loop here: void addlog2 (int *data) { int i = 1; for (int j=0; j<=30; j++) { int max = 1 << j; if (FOO && i>max) break; for (; i <= max; i++) data[i] += j; } } does not vectorize if the if(FOO...) is present: $ /work/alalaw01/build-aarch64-none-elf/install/bin/aarch64-none-elf-gcc -S -O2 -ftree-vectorize -fdump-tree-vect-details=stdout loop9b.c -DFOO=1 | grep vectorized loop9b.c:1:6: note: not vectorized: inner-loop count not invariant. loop9b.c:8:5: note: === vect_mark_stmts_to_be_vectorized === loop9b.c:8:5: note: not vectorized: value used after loop. loop9b.c:8:5: note: === vect_mark_stmts_to_be_vectorized === loop9b.c:8:5: note: not vectorized: value used after loop. loop9b.c:1:6: note: vectorized 0 loops in function. $ aarch64-none-elf-gcc -S -O2 -ftree-vectorize -fdump-tree-vect-details=stdout loop9b.c -DFOO=0 | grep vectorized loop9b.c:4:3: note: not vectorized: inner-loop count not invariant. loop9b.c:8:5: note: === vect_mark_stmts_to_be_vectorized === loop9b.c:8:5: note: loop vectorized loop9b.c:1:6: note: vectorized 1 loops in function. Same with -O3. Of course clever analysis could figure out that i>max is never true, but even without that, we should be able to get 'i' back afterwards.
[Bug middle-end/65965] Straight-line memcpy/memset not vectorized when equivalent loop is
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65965 --- Comment #4 from alalaw01 at gcc dot gnu.org --- (In reply to Richard Biener from comment #3) > Fixed for GCC 6. Indeed. I note that the same testcase does _not_ SLP/vectorize if I use consecutive indices: void test (int*__restrict a, int*__restrict b) { a[0] = b[0]; a[1] = b[1]; a[2] = b[2]; a[3] = b[3]; a[4] = 0; a[5] = 0; a[6] = 0; a[7] = 0; } loop26a.c:6:13: note: Build SLP failed: different operation in stmt MEM[(int *)a_4(D) + 28B] = 0; loop26a.c:6:13: note: original stmt *a_4(D) = _3; loop26a.c:6:13: note: === vect_slp_analyze_data_ref_dependences === loop26a.c:6:13: note: === vect_slp_analyze_operations === loop26a.c:6:13: note: not vectorized: bad operation in basic block. Worth another bug?
[Bug tree-optimization/67283] GCC regression over inlining of returned structures
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67283 --- Comment #13 from alalaw01 at gcc dot gnu.org --- Author: alalaw01 Date: Fri Sep 18 10:55:11 2015 New Revision: 227901 URL: https://gcc.gnu.org/viewcvs?rev=227901&root=gcc&view=rev Log: completely_scalarize arrays as well as records. gcc/: PR tree-optimization/67283 * tree-sra.c (type_consists_of_records_p): Rename to... (scalarizable_type_p): ...this, add case for ARRAY_TYPE. (completely_scalarize_record): Rename to... (completely_scalarize): ...this, add ARRAY_TYPE case, move some code to: (scalarize_elem): New. (analyze_all_variable_accesses): Follow renamings. gcc/testsuite/: * gcc.dg/tree-ssa/sra-15.c: New. * gcc.dg/tree-ssa/sra-16.c: New. Added: trunk/gcc/testsuite/gcc.dg/tree-ssa/sra-15.c trunk/gcc/testsuite/gcc.dg/tree-ssa/sra-16.c Modified: trunk/gcc/ChangeLog trunk/gcc/testsuite/ChangeLog trunk/gcc/tree-sra.c
[Bug target/63870] [Aarch64] [ARM] Errors in use of NEON intrinsics are reported incorrectly
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63870 --- Comment #10 from alalaw01 at gcc dot gnu.org --- Author: alalaw01 Date: Tue Sep 8 19:43:39 2015 New Revision: 227557 URL: https://gcc.gnu.org/viewcvs?rev=227557&root=gcc&view=rev Log: [ARM/AArch64 Testsuite] Add float16 lane_f16_indices tests PR target/63870 * gcc.target/aarch64/advsimd-intrinsics/vld2_lane_f16_indices_1.c: New. * gcc.target/aarch64/advsimd-intrinsics/vld2q_lane_f16_indices_1.c: New. * gcc.target/aarch64/advsimd-intrinsics/vld3_lane_f16_indices_1.c: New. * gcc.target/aarch64/advsimd-intrinsics/vld3q_lane_f16_indices_1.c: New. * gcc.target/aarch64/advsimd-intrinsics/vld4_lane_f16_indices_1.c: New. * gcc.target/aarch64/advsimd-intrinsics/vld4q_lane_f16_indices_1.c: New. * gcc.target/aarch64/advsimd-intrinsics/vst2_lane_f16_indices_1.c: New. * gcc.target/aarch64/advsimd-intrinsics/vst2q_lane_f16_indices_1.c: New. * gcc.target/aarch64/advsimd-intrinsics/vst3_lane_f16_indices_1.c: New. * gcc.target/aarch64/advsimd-intrinsics/vst3q_lane_f16_indices_1.c: New. * gcc.target/aarch64/advsimd-intrinsics/vst4_lane_f16_indices_1.c: New. * gcc.target/aarch64/advsimd-intrinsics/vst4q_lane_f16_indices_1.c: New.
Added: trunk/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vld2_lane_f16_indices_1.c trunk/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vld2q_lane_f16_indices_1.c trunk/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vld3_lane_f16_indices_1.c trunk/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vld3q_lane_f16_indices_1.c trunk/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vld4_lane_f16_indices_1.c trunk/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vld4q_lane_f16_indices_1.c trunk/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vst2_lane_f16_indices_1.c trunk/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vst2q_lane_f16_indices_1.c trunk/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vst3_lane_f16_indices_1.c trunk/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vst3q_lane_f16_indices_1.c trunk/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vst4_lane_f16_indices_1.c trunk/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vst4q_lane_f16_indices_1.c Modified: trunk/gcc/testsuite/ChangeLog
[Bug target/67439] ICE: unrecognizable insn compiling arm-fp16 testcases with -march=armv7-a and -mrestrict-it
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67439 alalaw01 at gcc dot gnu.org changed: What|Removed |Added Status|UNCONFIRMED |NEW Last reconfirmed||2015-09-03 CC||alalaw01 at gcc dot gnu.org Ever confirmed|0 |1 --- Comment #2 from alalaw01 at gcc dot gnu.org --- I can reproduce the ICE with -mthumb, both "-mfloat-abi=hard -mfpu=neon" and "-mfloat-abi=soft", but only with -mrestrict-it in both cases. "-mfloat-abi=hard -mfpu=neon-fp16" is OK with and without -mrestrict-it. I note the movhf patterns in vfp.md are only usable with neon-fp16; in other cases, we appear to be using arm32_movhf in arm.md.
[Bug tree-optimization/67283] GCC regression over inlining of returned structures
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67283 --- Comment #12 from alalaw01 at gcc dot gnu.org --- Author: alalaw01 Date: Fri Aug 28 15:04:17 2015 New Revision: 227303 URL: https://gcc.gnu.org/viewcvs?rev=227303&root=gcc&view=rev Log: Revert: completely_scalarize arrays as well as records gcc/: Revert: 2015-08-27 Alan Lawrence PR tree-optimization/67283 * tree-sra.c (type_consists_of_records_p): Rename to... (scalarizable_type_p): ...this, add case for ARRAY_TYPE. (completely_scalarize_record): Rename to... (completely_scalarize): ...this, add ARRAY_TYPE case, move some code to: (scalarize_elem): New. gcc/testsuite/: Revert: 2015-08-27 Alan Lawrence * gcc.dg/tree-ssa/sra-15.c: New. Removed: trunk/gcc/testsuite/gcc.dg/tree-ssa/sra-15.c Modified: trunk/gcc/ChangeLog trunk/gcc/testsuite/ChangeLog trunk/gcc/tree-sra.c
[Bug tree-optimization/67283] GCC regression over inlining of returned structures
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67283 alalaw01 at gcc dot gnu.org changed: What|Removed |Added CC||alalaw01 at gcc dot gnu.org --- Comment #8 from alalaw01 at gcc dot gnu.org --- I believe this should now be fixed. Do we want a testcase, and if so is there a good way to scan for the stack usage pattern (as observed in the assembler)? One can scan-assembler times for addq.*%rsp, but fixing the constant 72 seems rather fragile, and I don't see a dejagnu way to scan for the constant being the same in each demoN()... And the case of unions is still not handled!!
[Bug tree-optimization/67283] GCC regression over inlining of returned structures
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67283 --- Comment #7 from alalaw01 at gcc dot gnu.org --- Author: alalaw01 Date: Thu Aug 27 15:40:10 2015 New Revision: 227265 URL: https://gcc.gnu.org/viewcvs?rev=227265&root=gcc&view=rev Log: completely_scalarize arrays as well as records gcc/: PR tree-optimization/67283 * tree-sra.c (type_consists_of_records_p): Rename to... (scalarizable_type_p): ...this, add case for ARRAY_TYPE. (completely_scalarize_record): Rename to... (completely_scalarize): ...this, add ARRAY_TYPE case, move some code to: (scalarize_elem): New. gcc/testsuite/: * gcc.dg/tree-ssa/sra-15.c: New. Added: trunk/gcc/testsuite/gcc.dg/tree-ssa/sra-15.c Modified: trunk/gcc/ChangeLog trunk/gcc/testsuite/ChangeLog trunk/gcc/tree-sra.c
[Bug target/63679] [5/6 Regression][AArch64] Failure to constant fold.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63679 --- Comment #37 from alalaw01 at gcc dot gnu.org --- Hmmm, no it's not the hashing - that pretty much ignores all types. It's the comparison in hashable_expr_equal_p, which just uses operand_equal_p, specifically this part (in fold-const.c): case MEM_REF: /* Require equal access sizes, and similar pointer types. We can have incomplete types for array references of variable-sized arrays from the Fortran frontend though. Also verify the types are compatible. */ if (!((TYPE_SIZE (TREE_TYPE (arg0)) == TYPE_SIZE (TREE_TYPE (arg1)) || (TYPE_SIZE (TREE_TYPE (arg0)) && TYPE_SIZE (TREE_TYPE (arg1)) && operand_equal_p (TYPE_SIZE (TREE_TYPE (arg0)), TYPE_SIZE (TREE_TYPE (arg1)), flags))) && types_compatible_p (TREE_TYPE (arg0), TREE_TYPE (arg1)) && ((flags & OEP_ADDRESS_OF) || (alias_ptr_types_compatible_p (TREE_TYPE (TREE_OPERAND (arg0, 1)), TREE_TYPE (TREE_OPERAND (arg1, 1))) && (MR_DEPENDENCE_CLIQUE (arg0) == MR_DEPENDENCE_CLIQUE (arg1)) && (MR_DEPENDENCE_BASE (arg0) == MR_DEPENDENCE_BASE (arg1)) && (TYPE_ALIGN (TREE_TYPE (arg0)) == TYPE_ALIGN (TREE_TYPE (arg1))) specifically, a pointer to int, and a pointer to an array of int, are not alias_ptr_types_compatible_p. (I'm not clear that they should be, either!?)
[Bug target/63679] [5/6 Regression][AArch64] Failure to constant fold.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63679 --- Comment #35 from alalaw01 at gcc dot gnu.org --- So it should be happening in dom2. On x86, input to dom2 is vect_cst_.9_31 = { 0, 1, 2, 3 }; [...]MEM[(int *)&a] = vect_cst_.9_31; [...]vect__13.3_20 = MEM[(int *)&a]; resulting in: Optimizing statement vect_cst_.9_31 = { 0, 1, 2, 3 }; LKUP STMT vect_cst_.9_31 = { 0, 1, 2, 3 } ASGN vect_cst_.9_31 = { 0, 1, 2, 3 } ... Optimizing statement MEM[(int *)&a] = vect_cst_.9_31; Replaced 'vect_cst_.9_31' with constant '{ 0, 1, 2, 3 }' LKUP STMT MEM[(int *)&a] = { 0, 1, 2, 3 } with .MEM_3(D) LKUP STMT { 0, 1, 2, 3 } = MEM[(int *)&a] with .MEM_3(D) LKUP STMT { 0, 1, 2, 3 } = MEM[(int *)&a] with .MEM_17 2>>> STMT { 0, 1, 2, 3 } = MEM[(int *)&a] with .MEM_17 ... Optimizing statement vect__13.3_20 = MEM[(int *)&a]; LKUP STMT vect__13.3_20 = MEM[(int *)&a] with .MEM_21 FIND: { 0, 1, 2, 3 } Replaced redundant expr 'MEM[(int *)&a]' with '{ 0, 1, 2, 3 }' My version has input to dom2: vect_cst_.8_27 = { 0, 1, 2, 3 }; [...]MEM[(int[8] *)&a] = vect_cst_.8_27; [...]vect__8.3_20 = MEM[(int *)&a]; Optimizing statement vect_cst_.8_27 = { 0, 1, 2, 3 }; LKUP STMT vect_cst_.8_27 = { 0, 1, 2, 3 } ASGN vect_cst_.8_27 = { 0, 1, 2, 3 } ... Optimizing statement MEM[(int[8] *)&a] = vect_cst_.8_27; Replaced 'vect_cst_.8_27' with constant '{ 0, 1, 2, 3 }' LKUP STMT MEM[(int[8] *)&a] = { 0, 1, 2, 3 } with .MEM_3(D) LKUP STMT { 0, 1, 2, 3 } = MEM[(int[8] *)&a] with .MEM_3(D) LKUP STMT { 0, 1, 2, 3 } = MEM[(int[8] *)&a] with .MEM_17 2>>> STMT { 0, 1, 2, 3 } = MEM[(int[8] *)&a] with .MEM_17 ... Optimizing statement vect__8.3_20 = MEM[(int *)&a]; LKUP STMT vect__8.3_20 = MEM[(int *)&a] with .MEM_21 2>>> STMT vect__8.3_20 = MEM[(int *)&a] with .MEM_21 Which looks like MEM[(int *)&a] and MEM[(int[8] *)&a] are hashing differently and hence dom2 is not finding it. 
Could be that I need my SRA to output something closer to a[1] = 1; where I currently have MEM[(int[8] *)&a + 4B] = 1; but also feel that those two statements hashing differently is not really helpful!
[Bug target/63679] [5/6 Regression][AArch64] Failure to constant fold.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63679 alalaw01 at gcc dot gnu.org changed: What|Removed |Added CC||alalaw01 at gcc dot gnu.org --- Comment #32 from alalaw01 at gcc dot gnu.org --- Is the SRA approach going to work? I have hacked up my SRA so that it generates this: foo () { int sum; int i; const int a[8]; unsigned int i.0_7; int _8; unsigned int i.0_19; : MEM[(int[8] *)&a] = 0; MEM[(int[8] *)&a + 4B] = 1; MEM[(int[8] *)&a + 8B] = 2; MEM[(int[8] *)&a + 12B] = 3; MEM[(int[8] *)&a + 16B] = 4; MEM[(int[8] *)&a + 20B] = 5; MEM[(int[8] *)&a + 24B] = 6; MEM[(int[8] *)&a + 28B] = 7; i.0_19 = 0; if (i.0_19 != 8) goto ; else goto ; : # i_20 = PHI # sum_21 = PHI _8 = a[i_20]; sum_9 = sum_21 + _8; i_10 = i_20 + 1; i.0_7 = (unsigned int) i_10; if (i.0_7 != 8) goto ; else goto ; : # sum_22 = PHI a ={v} {CLOBBER}; return sum_22; } the vectorizer then transforms to: ... : MEM[(int[8] *)&a] = 0; MEM[(int[8] *)&a + 4B] = 1; MEM[(int[8] *)&a + 8B] = 2; MEM[(int[8] *)&a + 12B] = 3; MEM[(int[8] *)&a + 16B] = 4; MEM[(int[8] *)&a + 20B] = 5; MEM[(int[8] *)&a + 24B] = 6; MEM[(int[8] *)&a + 28B] = 7; : # i_20 = PHI <0(2), i_10(4)> # sum_21 = PHI <0(2), sum_9(4)> # ivtmp_19 = PHI <8(2), ivtmp_22(4)> # vectp_a.1_1 = PHI <&a(2), vectp_a.1_2(4)> # vect_sum_9.4_17 = PHI <{ 0, 0, 0, 0 }(2), vect_sum_9.4_23(4)> # ivtmp_27 = PHI <0(2), ivtmp_28(4)> vect__8.3_18 = MEM[(int *)vectp_a.1_1]; _8 = a[i_20]; vect_sum_9.4_23 = vect__8.3_18 + vect_sum_9.4_17; sum_9 = _8 + sum_21; i_10 = i_20 + 1; ivtmp_22 = ivtmp_19 - 1; vectp_a.1_2 = vectp_a.1_1 + 16; ivtmp_28 = ivtmp_27 + 1; if (ivtmp_28 < 2) goto ; else goto ; : goto ; : # sum_7 = PHI # vect_sum_9.4_24 = PHI stmp_sum_9.5_25 = [reduc_plus_expr] vect_sum_9.4_24; vect_sum_9.6_26 = stmp_sum_9.5_25 + 0; a ={v} {CLOBBER}; return vect_sum_9.6_26; } and the optimized tree is: foo () { int vect_sum_9.6; int stmp_sum_9.5; vector(4) int vect_sum_9.4; const vector(4) int vect__8.3; const int a[8]; : MEM[(int[8] *)&a] = { 0, 1, 2, 3 }; MEM[(int[8] *)&a 
+ 16B] = { 4, 5, 6, 7 }; vect__8.3_20 = MEM[(int *)&a]; vect__8.3_18 = MEM[(int *)&a + 16B]; vect_sum_9.4_23 = vect__8.3_18 + vect__8.3_20; stmp_sum_9.5_25 = [reduc_plus_expr] vect_sum_9.4_23; vect_sum_9.6_26 = stmp_sum_9.5_25; a ={v} {CLOBBER}; return vect_sum_9.6_26; } final assembly is: ldr q1, .LC1 sub sp, sp, #32 ldr q0, .LC2 add sp, sp, 32 add v0.4s, v0.4s, v1.4s addv s0, v0.4s umov w0, v0.s[0] ret which is a slight improvement, but not really what we are looking for...
[Bug target/66964] Assembler error during ARM cross compile
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66964 --- Comment #7 from alalaw01 at gcc dot gnu.org --- No new regressions bootstrapping that path on gcc-5-branch (--with-arch=armv7-a --with-fpu=neon-fp16 --with-float=hard). However, compiling the testcase with -dp reveals the bad strd's are actually coming from the *movdf_vfp pattern in vfp.md.
[Bug target/66964] Assembler error during ARM cross compile
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66964 --- Comment #6 from alalaw01 at gcc dot gnu.org --- Bootstrap+test in progress FYI. However, that patch *does not* fix this failure; there must be some other route.
[Bug target/66791] New: Replace builtins with gcc vector extensions code
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66791 Bug ID: 66791 Summary: Replace builtins with gcc vector extensions code Product: gcc Version: 6.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: alalaw01 at gcc dot gnu.org Blocks: 47562 Target Milestone: --- Target: arm Lots of ARM neon intrinsics are implemented using builtins backing onto patterns in neon.md. These are opaque to the midend, but we could rewrite them using equivalent gcc vector operations, that would be transparent to the midend but would still eventually be turned into the same instructions. This would enable more optimization in the midend. Many of the AArch64 intrinsics have been implemented in this way so AArch64 arm_neon.h may provide some useful templates. Referenced Bugs: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=47562 [Bug 47562] [meta-bug] keep track of Neon Intrinsics enhancements
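As a rough sketch of the kind of rewrite meant here (names prefixed sketch_ are hypothetical, chosen to avoid clashing with arm_neon.h, and this is not the actual header code): a lane-wise add can be written with the generic vector extension, leaving the midend free to fold and combine it, while the backend can still select the same vector add instruction.

```c
#include <stdint.h>

/* Four 32-bit lanes, declared with the generic GNU C vector extension
   rather than a target-specific builtin type.  */
typedef int32_t sketch_int32x4_t __attribute__ ((vector_size (16)));

/* Hypothetical stand-in for an add intrinsic: a plain '+' on vector
   types, which the midend understands, instead of an opaque builtin.  */
static inline sketch_int32x4_t
sketch_vaddq_s32 (sketch_int32x4_t a, sketch_int32x4_t b)
{
  return a + b;
}

/* Hypothetical check: adds two constant vectors and returns one lane
   (lane must be in 0..3).  */
int32_t sketch_vaddq_lane (int lane)
{
  sketch_int32x4_t a = { 1, 2, 3, 4 };
  sketch_int32x4_t b = { 10, 20, 30, 40 };
  sketch_int32x4_t sum = sketch_vaddq_s32 (a, b);
  return sum[lane];
}
```

With constant inputs like these, one would hope the midend folds the whole add away, which an opaque builtin would prevent.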
[Bug target/65956] [5/6 Regression] Another ARM overaligned arg passing issue
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65956 --- Comment #6 from alalaw01 at gcc dot gnu.org --- Author: alalaw01 Date: Mon Jul 6 17:37:50 2015 New Revision: 225470 URL: https://gcc.gnu.org/viewcvs?rev=225470&root=gcc&view=rev Log: Backport r225466: tests from 'Fix eipa_src AAPCS issue (PR target/65956)' 2015-05-05 Jakub Jelinek PR target/65956 * gcc.c-torture/execute/pr65956.c: New test. Added: branches/gcc-5-branch/gcc/testsuite/gcc.c-torture/execute/pr65956.c Modified: branches/gcc-5-branch/gcc/testsuite/ChangeLog
[Bug target/65956] [5/6 Regression] Another ARM overaligned arg passing issue
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65956 --- Comment #5 from alalaw01 at gcc dot gnu.org --- Author: alalaw01 Date: Mon Jul 6 17:32:07 2015 New Revision: 225469 URL: https://gcc.gnu.org/viewcvs?rev=225469&root=gcc&view=rev Log: 2015-07-06 Alan Lawrence Backport from mainline r225465 2015-07-06 Alan Lawrence gcc/: PR target/65956 * config/arm/arm.c (arm_needs_doubleword_align): Drop any outer alignment attribute, exploring one level down for records and arrays. gcc/testsuite/: * gcc.target/arm/aapcs/align1.c: New. * gcc.target/arm/aapcs/align_rec1.c: New. * gcc.target/arm/aapcs/align2.c: New. * gcc.target/arm/aapcs/align_rec2.c: New. * gcc.target/arm/aapcs/align3.c: New. * gcc.target/arm/aapcs/align_rec3.c: New. * gcc.target/arm/aapcs/align4.c: New. * gcc.target/arm/aapcs/align_rec4.c: New. * gcc.target/arm/aapcs/align_vararg1.c: New. * gcc.target/arm/aapcs/align_vararg2.c: New. Added: branches/gcc-5-branch/gcc/testsuite/gcc.target/arm/aapcs/align1.c branches/gcc-5-branch/gcc/testsuite/gcc.target/arm/aapcs/align2.c branches/gcc-5-branch/gcc/testsuite/gcc.target/arm/aapcs/align3.c branches/gcc-5-branch/gcc/testsuite/gcc.target/arm/aapcs/align4.c branches/gcc-5-branch/gcc/testsuite/gcc.target/arm/aapcs/align_rec1.c branches/gcc-5-branch/gcc/testsuite/gcc.target/arm/aapcs/align_rec2.c branches/gcc-5-branch/gcc/testsuite/gcc.target/arm/aapcs/align_rec3.c branches/gcc-5-branch/gcc/testsuite/gcc.target/arm/aapcs/align_rec4.c branches/gcc-5-branch/gcc/testsuite/gcc.target/arm/aapcs/align_vaarg1.c branches/gcc-5-branch/gcc/testsuite/gcc.target/arm/aapcs/align_vaarg2.c Modified: branches/gcc-5-branch/gcc/ChangeLog branches/gcc-5-branch/gcc/config/arm/arm.c branches/gcc-5-branch/gcc/testsuite/ChangeLog
[Bug target/65956] [5/6 Regression] Another ARM overaligned arg passing issue
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65956 --- Comment #4 from alalaw01 at gcc dot gnu.org --- Author: alalaw01 Date: Mon Jul 6 17:06:00 2015 New Revision: 225466 URL: https://gcc.gnu.org/viewcvs?rev=225466&root=gcc&view=rev Log: Fix eipa_src AAPCS issue (PR target/65956) 2015-05-05 Jakub Jelinek PR target/65956 * gcc.c-torture/execute/pr65956.c: New test. Added: trunk/gcc/testsuite/gcc.c-torture/execute/pr65956.c Modified: trunk/gcc/testsuite/ChangeLog
[Bug target/65956] [5/6 Regression] Another ARM overaligned arg passing issue
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65956 --- Comment #3 from alalaw01 at gcc dot gnu.org --- Author: alalaw01 Date: Mon Jul 6 16:58:16 2015 New Revision: 225465 URL: https://gcc.gnu.org/viewcvs?rev=225465&root=gcc&view=rev Log: [ARM] PR/65956 AAPCS update for alignment attribute gcc/: PR target/65956 * config/arm/arm.c (arm_needs_doubleword_align): Drop any outer alignment attribute, exploring one level down for records and arrays. gcc/testsuite/: * gcc.target/arm/aapcs/align1.c: New. * gcc.target/arm/aapcs/align_rec1.c: New. * gcc.target/arm/aapcs/align2.c: New. * gcc.target/arm/aapcs/align_rec2.c: New. * gcc.target/arm/aapcs/align3.c: New. * gcc.target/arm/aapcs/align_rec3.c: New. * gcc.target/arm/aapcs/align4.c: New. * gcc.target/arm/aapcs/align_rec4.c: New. * gcc.target/arm/aapcs/align_vararg1.c: New. * gcc.target/arm/aapcs/align_vararg2.c: New. Added: trunk/gcc/testsuite/gcc.target/arm/aapcs/align1.c trunk/gcc/testsuite/gcc.target/arm/aapcs/align2.c trunk/gcc/testsuite/gcc.target/arm/aapcs/align3.c trunk/gcc/testsuite/gcc.target/arm/aapcs/align4.c trunk/gcc/testsuite/gcc.target/arm/aapcs/align_rec1.c trunk/gcc/testsuite/gcc.target/arm/aapcs/align_rec2.c trunk/gcc/testsuite/gcc.target/arm/aapcs/align_rec3.c trunk/gcc/testsuite/gcc.target/arm/aapcs/align_rec4.c trunk/gcc/testsuite/gcc.target/arm/aapcs/align_vaarg1.c trunk/gcc/testsuite/gcc.target/arm/aapcs/align_vaarg2.c Modified: trunk/gcc/ChangeLog trunk/gcc/config/arm/arm.c trunk/gcc/testsuite/ChangeLog
[Bug middle-end/65946] Simple loop with if-statement not vectorized
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65946 alalaw01 at gcc dot gnu.org changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #2 from alalaw01 at gcc dot gnu.org --- Author: alalaw01 Date: Thu Jul 2 12:47:31 2015 New Revision: 225311 URL: https://gcc.gnu.org/viewcvs?rev=225311&root=gcc&view=rev Log: gcc/: * tree-pass.h (make_pass_ch_vect): New. * passes.def: Add pass_ch_vect just before pass_if_conversion. * tree-ssa-loop-ch.c (ch_base, pass_ch_vect, pass_data_ch_vect, pass_ch::process_loop_p, pass_ch_vect::process_loop_p, make_pass_ch_vect): New. (pass_ch): Extend ch_base. (pass_ch::execute): Move all but loop_optimizer_init/finalize to... (ch_base::copy_headers): ...here. gcc/testsuite/: * gcc.dg/vect/vect-strided-a-u16-i4.c (main1): Narrow scope of x,y,z,w. * gcc.dg/vect/vect-ifcvt-11.c: New testcase.
[Bug tree-optimization/53947] [meta-bug] vectorizer missed-optimizations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947 Bug 53947 depends on bug 65946, which changed state. Bug 65946 Summary: Simple loop with if-statement not vectorized https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65946 What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED
[Bug target/64134] (vector float){0, 0, b, a} Uses stores when it does not need to
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64134 alalaw01 at gcc dot gnu.org changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED --- Comment #4 from alalaw01 at gcc dot gnu.org --- Fixed by r29.
[Bug tree-optimization/57600] Turn 2 comparisons into 1 with the min
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57600 alalaw01 at gcc dot gnu.org changed: What|Removed |Added CC||alalaw01 at gcc dot gnu.org --- Comment #5 from alalaw01 at gcc dot gnu.org --- (In reply to Marc Glisse from comment #2) > > Or do we want to do the transformation always, and maybe have something > later (in RTL?) to undo it if it didn't help? > > Note that in some experiments with more meat in the loop, having > i some optimizations. Can you give an example where it not only doesn't help, but actually hurts? Are they all just because of not seeing analysis properties, i.e. we could get there by realizing a<=min(a,...) and looking far enough to see a
[Bug target/65952] [AArch64] Will not vectorize storing induction of pointer addresses for LP64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65952 --- Comment #8 from alalaw01 at gcc dot gnu.org --- (In reply to alalaw01 from comment #7) > (In reply to Richard Biener from comment #6) > > So aarch64 has no DImode vectors? Or just no DImode multiply (but it has a > > DImode vector shift?). > > Yes, the latter. Sorry, aarch64 has a DImode multiply, but no V2DImode multiply; and it has V2DImode shifts.
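To make the multiply/shift distinction concrete, here is a sketch using generic GNU C vectors; the factor 16 is an assumption for illustration and the sketch_/scale_ names are hypothetical. A multiply of two 64-bit lanes by a power of two can be expressed as a lane-wise shift, which is the form with a V2DImode instruction.

```c
/* Two 64-bit lanes, i.e. the V2DImode shape discussed above. */
typedef unsigned long long sketch_v2di __attribute__ ((vector_size (16)));

/* Lane-wise multiply by 16: the form with no V2DImode instruction. */
sketch_v2di scale_by_mul (sketch_v2di x)
{
  return x * (sketch_v2di) { 16, 16 };
}

/* The same scaling as a lane-wise shift by 4, which does exist. */
sketch_v2di scale_by_shift (sketch_v2di x)
{
  return x << (sketch_v2di) { 4, 4 };
}

/* Hypothetical check: returns 1 if both forms agree on a sample input. */
int scale_forms_agree (void)
{
  sketch_v2di x = { 3, 1000000007ULL };
  sketch_v2di m = scale_by_mul (x);
  sketch_v2di s = scale_by_shift (x);
  return m[0] == s[0] && m[1] == s[1];
}
```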
[Bug target/65952] [AArch64] Will not vectorize storing induction of pointer addresses for LP64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65952 --- Comment #7 from alalaw01 at gcc dot gnu.org --- (In reply to Richard Biener from comment #6) > So aarch64 has no DImode vectors? Or just no DImode multiply (but it has a > DImode vector shift?). Yes, the latter.
[Bug target/65952] [AArch64] Will not vectorize storing induction of pointer addresses for LP64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65952 --- Comment #5 from alalaw01 at gcc dot gnu.org --- So the above example tends to get fully unrolled, but even on an example with 32 ptrs rather than 4, yes the vectorizer fails because of the multiplication - but the multiplication is gone by the final tree stage, as it's strength reduced down to an add; I believe this -fdump-tree-optimized would be perfectly vectorizable: loop () { unsigned long ivtmp.12; unsigned long ivtmp.10; void * _4; struct my_struct * _7; struct my_struct * pretmp_11; unsigned long _20; : pretmp_11 = array; ivtmp.10_16 = (unsigned long) pretmp_11; ivtmp.12_2 = (unsigned long) &ptrs; _20 = (unsigned long) &MEM[(void *)&ptrs + 256B]; : # ivtmp.10_10 = PHI # ivtmp.12_15 = PHI _7 = (struct my_struct *) ivtmp.10_10; _4 = (void *) ivtmp.12_15; MEM[base: _4, offset: 0B] = _7; ivtmp.10_1 = ivtmp.10_10 + 16; ivtmp.12_14 = ivtmp.12_15 + 8; if (ivtmp.12_14 != _20) goto ; else goto ; : return; }
[Bug tree-optimization/61171] vectorization fails for a reduction in presence of subtraction
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61171 alalaw01 at gcc dot gnu.org changed: What|Removed |Added CC||alalaw01 at gcc dot gnu.org --- Comment #2 from alalaw01 at gcc dot gnu.org --- This vectorizes fine, if vv is made a local variable: float isOk() { float vv = 0; for (int j=0U; j