[Bug target/87561] [9 Regression] 416.gamess is slower by ~10% starting from r264866 with -Ofast
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87561

Richard Biener changed:

           What    |Removed     |Added
   Status          |ASSIGNED    |RESOLVED
   Resolution      |---         |FIXED

--- Comment #16 from Richard Biener ---
Fixed.  I'll open two enhancement PRs for this testcase.
[Bug target/87561] [9 Regression] 416.gamess is slower by ~10% starting from r264866 with -Ofast
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87561

--- Comment #15 from Richard Biener ---
Author: rguenth
Date: Mon Mar 18 09:17:43 2019
New Revision: 269754

URL: https://gcc.gnu.org/viewcvs?rev=269754&root=gcc&view=rev
Log:
2019-03-18  Richard Biener

        PR target/87561
        * config/i386/i386.c (ix86_add_stmt_cost): Pessimize strided
        loads and stores a bit more.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.c
[Bug target/87561] [9 Regression] 416.gamess is slower by ~10% starting from r264866 with -Ofast
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87561

--- Comment #14 from Richard Biener ---
Author: rguenth
Date: Mon Mar 18 09:16:56 2019
New Revision: 269753

URL: https://gcc.gnu.org/viewcvs?rev=269753&root=gcc&view=rev
Log:
2019-03-18  Richard Biener

        PR target/87561
        * config/i386/i386.c (ix86_add_stmt_cost): Apply strided
        load pessimization to stores as well.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.c
[Bug target/87561] [9 Regression] 416.gamess is slower by ~10% starting from r264866 with -Ofast
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87561

--- Comment #13 from Richard Biener ---
433.milc          9180     336       27.4 *    9180     349       26.3 S
433.milc          9180     335       27.4 S    9180     340       27.0 *
433.milc          9180     344       26.7 S    9180     334       27.5 S
450.soplex        8340     225       37.1 *    8340     223       37.5 S
450.soplex        8340     226       36.9 S    8340     228       36.5 S
450.soplex        8340     223       37.4 S    8340     223       37.3 *
482.sphinx3      19490     386       50.5 *   19490     392       49.8 S
482.sphinx3      19490     384       50.7 S   19490     374       52.1 *
482.sphinx3      19490     394       49.5 S   19490     368       53.0 S

Comparing the fastest runtimes, this is a progression for both 433.milc and
482.sphinx3 and makes no difference for 450.soplex.  I'll post the patch.

For GCC 10 we'd want to experiment with applying the cost model to the whole
loop nest instead.
[Bug target/87561] [9 Regression] 416.gamess is slower by ~10% starting from r264866 with -Ofast
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87561

--- Comment #12 from Richard Biener ---
So I tested this with a one-off run of SPEC CPU 2006 on a Haswell machine,
which shows the expected improvement on 416.gamess but also possible
regressions for 433.milc (340s -> 343s), 450.soplex (223s -> 226s) and
482.sphinx3 (383s -> 391s).  Re-checking those with a 3-run now.
[Bug target/87561] [9 Regression] 416.gamess is slower by ~10% starting from r264866 with -Ofast
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87561

--- Comment #11 from Richard Biener ---
Btw, it is exactly the current pessimization of vector construction that makes
the AVX256 variant not profitable:

  0x40e04e0 *co_99(D)[_53] 1 times vec_construct costs 112 in body

That's because we multiply the "real" cost (three inserts, 28) by
TYPE_VECTOR_SUBPARTS (four) in the x86 add_stmt_cost hook.  For the SSE2 case
that results in "only" a factor of two.

Changing that "arbitrary" scaling to * (TYPE_VECTOR_SUBPARTS + 1) doesn't
help.  We can add equivalent handling to catch strided stores, but that
doesn't help on its own either.  Doing both makes us not vectorize, though.

Index: gcc/config/i386/i386.c
===================================================================
--- gcc/config/i386/i386.c      (revision 269683)
+++ gcc/config/i386/i386.c      (working copy)
@@ -50534,14 +50534,15 @@ ix86_add_stmt_cost (void *data, int coun
      latency and execution resources for the many scalar loads
      (AGU and load ports).  Try to account for this by scaling the
      construction cost by the number of elements involved.  */
-  if (kind == vec_construct
+  if ((kind == vec_construct || kind == vec_to_scalar)
       && stmt_info
-      && STMT_VINFO_TYPE (stmt_info) == load_vec_info_type
+      && (STMT_VINFO_TYPE (stmt_info) == load_vec_info_type
+         || STMT_VINFO_TYPE (stmt_info) == store_vec_info_type)
       && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_ELEMENTWISE
       && TREE_CODE (DR_STEP (STMT_VINFO_DATA_REF (stmt_info))) != INTEGER_CST)
     {
       stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
-      stmt_cost *= TYPE_VECTOR_SUBPARTS (vectype);
+      stmt_cost *= (TYPE_VECTOR_SUBPARTS (vectype) + 1);
     }
   if (stmt_cost == -1)
     stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
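To make the arithmetic in that dump concrete, here is a minimal standalone
sketch (plain C, not GCC's ix86_add_stmt_cost; only the base cost of 28 and
the subparts counts are taken from the numbers quoted above) of the two
scaling variants the patch above plays with:

/* Illustrative sketch of the strided vec_construct pessimization; the
   values come from the cost dump above, the function is not GCC code.  */
#include <stdio.h>

static unsigned
scaled_cost (unsigned base_cost, unsigned subparts, int plus_one)
{
  /* trunk scales by TYPE_VECTOR_SUBPARTS; the experiment scales by
     (TYPE_VECTOR_SUBPARTS + 1).  */
  return base_cost * (subparts + (plus_one ? 1u : 0u));
}

int
main (void)
{
  /* AVX256 V4DF construct: three inserts cost 28, four subparts -> 112.  */
  printf ("V4DF, current scaling:    %u\n", scaled_cost (28, 4, 0));
  printf ("V4DF, SUBPARTS+1 scaling: %u\n", scaled_cost (28, 4, 1));
  return 0;
}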
[Bug target/87561] [9 Regression] 416.gamess is slower by ~10% starting from r264866 with -Ofast
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87561

--- Comment #10 from Richard Biener ---
(In reply to Michael Matz from comment #9)
> (In reply to Richard Biener from comment #8)
> >
> > I'm out of ideas suitable for GCC 9 (besides reverting the patch, reverting
> > to bogus state).
>
> Either that or some hack (e.g. artificially avoiding vectorization if
> runtime checks are necessary and the loop-nest isn't a box but a pyramid).
> Whatever we do it's better to release GCC with internal bogus state than to
> release GCC with a known 10% performance regression (you could revert only
> on the release branch so that the regression stays in trunk).

So for example we cost 18 stmts in the scalar loop body and 32 stmts in the
vector loop body.  That's unfortunately still a savings of 4 compared to a
vectorization-factor unrolled scalar body.  The ratio of vector builds from
scalars to other stmts is 6 : 26; if you factor in vector decompositions as
well it's 8 : 24.

Given we've had issues with too eagerly doing strided loads / stores in other
cases, I'd say a heuristic using that would make more sense than one based on
runtime alias checks and/or loop nest structure.

Btw, I don't think avoiding a 10% regression in an obsolete benchmark
(SPEC 2006) is more important than not feeding garbage into the cost model...
(we've never assessed positive results from that change).
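For clarity, the "savings of 4" follows from a vectorization factor of 2 (an
inference from the V2DF/SSE2 code shown in comment #8, not stated explicitly
above): a VF-unrolled scalar body costs 2 * 18 = 36 stmts against 32 stmts
for the vector body, i.e. 36 - 32 = 4 in the vector body's favour.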
[Bug target/87561] [9 Regression] 416.gamess is slower by ~10% starting from r264866 with -Ofast
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87561

--- Comment #9 from Michael Matz ---
(In reply to Richard Biener from comment #8)
>
> I'm out of ideas suitable for GCC 9 (besides reverting the patch, reverting
> to bogus state).

Either that or some hack (e.g. artificially avoiding vectorization if runtime
checks are necessary and the loop-nest isn't a box but a pyramid).  Whatever
we do, it's better to release GCC with internal bogus state than to release
GCC with a known 10% performance regression (you could revert only on the
release branch so that the regression stays in trunk).
[Bug target/87561] [9 Regression] 416.gamess is slower by ~10% starting from r264866 with -Ofast
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87561

--- Comment #8 from Richard Biener ---
Re-checking today, we reject AVX vectorization via the cost model but do SSE
vectorization.  With versioning for alias we could also SLP vectorize this,
keeping the loop body smaller and avoiding an epilogue - especially since we
end up without any vector load or store anyway.  Of course SLP analysis
requires a grouped store, which we do not have since we do not identify
XPQKL(MPQ,MKL) and XPQKL(MRS,MKL) as such (they aren't one when MPQ == MRS,
but the runtime alias check ensures that's not the case).  That is, we miss
"strided group" detection, or in general SLP forming via a different
mechanism.

That said, I have a hard time thinking of a heuristic aligning with reality
(it's of course possible to come up with a hack).  Generally we'd need to
work towards doing the versioning / cost model checks on outer loops, but the
better versioning condition would be a prerequisite for this.

I'm out of ideas suitable for GCC 9 (besides reverting the patch, reverting
to bogus state).

Scalar inner loop assembly:

.L8:
        vmulsd  (%rax,%rdi,8), %xmm3, %xmm0
        incl    %ecx
        vfmadd231sd     (%rax), %xmm4, %xmm0
        vfmadd213sd     (%rdx), %xmm6, %xmm0
        vmovsd  %xmm0, (%rdx)
        vmulsd  (%rax,%r8,8), %xmm1, %xmm0
        vfmadd231sd     (%rax,%r10,8), %xmm2, %xmm0
        addq    %r15, %rax
        vfmadd213sd     (%rdx,%rsi,8), %xmm5, %xmm0
        vmovsd  %xmm0, (%rdx,%rsi,8)
        addq    %rbp, %rdx
        cmpl    %r9d, %ecx
        jne     .L8

vectorized inner loop assembly:

.L9:
        vmovsd  (%r10,%rcx), %xmm13
        vmovsd  (%rdx), %xmm0
        incl    %r14d
        vmovhpd (%r10,%rsi), %xmm13, %xmm13
        vmovhpd (%rdx,%r13), %xmm0, %xmm14
        vmovsd  (%rdi,%rcx), %xmm0
        vmulpd  %xmm9, %xmm13, %xmm13
        vmovhpd (%rdi,%rsi), %xmm0, %xmm0
        vfmadd132pd     %xmm10, %xmm13, %xmm0
        vfmadd132pd     %xmm12, %xmm14, %xmm0
        vmovlpd %xmm0, (%rdx)
        vmovhpd %xmm0, (%rdx,%r13)
        vmovsd  (%r8,%rcx), %xmm13
        vmovsd  (%rax), %xmm0
        addq    %r11, %rdx
        vmovhpd (%r8,%rsi), %xmm13, %xmm13
        vmovhpd (%rax,%r13), %xmm0, %xmm14
        vmovsd  (%r9,%rcx), %xmm0
        addq    %rbx, %rcx
        vmulpd  %xmm7, %xmm13, %xmm13
        vmovhpd (%r9,%rsi), %xmm0, %xmm0
        addq    %rbx, %rsi
        vfmadd132pd     %xmm8, %xmm13, %xmm0
        vfmadd132pd     %xmm11, %xmm14, %xmm0
        vmovlpd %xmm0, (%rax)
        vmovhpd %xmm0, (%rax,%r13)
        addq    %r11, %rax
        cmpl    %r14d, %r15d
        jne     .L9

Only the outer loop context and knowledge of the low trip count make this
bad.  The cost modeling doesn't know the scalar loop can execute as if
vectorized given the CPU's plentiful resources (speculating non-dependence),
whereas the vector variant introduces more constraints on pipelining due to
data dependences from using vectors.  But even IACA doesn't tell us the
differences are big.
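For readers without the benchmark source at hand, a hedged C rendering of the
loop nest being discussed (hypothetical parameter names; the original is the
Fortran DO 30 loop quoted in comment #3).  It shows why the two XPQKL stores
are a "strided group" candidate: per MKL step both stride by the same ldx
elements and only overlap when mpq == mrs, which the runtime alias check
rules out.

/* Illustrative rendering, not the 416.gamess source; xpqkl and co are
   column-major 2D arrays with leading dimensions ldx and ldc.  */
void
xpqkl_kernel (double *xpqkl, const double *co, long ldx, long ldc,
              long mpq, long mrs, long noc,
              long mp, long mq, long mr, long ms,
              double val1, double val3)
{
  long mkl = 0;
  for (long mk = 0; mk < noc; ++mk)
    for (long ml = 0; ml <= mk; ++ml, ++mkl)
      {
        /* Two strided stores with the same (runtime) step ldx.  */
        xpqkl[mpq + mkl * ldx]
          += val1 * (co[ms + mk * ldc] * co[mr + ml * ldc]
                     + co[ms + ml * ldc] * co[mr + mk * ldc]);
        xpqkl[mrs + mkl * ldx]
          += val3 * (co[mq + mk * ldc] * co[mp + ml * ldc]
                     + co[mq + ml * ldc] * co[mp + mk * ldc]);
      }
}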
[Bug target/87561] [9 Regression] 416.gamess is slower by ~10% starting from r264866 with -Ofast
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87561

--- Comment #7 from Richard Biener ---
(In reply to rsand...@gcc.gnu.org from comment #5)
> (In reply to Richard Biener from comment #4)
> > Another thing is the too complicated alias check where for
> >
> > (gdb) p debug_data_reference (dr_a.dr)
> > #(Data Ref:
> > #  bb: 14
> > #  stmt: _28 = *xpqkl_172(D)[_27];
> > #  ref: *xpqkl_172(D)[_27];
> > #  base_object: *xpqkl_172(D);
> > #  Access function 0: {(((integer(kind=8)) mkl_203 + 1) * stride.33_148 +
> > offset.34_149) + _480, +, stride.33_148}_6
> > #)
> > $9 = void
> > (gdb) p debug_data_reference (dr_b.dr)
> > #(Data Ref:
> > #  bb: 14
> > #  stmt: *xpqkl_172(D)[_50] = _65;
> > #  ref: *xpqkl_172(D)[_50];
> > #  base_object: *xpqkl_172(D);
> > #  Access function 0: {(((integer(kind=8)) mkl_203 + 1) * stride.33_148 +
> > offset.34_149) + _486, +, stride.33_148}_6
> > #)
> >
> > we generate
> >
> > (ssizetype) (((sizetype) integer(kind=8)) mkl_203 + 1) * stride.33_148 +
> > offset.34_149) + (integer(kind=8)) (_19 + jpack_161)) + (sizetype)
> > stride.33_148) * 8) < (ssizetype) ((sizetype) integer(kind=8)) mkl_203 +
> > 1) * stride.33_148 + offset.34_149) + (integer(kind=8)) (_22 + lpack_164)) *
> > 8) || (ssizetype) (((sizetype) integer(kind=8)) mkl_203 + 1) *
> > stride.33_148 + offset.34_149) + (integer(kind=8)) (_22 + lpack_164)) +
> > (sizetype) stride.33_148) * 8) < (ssizetype) ((sizetype)
> > integer(kind=8)) mkl_203 + 1) * stride.33_148 + offset.34_149) +
> > (integer(kind=8)) (_19 + jpack_161)) * 8)
> >
> > instead of simply _480 != _486 (well, OK, not _that_ simple).
> >
> > I guess we miss many of the "optimizations" we do when dealing with
> > alias checks for constant steps.  In this case sth obvious would be
> > to special-case DR_STEP (dra) == DR_STEP (drb).  Richard?
> Not sure that would help much with the existing optimisations.
> I think the closest we get is create_intersect_range_checks_index,
> but "all" that avoids is scaling the index by the element size
> and adding the common base.  I guess the expensive bit here is
> multiplying by the stride, but the index-based check would still
> do that.
>
> That said, create_intersect_range_checks_index does feel like it
> might be a bit *too* conservative (but I'm not brave enough to relax it)

One thing I notice above is that we do

  (ssizetype) ((sizetype)X * 8) < (ssizetype) ((sizetype)Y * 8)

that is, we do a signed comparison but do the multiplication in a type that
allows wrapping.  I suppose this is an artifact of using DR_OFFSET and
friends.

If dependence analysis (which really looks at the access functions if the
bases are compatible) were able to return non-constant distance vectors, it
would return _231 - _225 as the distance, which we could runtime-check
against the vectorization factor.  I suppose that's a feasible trick to try
when code-generating the dependence check.

Note that for 416.gamess it looks like NOC is just 5, but MPQ and MRS are
such that there is no runtime aliasing between iterations most of the time
(sometimes they are indeed equal).  The cost model check skips the vector
loop for MK == 2 and 3 and will only execute it for MK == 4 and 5.

An alternative for this kind of loop nest would be to cost-model for
MK % 2 == 0, thus requiring no epilogue loop.
A hack for doing the above is sth like the following, which I think would
also work for more than one subscript by combining the tests with ||.  I
think we need to actually test against the vectorization factor here, and we
can ignore negative distances unless ddr_reversed, etc.  Unfortunately
compute_affine_dependence frees the subscripts so we cannot compute the
"variable" distance vector during dependence analysis and store it away -
thus "hack" ;)

diff --git a/gcc/tree-data-ref.c b/gcc/tree-data-ref.c
index 69c5f7b28ae..8973a4557d7 100644
--- a/gcc/tree-data-ref.c
+++ b/gcc/tree-data-ref.c
@@ -1823,6 +1823,30 @@ create_intersect_range_checks (struct loop *loop, tree *cond_expr,
   if (create_intersect_range_checks_index (loop, cond_expr, dr_a, dr_b))
     return;

+  auto_vec loop_nest;
+  bool res = find_loop_nest (loop, &loop_nest);
+  gcc_assert (res);
+  ddr_p ddr = initialize_data_dependence_relation (dr_a.dr, dr_b.dr, loop_nest);
+  if (DDR_SUBSCRIPTS (ddr).length () == 1)
+    {
+      tree fna = SUB_ACCESS_FN (DDR_SUBSCRIPTS (ddr)[0], 0);
+      tree fnb = SUB_ACCESS_FN (DDR_SUBSCRIPTS (ddr)[0], 1);
+      tree diff = chrec_fold_minus (TREE_TYPE (fna), fna, fnb);
+      if (!chrec_contains_undetermined (diff)
+          && !tree_contains_chrecs (diff, NULL))
+        {
+          free_dependence_relation (ddr);
+          if (TYPE_UNSIGNED (TREE_TYPE (diff)))
+            diff = fold_convert (signed_type_for (TREE_TYPE (diff)), diff);
+          *cond_expr = fold_build2 (GE_EXPR, boolean_type_node,
+                                    fold_build1 (ABS_EXPR,
+                                                 TREE_TYPE (diff), diff),
+
[Bug target/87561] [9 Regression] 416.gamess is slower by ~10% starting from r264866 with -Ofast
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87561

--- Comment #6 from Richard Biener ---
Created attachment 44820
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=44820&action=edit
reduced testcase

Reduced testcase.
[Bug target/87561] [9 Regression] 416.gamess is slower by ~10% starting from r264866 with -Ofast
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87561

--- Comment #5 from rsandifo at gcc dot gnu.org ---
(In reply to Richard Biener from comment #4)
> Another thing is the too complicated alias check where for
>
> (gdb) p debug_data_reference (dr_a.dr)
> #(Data Ref:
> #  bb: 14
> #  stmt: _28 = *xpqkl_172(D)[_27];
> #  ref: *xpqkl_172(D)[_27];
> #  base_object: *xpqkl_172(D);
> #  Access function 0: {(((integer(kind=8)) mkl_203 + 1) * stride.33_148 +
> offset.34_149) + _480, +, stride.33_148}_6
> #)
> $9 = void
> (gdb) p debug_data_reference (dr_b.dr)
> #(Data Ref:
> #  bb: 14
> #  stmt: *xpqkl_172(D)[_50] = _65;
> #  ref: *xpqkl_172(D)[_50];
> #  base_object: *xpqkl_172(D);
> #  Access function 0: {(((integer(kind=8)) mkl_203 + 1) * stride.33_148 +
> offset.34_149) + _486, +, stride.33_148}_6
> #)
>
> we generate
>
> (ssizetype) (((sizetype) integer(kind=8)) mkl_203 + 1) * stride.33_148 +
> offset.34_149) + (integer(kind=8)) (_19 + jpack_161)) + (sizetype)
> stride.33_148) * 8) < (ssizetype) ((sizetype) integer(kind=8)) mkl_203 +
> 1) * stride.33_148 + offset.34_149) + (integer(kind=8)) (_22 + lpack_164)) *
> 8) || (ssizetype) (((sizetype) integer(kind=8)) mkl_203 + 1) *
> stride.33_148 + offset.34_149) + (integer(kind=8)) (_22 + lpack_164)) +
> (sizetype) stride.33_148) * 8) < (ssizetype) ((sizetype)
> integer(kind=8)) mkl_203 + 1) * stride.33_148 + offset.34_149) +
> (integer(kind=8)) (_19 + jpack_161)) * 8)
>
> instead of simply _480 != _486 (well, OK, not _that_ simple).
>
> I guess we miss many of the "optimizations" we do when dealing with
> alias checks for constant steps.  In this case sth obvious would be
> to special-case DR_STEP (dra) == DR_STEP (drb).  Richard?

Not sure that would help much with the existing optimisations.
I think the closest we get is create_intersect_range_checks_index,
but "all" that avoids is scaling the index by the element size
and adding the common base.  I guess the expensive bit here is
multiplying by the stride, but the index-based check would still
do that.

That said, create_intersect_range_checks_index does feel like it
might be a bit *too* conservative (but I'm not brave enough to relax it).
[Bug target/87561] [9 Regression] 416.gamess is slower by ~10% starting from r264866 with -Ofast
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87561

Richard Biener changed:

           What    |Removed                 |Added
   Keywords        |                        |missed-optimization
   CC              |                        |rsandifo at gcc dot gnu.org

--- Comment #4 from Richard Biener ---
Another thing is the too complicated alias check where for

(gdb) p debug_data_reference (dr_a.dr)
#(Data Ref:
#  bb: 14
#  stmt: _28 = *xpqkl_172(D)[_27];
#  ref: *xpqkl_172(D)[_27];
#  base_object: *xpqkl_172(D);
#  Access function 0: {(((integer(kind=8)) mkl_203 + 1) * stride.33_148 +
offset.34_149) + _480, +, stride.33_148}_6
#)
$9 = void
(gdb) p debug_data_reference (dr_b.dr)
#(Data Ref:
#  bb: 14
#  stmt: *xpqkl_172(D)[_50] = _65;
#  ref: *xpqkl_172(D)[_50];
#  base_object: *xpqkl_172(D);
#  Access function 0: {(((integer(kind=8)) mkl_203 + 1) * stride.33_148 +
offset.34_149) + _486, +, stride.33_148}_6
#)

we generate

(ssizetype) (((sizetype) integer(kind=8)) mkl_203 + 1) * stride.33_148 +
offset.34_149) + (integer(kind=8)) (_19 + jpack_161)) + (sizetype)
stride.33_148) * 8) < (ssizetype) ((sizetype) integer(kind=8)) mkl_203 +
1) * stride.33_148 + offset.34_149) + (integer(kind=8)) (_22 + lpack_164)) *
8) || (ssizetype) (((sizetype) integer(kind=8)) mkl_203 + 1) *
stride.33_148 + offset.34_149) + (integer(kind=8)) (_22 + lpack_164)) +
(sizetype) stride.33_148) * 8) < (ssizetype) ((sizetype)
integer(kind=8)) mkl_203 + 1) * stride.33_148 + offset.34_149) +
(integer(kind=8)) (_19 + jpack_161)) * 8)

instead of simply _480 != _486 (well, OK, not _that_ simple).

I guess we miss many of the "optimizations" we do when dealing with
alias checks for constant steps.  In this case sth obvious would be
to special-case DR_STEP (dra) == DR_STEP (drb).  Richard?
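For illustration, a hedged sketch (hypothetical helper and parameter names,
not GCC code) of the simpler runtime check this suggests when both data
references share the same base and step: instead of the interval-overlap test
above, compare the distance between the starting offsets against the vector
span, in the spirit of the distance-vector idea in comment #7.

/* Illustrative only: two accesses with identical base and step are
   independent for vectorization purposes if their starting offsets are
   at least one vector's worth of bytes apart (vf_bytes = vectorization
   factor * element size; negative-distance subtleties ignored).  */
#include <stdlib.h>
#include <stdbool.h>

static bool
accesses_independent (long offset_a, long offset_b, long vf_bytes)
{
  return labs (offset_a - offset_b) >= vf_bytes;
}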
[Bug target/87561] [9 Regression] 416.gamess is slower by ~10% starting from r264866 with -Ofast
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87561

Richard Biener changed:

           What    |Removed     |Added
   CC              |            |matz at gcc dot gnu.org

--- Comment #3 from Richard Biener ---
OK, so re-running perf gives me a more reasonable result (-march=native on
Haswell):

Overhead       Samples  Command          Shared Object                   Symbol
  15.59%        754868  gamess_peak.amd  gamess_peak.amd64-m64-gcc42-nn  [.] forms_
  15.55%        749452  gamess_base.amd  gamess_base.amd64-m64-gcc42-nn  [.] forms_
  10.77%        496796  gamess_base.amd  gamess_base.amd64-m64-gcc42-nn  [.] twotff_
   7.58%        377894  gamess_base.amd  gamess_base.amd64-m64-gcc42-nn  [.] dirfck_
   7.57%        375587  gamess_peak.amd  gamess_peak.amd64-m64-gcc42-nn  [.] dirfck_
   7.01%        328685  gamess_peak.amd  gamess_peak.amd64-m64-gcc42-nn  [.] twotff_
   4.98%        243101  gamess_base.amd  gamess_base.amd64-m64-gcc42-nn  [.] xyzint_
   4.03%        197815  gamess_peak.amd  gamess_peak.amd64-m64-gcc42-nn  [.] xyzint_

with the already noticed loop where there are apparently not enough
iterations to warrant vectorization and the cost model check gets in the way.
xyzint_ looks similar.

Note that

      DO 30 MK=1,NOC
      DO 30 ML=1,MK
         MKL = MKL+1
         XPQKL(MPQ,MKL) = XPQKL(MPQ,MKL) +
     *        VAL1*(CO(MS,MK)*CO(MR,ML)+CO(MS,ML)*CO(MR,MK))
         XPQKL(MRS,MKL) = XPQKL(MRS,MKL) +
     *        VAL3*(CO(MQ,MK)*CO(MP,ML)+CO(MQ,ML)*CO(MP,MK))
   30 CONTINUE

shows that the inner loop will first iterate once, then twice, and so on.
That makes hoisting the cost model check impossible and also makes the alias
check non-invariant in the outer loop.  It also means that if we
code-generated the iteration cost model check, loop splitting might get the
idea of splitting the outer loop ... (but loop splitting runs before
vectorization, of course).

So in this very case, if we analyze the scalar evolution of the number of
iterations of the loop we want to vectorize, we get back {0, +, 1}_5 -- that's
certainly something we could factor in when computing the vectorization cost.
It would increase the prologue/epilogue cost, but it wouldn't make
vectorization never profitable (we know nothing about the upper bound of the
number of iterations).
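To make the trip counts concrete: with NOC = 5 (the value comment #7 reports
for 416.gamess) the inner loop runs 1, 2, 3, 4 and 5 times across the outer
iterations, 5*6/2 = 15 iterations in total, and per comment #7 the cost-model
check lets the vector body execute only for the MK = 4 and MK = 5 outer
iterations.  The {0, +, 1}_5 niter evolution is exactly this linearly growing
trip count.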
[Bug target/87561] [9 Regression] 416.gamess is slower by ~10% starting from r264866 with -Ofast
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87561

--- Comment #2 from Richard Biener ---
OK, so on Haswell I see (- is bad, + is good):

-0x2342ca0 _40 + _45 1 times scalar_stmt costs 12 in body
+0x2342ca0 _40 + _45 1 times scalar_stmt costs 4 in body

so a simple add changes cost from 4 to 12 with the patch.  Ah, so that goes

      switch (subcode)
        {
        case PLUS_EXPR:
        case POINTER_PLUS_EXPR:
        case MINUS_EXPR:
          if (kind == scalar_stmt)
            {
              if (SSE_FLOAT_MODE_P (mode) && TARGET_SSE_MATH)
                stmt_cost = ix86_cost->addss;
              else if (X87_FLOAT_MODE_P (mode))
                stmt_cost = ix86_cost->fadd;
              else
                stmt_cost = ix86_cost->add;
            }

where with kind == scalar_stmt we now run into the SSE_FLOAT_MODE_P case
(previously mode was sth like V2DFmode) and thus use ix86_cost->addss instead
of ix86_cost->add.  That's more correct.

That causes us to (for example) now vectorize mccas.fppized.f:3160 where we
previously figured vectorization is never profitable.  The loop looks like

      DO 10 MK=1,NOC
      DO 10 ML=1,MK
         MKL = MKL+1
         XPQKL(MPQ,MKL) = XPQKL(MPQ,MKL) +
     *        VAL1*(CO(MS,MK)*CO(MR,ML)+CO(MS,ML)*CO(MR,MK))
         XPQKL(MRS,MKL) = XPQKL(MRS,MKL) +
     *        VAL3*(CO(MQ,MK)*CO(MP,ML)+CO(MQ,ML)*CO(MP,MK))
   10 CONTINUE

and requires versioning for aliasing as well as strided loads and strided
stores.  We're too trigger-happy doing that, it seems.  Also, the vector
version isn't entered at all at runtime.

But that's not the 10%.  And the big offenders from looking at perf output do
not have any vectorization decision changes... very strange.
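The direction of the effect is worth spelling out (illustrative arithmetic,
using only the costs 4 and 12 from the dump above): the vectorizer compares
scalar-body cost against vector-body cost, so if a body contains, say, four
FP adds, the old model charged 4 * 4 = 16 for them while the corrected model
charges 4 * 12 = 48.  The vector body's cost is unchanged, so loops that
previously looked "never profitable" to vectorize can now pass the threshold
- which is exactly what happens at mccas.fppized.f:3160.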
[Bug target/87561] [9 Regression] 416.gamess is slower by ~10% starting from r264866 with -Ofast
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87561

Richard Biener changed:

           What    |Removed                       |Added
   Target          |                              |x86_64-*-*, i?86-*-*
   Status          |UNCONFIRMED                   |ASSIGNED
   Last reconfirmed|                              |2018-10-09
   Assignee        |unassigned at gcc dot gnu.org |rguenth at gcc dot gnu.org
   Target Milestone|---                           |9.0
   Ever confirmed  |0                             |1

--- Comment #1 from Richard Biener ---
Confirmed.  I'll have a look.