On Wed, 31 Jan 2018, Christophe Lyon wrote:
> On 30 January 2018 at 11:47, Jakub Jelinek <[email protected]> wrote:
> > On Tue, Jan 30, 2018 at 11:07:50AM +0100, Richard Biener wrote:
> >>
> >> I have been asked to push this change, fixing (somewhat) the impreciseness
> >> of costing constant/invariant vector uses in SLP stmts. The previous
> >> code always just considered a single constant to be generated in the
> >> prologue irrespective of how many we'd need. With this patch we
> >> properly handle this count and optimize for the case when we can use
> >> a vector splat. It doesn't yet handle CSE (or CSE among stmts) which
> >> means it could in theory regress cases it overall costed correctly
> >> before "optimistically" (aka by accident). But at least the costing
> >> now matches code generation.
> >>
> >> Bootstrapped and tested on x86_64-unknown-linux-gnu. On x86_64
> >> Haswell with AVX2 SPEC 2k6 shows no off-noise changes.
> >>
> >> The patch is said to help the case in the PR when additional backend
> >> costing changes are done (for AVX512).
> >>
> >> Ok for trunk at this stage?
> >
> > LGTM.
> >
> >> 2018-01-30 Richard Biener <[email protected]>
> >>
> >> PR tree-optimization/83008
> >> * tree-vect-slp.c (vect_analyze_slp_cost_1): Properly cost
> >> invariant and constant vector uses in stmts when they need
> >> more than one stmt.
> >
> > Jakub
>
> Hi Richard,
>
> This patch caused a regression on aarch64*:
> FAIL: gcc.dg/cse_recip.c scan-tree-dump-times optimized "rdiv_expr" 1
> (found 2 times)
> we used to have:
> PASS: gcc.dg/cse_recip.c scan-tree-dump-times optimized "rdiv_expr" 1
We now vectorize this on aarch64 - looks like a V2SFmode is
available. This means we end up computing 1/x and also dividing
by {x, x}. That is non-optimal: SLP vectorization leaves the
scalar division around as dead code, and the multi-use check of
the recip pass trips over it, making the transform look profitable.
That's worth a bug report I think.
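For reference, a minimal sketch of the kind of pattern the recip pass
targets (hypothetical code, not the actual gcc.dg/cse_recip.c source):

```c
#include <assert.h>

/* Hypothetical sketch (not the actual gcc.dg/cse_recip.c source) of the
   pattern pass_cse_reciprocals targets: two divisions by the same value x.
   With -funsafe-math-optimizations the pass rewrites this as t = 1.f / x
   followed by two multiplications, leaving a single rdiv_expr where there
   were two -- which is what the scan-tree-dump-times pattern counts.
   The pass only fires when the divisor has multiple division uses; a dead
   scalar division left behind by SLP vectorization inflates that count.  */
static void
scale_pair (float x, float *a)
{
  a[0] = a[0] / x;   /* after the transform: a[0] = a[0] * t; */
  a[1] = a[1] / x;   /* after the transform: a[1] = a[1] * t; */
}
```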
For the testcase I'd simply adjust it to pass -fno-tree-slp-vectorize
-- or make sure to run the recip pass before vectorization. Not
sure why it only runs after loop optimizations?
Index: gcc/passes.def
===================================================================
--- gcc/passes.def (revision 257233)
+++ gcc/passes.def (working copy)
@@ -263,6 +263,7 @@ along with GCC; see the file COPYING3.
       NEXT_PASS (pass_asan);
       NEXT_PASS (pass_tsan);
       NEXT_PASS (pass_dce);
+      NEXT_PASS (pass_cse_reciprocals);
       /* Pass group that runs when 1) enabled, 2) there are loops
          in the function.  Make sure to run pass_fix_loops before
          to discover/remove loops before running the gate function
@@ -317,7 +318,6 @@ along with GCC; see the file COPYING3.
       POP_INSERT_PASSES ()
       NEXT_PASS (pass_simduid_cleanup);
       NEXT_PASS (pass_lower_vector_ssa);
-      NEXT_PASS (pass_cse_reciprocals);
       NEXT_PASS (pass_sprintf_length, true);
       NEXT_PASS (pass_reassoc, false /* insert_powi_p */);
       NEXT_PASS (pass_strength_reduction);
puts it right before loop opts and after a DCE pass. This results
in us no longer vectorizing the code:
Vector inside of basic block cost: 4
Vector prologue cost: 4
Vector epilogue cost: 0
Scalar cost of basic block: 6
/space/rguenther/src/svn/early-lto-debug/gcc/testsuite/gcc.dg/cse_recip.c:10:1:
note: not vectorized: vectorization is not profitable.
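The decision in the dump can be restated as simple cost arithmetic
(a simplification for illustration, not the actual tree-vect-slp.c code):

```c
#include <assert.h>

/* Simplified restatement (not the actual tree-vect-slp.c code) of the
   basic-block vectorization profitability check behind the dump above:
   sum the inside, prologue and epilogue vector costs and compare against
   the scalar cost; vectorization is rejected when the vector total is
   higher.  Here 4 + 4 + 0 = 8 > 6, hence "not profitable".  */
static int
bb_vect_profitable_p (int vec_inside, int vec_prologue,
                      int vec_epilogue, int scalar)
{
  return vec_inside + vec_prologue + vec_epilogue <= scalar;
}
```

With the recip pass moved before vectorization the prologue constants
push the vector total above the scalar cost, so the block stays scalar.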
Not sure if we want to shuffle passes at this stage though.
Richard.