On Wed, 31 Jan 2018, Christophe Lyon wrote:
> On 30 January 2018 at 11:47, Jakub Jelinek <[email protected]> wrote:
> > On Tue, Jan 30, 2018 at 11:07:50AM +0100, Richard Biener wrote:
> >>
> >> I have been asked to push this change, fixing (somewhat) the impreciseness
> >> of costing constant/invariant vector uses in SLP stmts. The previous
> >> code always just considered a single constant to be generated in the
> >> prologue irrespective of how many we'd need. With this patch we
> >> properly handle this count and optimize for the case when we can use
> >> a vector splat. It doesn't yet handle CSE (or CSE among stmts) which
> >> means it could in theory regress cases it overall costed correctly
> >> before "optimistically" (aka by accident). But at least the costing
> >> now matches code generation.
> >>
> >> Bootstrapped and tested on x86_64-unknown-linux-gnu. On x86_64
> >> Haswell with AVX2 SPEC 2k6 shows no off-noise changes.
> >>
> >> The patch is said to help the case in the PR when additional backend
> >> costing changes are done (for AVX512).
> >>
> >> Ok for trunk at this stage?
> >
> > LGTM.
> >
> >> 2018-01-30 Richard Biener <[email protected]>
> >>
> >> PR tree-optimization/83008
> >> * tree-vect-slp.c (vect_analyze_slp_cost_1): Properly cost
> >> invariant and constant vector uses in stmts when they need
> >> more than one stmt.
> >
> > Jakub
>
> Hi Richard,
>
> This patch caused a regression on aarch64*:
> FAIL: gcc.dg/cse_recip.c scan-tree-dump-times optimized "rdiv_expr" 1
> (found 2 times)
> we used to have:
> PASS: gcc.dg/cse_recip.c scan-tree-dump-times optimized "rdiv_expr" 1
We now vectorize this on aarch64 - looks like a V2SFmode is
available. This means we end up computing 1/x and also dividing
by {x, x}. That is non-optimal: SLP vectorization leaves the
scalar division around as dead code, and the multi-use check of
the recip pass trips over it, making the transform look profitable.
That's worth a bug report I think.
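For reference, a minimal sketch of the kind of pattern the recip pass
targets (hypothetical code, not the actual gcc.dg/cse_recip.c source):

```c
#include <assert.h>

/* Hypothetical sketch (not the actual gcc.dg/cse_recip.c source) of the
   pattern pass_cse_reciprocals targets: two divisions by the same value x.
   With -funsafe-math-optimizations the pass rewrites this as t = 1.f / x
   followed by two multiplications, leaving a single rdiv_expr where there
   were two -- which is what the scan-tree-dump-times pattern counts.
   The pass only fires when the divisor has multiple division uses; a dead
   scalar division left behind by SLP vectorization inflates that count.  */
static void
scale_pair (float x, float *a)
{
  a[0] = a[0] / x;   /* after the transform: a[0] = a[0] * t; */
  a[1] = a[1] / x;   /* after the transform: a[1] = a[1] * t; */
}
```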
For the testcase I'd simply adjust it to pass -fno-tree-slp-vectorize
-- or make sure to run the recip pass before vectorization. Not
sure why it only runs after loop optimizations?
Index: gcc/passes.def
===================================================================
--- gcc/passes.def (revision 257233)
+++ gcc/passes.def (working copy)
@@ -263,6 +263,7 @@ along with GCC; see the file COPYING3.
       NEXT_PASS (pass_asan);
       NEXT_PASS (pass_tsan);
       NEXT_PASS (pass_dce);
+      NEXT_PASS (pass_cse_reciprocals);
       /* Pass group that runs when 1) enabled, 2) there are loops
          in the function.  Make sure to run pass_fix_loops before
          to discover/remove loops before running the gate function
@@ -317,7 +318,6 @@ along with GCC; see the file COPYING3.
       POP_INSERT_PASSES ()
       NEXT_PASS (pass_simduid_cleanup);
       NEXT_PASS (pass_lower_vector_ssa);
-      NEXT_PASS (pass_cse_reciprocals);
       NEXT_PASS (pass_sprintf_length, true);
       NEXT_PASS (pass_reassoc, false /* insert_powi_p */);
       NEXT_PASS (pass_strength_reduction);
puts it right before loop opts and after a DCE pass. This results
in us no longer vectorizing the code:
Vector inside of basic block cost: 4
Vector prologue cost: 4
Vector epilogue cost: 0
Scalar cost of basic block: 6
/space/rguenther/src/svn/early-lto-debug/gcc/testsuite/gcc.dg/cse_recip.c:10:1:
note: not vectorized: vectorization is not profitable.
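The decision in the dump can be restated as simple cost arithmetic
(a simplification for illustration, not the actual tree-vect-slp.c code):

```c
#include <assert.h>

/* Simplified restatement (not the actual tree-vect-slp.c code) of the
   basic-block vectorization profitability check behind the dump above:
   sum the inside, prologue and epilogue vector costs and compare against
   the scalar cost; vectorization is rejected when the vector total is
   higher.  Here 4 + 4 + 0 = 8 > 6, hence "not profitable".  */
static int
bb_vect_profitable_p (int vec_inside, int vec_prologue,
                      int vec_epilogue, int scalar)
{
  return vec_inside + vec_prologue + vec_epilogue <= scalar;
}
```

With the recip pass moved before vectorization the prologue constants
push the vector total above the scalar cost, so the block stays scalar.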
Not sure if we want to shuffle passes at this stage though.
Richard.