[Bug tree-optimization/79460] gcc fails to optimise out a trivial additive loop for seemingly arbitrary numbers of iterations

2017-02-11 Thread segher at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79460

Segher Boessenkool  changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
   Last reconfirmed||2017-02-11
 CC||segher at gcc dot gnu.org
 Ever confirmed|0   |1

[Bug tree-optimization/79460] gcc fails to optimise out a trivial additive loop for seemingly arbitrary numbers of iterations

2017-02-12 Thread drraph at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79460

--- Comment #1 from Raphael C  ---
After some experimentation (also carried out by Hagen von Eitzen), it seems
that any limit of at least 72 which is also a multiple of 4 causes the same
optimisation problem. That is the loop is *not* optimised out in these cases.  

Perhaps this is an example of one optimisation (SIMD vectorisation) conflicting
with another?

[Bug tree-optimization/79460] gcc fails to optimise out a trivial additive loop for seemingly arbitrary numbers of iterations

2017-02-12 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79460

amker at gcc dot gnu.org changed:

   What|Removed |Added

 CC||amker at gcc dot gnu.org

--- Comment #2 from amker at gcc dot gnu.org ---
(In reply to Raphael C from comment #1)
> After some experimentation (also carried out by Hagen von Eitzen), it seems
> that any limit of at least 72 which is also a multiple of 4 causes the same
> optimisation problem. That is the loop is *not* optimised out in these
> cases.  
> 
> Perhaps this is an example of one optimisation (SIMD vectorisation)
> conflicting with another?

IMHO, this is "scev propagation" on floating point values if ffast-math? is
enabled.  We shouldn't rely on vectorizer to propagate final value, it should
be done somewhere before vectorizer.  Thanks.

[Bug tree-optimization/79460] gcc fails to optimise out a trivial additive loop for seemingly arbitrary numbers of iterations

2017-02-13 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79460

Richard Biener  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |rguenth at gcc dot 
gnu.org

--- Comment #3 from Richard Biener  ---
In this case it is complete unrolling that can estimate the non-vector code to
constant fold but not the vectorized code.  OTOH it's quite excessive work done
by the unroller when doing this for large N...

And yes, SCEV final value replacement doesn't know how to handle float
reductions
(we have a different PR for that).

Let's track the unroll cost model issue here (I believe I have partial patches
for that somewhere, so mine).

[Bug tree-optimization/79460] gcc fails to optimise out a trivial additive loop for seemingly arbitrary numbers of iterations

2017-02-13 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79460

Jakub Jelinek  changed:

   What|Removed |Added

 CC||jakub at gcc dot gnu.org

--- Comment #4 from Jakub Jelinek  ---
(In reply to Richard Biener from comment #3)
> In this case it is complete unrolling that can estimate the non-vector code
> to constant fold but not the vectorized code.  OTOH it's quite excessive
> work done by the unroller when doing this for large N...
> 
> And yes, SCEV final value replacement doesn't know how to handle float
> reductions
> (we have a different PR for that).

Doesn't handle float reductions nor vector (integer or vector) reductions.
Even the vector ones would be useful, if e.g. to a vector every iteration adds
a VECTOR_CST or similar, then it could be still nicely optimized.

For the 202 case, it seems we are generating a scalar loop epilogue (not needed
for 200) and somehow it seems something in the vector is actually able to
figure out the floating point final value, because we get:
  # p_2 = PHI <2.01e+2(5), p_12(7)>
  # i_3 = PHI <200(5), i_13(7)>
on the scalar loop epilogue.  So if something in the vectorizer is able to
figure it out, why can't it just use that even in the case where no epilogue
loop is needed?

[Bug tree-optimization/79460] gcc fails to optimise out a trivial additive loop for seemingly arbitrary numbers of iterations

2017-02-13 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79460

--- Comment #5 from amker at gcc dot gnu.org ---
(In reply to Jakub Jelinek from comment #4)
> (In reply to Richard Biener from comment #3)
> > In this case it is complete unrolling that can estimate the non-vector code
> > to constant fold but not the vectorized code.  OTOH it's quite excessive
> > work done by the unroller when doing this for large N...
> > 
> > And yes, SCEV final value replacement doesn't know how to handle float
> > reductions
> > (we have a different PR for that).
> 
> Doesn't handle float reductions nor vector (integer or vector) reductions.
> Even the vector ones would be useful, if e.g. to a vector every iteration
> adds a VECTOR_CST or similar, then it could be still nicely optimized.
Integer version should have already been supported now.

> 
> For the 202 case, it seems we are generating a scalar loop epilogue (not
> needed for 200) and somehow it seems something in the vector is actually
> able to figure out the floating point final value, because we get:
>   # p_2 = PHI <2.01e+2(5), p_12(7)>
>   # i_3 = PHI <200(5), i_13(7)>
> on the scalar loop epilogue.  So if something in the vectorizer is able to
> figure it out, why can't it just use that even in the case where no epilogue
> loop is needed?
IIUC, scev-ccp should be made query based interface so that it can be called
for each loop closed phi at different compilation stage.  It also needs to be
extended to cover basic floating point case like this.  Effectively, it need to
do the same transformation as vectorizer does now, but just thought it might be
a better place to do that.

[Bug tree-optimization/79460] gcc fails to optimise out a trivial additive loop for seemingly arbitrary numbers of iterations

2017-02-14 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79460

--- Comment #6 from rguenther at suse dot de  ---
On Mon, 13 Feb 2017, amker at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79460
> 
> --- Comment #5 from amker at gcc dot gnu.org ---
> (In reply to Jakub Jelinek from comment #4)
> > (In reply to Richard Biener from comment #3)
> > > In this case it is complete unrolling that can estimate the non-vector 
> > > code
> > > to constant fold but not the vectorized code.  OTOH it's quite excessive
> > > work done by the unroller when doing this for large N...
> > > 
> > > And yes, SCEV final value replacement doesn't know how to handle float
> > > reductions
> > > (we have a different PR for that).
> > 
> > Doesn't handle float reductions nor vector (integer or vector) reductions.
> > Even the vector ones would be useful, if e.g. to a vector every iteration
> > adds a VECTOR_CST or similar, then it could be still nicely optimized.
> Integer version should have already been supported now.
> 
> > 
> > For the 202 case, it seems we are generating a scalar loop epilogue (not
> > needed for 200) and somehow it seems something in the vector is actually
> > able to figure out the floating point final value, because we get:
> >   # p_2 = PHI <2.01e+2(5), p_12(7)>
> >   # i_3 = PHI <200(5), i_13(7)>
> > on the scalar loop epilogue.  So if something in the vectorizer is able to
> > figure it out, why can't it just use that even in the case where no epilogue
> > loop is needed?
> IIUC, scev-ccp should be made query based interface so that it can be called
> for each loop closed phi at different compilation stage.  It also needs to be
> extended to cover basic floating point case like this.  Effectively, it need 
> to
> do the same transformation as vectorizer does now, but just thought it might 
> be
> a better place to do that.

Yeah, the vectorizer does this in vect_update_ivs_after_vectorizer
by accident I think - it sees the float "IV" and replaces the prologue
loop init by init + niter * step which is on the border of invalid
(without -ffp-contract=on/fast).  At least if the vectorizer can do this
then final value replacement can do so as well with

Index: gcc/tree-scalar-evolution.c
===
--- gcc/tree-scalar-evolution.c (revision 245417)
+++ gcc/tree-scalar-evolution.c (working copy)
@@ -3718,13 +3718,6 @@ final_value_replacement_loop (struct loo
  continue;
}

-  if (!POINTER_TYPE_P (TREE_TYPE (def))
- && !INTEGRAL_TYPE_P (TREE_TYPE (def)))
-   {
- gsi_next (&psi);
- continue;
-   }
-
   bool folded_casts;
   def = analyze_scalar_evolution_in_loop (ex_loop, loop, def,
  &folded_casts);

(rather than removing the condition replace it with a validity check -
like FP contraction?  etc...).
But ideally SCEV itself would contain those (or compute exact results
with rounding effects).

Like maybe simply

Index: gcc/tree-scalar-evolution.c
===
--- gcc/tree-scalar-evolution.c (revision 245417)
+++ gcc/tree-scalar-evolution.c (working copy)
@@ -3718,8 +3718,10 @@ final_value_replacement_loop (struct loo
  continue;
}

-  if (!POINTER_TYPE_P (TREE_TYPE (def))
- && !INTEGRAL_TYPE_P (TREE_TYPE (def)))
+  if (! (POINTER_TYPE_P (TREE_TYPE (def))
+|| INTEGRAL_TYPE_P (TREE_TYPE (def))
+|| (FLOAT_TYPE_P (TREE_TYPE (def))
+&& flag_fp_contract_mode == FP_CONTRACT_FAST)))
{
  gsi_next (&psi);
  continue;

Richard.

[Bug tree-optimization/79460] gcc fails to optimise out a trivial additive loop for seemingly arbitrary numbers of iterations

2017-02-14 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79460

--- Comment #7 from Jakub Jelinek  ---
Shouldn't it (both in the vectorizer and in scev) be dependent not just on
flag_fp_contract_mode but also on some -ffast-math subflag?  Doing several
additions can e.g. raise different exceptions and have different roundings from
doing it as just one multiply.

[Bug tree-optimization/79460] gcc fails to optimise out a trivial additive loop for seemingly arbitrary numbers of iterations

2017-02-14 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79460

--- Comment #8 from rguenther at suse dot de  ---
On Tue, 14 Feb 2017, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79460
> 
> --- Comment #7 from Jakub Jelinek  ---
> Shouldn't it (both in the vectorizer and in scev) be dependent not just on
> flag_fp_contract_mode but also on some -ffast-math subflag?  Doing several
> additions can e.g. raise different exceptions and have different roundings 
> from
> doing it as just one multiply.

Maybe, but I don't see which (apart from flag_unsafe_math_optimizations
of course but we'd want to get rid of that...).  reassoc uses
flag_unsafe_math_optimizations (and does this for SCALAR_FLOAT_TYPE_P
only).  The docs only mention FMA for fp-contract...

[Bug tree-optimization/79460] gcc fails to optimise out a trivial additive loop for seemingly arbitrary numbers of iterations

2017-02-14 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79460

--- Comment #9 from Jakub Jelinek  ---
(In reply to rguent...@suse.de from comment #8)
> On Tue, 14 Feb 2017, jakub at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79460
> > 
> > --- Comment #7 from Jakub Jelinek  ---
> > Shouldn't it (both in the vectorizer and in scev) be dependent not just on
> > flag_fp_contract_mode but also on some -ffast-math subflag?  Doing several
> > additions can e.g. raise different exceptions and have different roundings 
> > from
> > doing it as just one multiply.
> 
> Maybe, but I don't see which (apart from flag_unsafe_math_optimizations
> of course but we'd want to get rid of that...).  reassoc uses
> flag_unsafe_math_optimizations (and does this for SCALAR_FLOAT_TYPE_P
> only).  The docs only mention FMA for fp-contract...

fp-contract is I believe only about whether you are allowed to transform x = a
+ b * c; into a FMA or not (or similar expressions), so I think isn't even
related to this.  Even if we want to get rid of flag_unsafe_math_optimizations
eventually, it is better to use that if we don't know what exactly rather than
emit wrong-code with -fno-fast-math, we'll have to eventually add some new
flags or figure it out.

[Bug tree-optimization/79460] gcc fails to optimise out a trivial additive loop for seemingly arbitrary numbers of iterations

2017-02-14 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79460

--- Comment #10 from amker at gcc dot gnu.org ---
(In reply to Jakub Jelinek from comment #9)
> (In reply to rguent...@suse.de from comment #8)
> > On Tue, 14 Feb 2017, jakub at gcc dot gnu.org wrote:
> > 
> > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79460
> > > 
> > > --- Comment #7 from Jakub Jelinek  ---
> > > Shouldn't it (both in the vectorizer and in scev) be dependent not just on
> > > flag_fp_contract_mode but also on some -ffast-math subflag?  Doing several
> > > additions can e.g. raise different exceptions and have different 
> > > roundings from
> > > doing it as just one multiply.
> > 
> > Maybe, but I don't see which (apart from flag_unsafe_math_optimizations
> > of course but we'd want to get rid of that...).  reassoc uses
> > flag_unsafe_math_optimizations (and does this for SCALAR_FLOAT_TYPE_P
> > only).  The docs only mention FMA for fp-contract...
> 
> fp-contract is I believe only about whether you are allowed to transform x =
> a + b * c; into a FMA or not (or similar expressions), so I think isn't even
> related to this.  Even if we want to get rid of
> flag_unsafe_math_optimizations eventually, it is better to use that if we
> don't know what exactly rather than emit wrong-code with -fno-fast-math,
> we'll have to eventually add some new flags or figure it out.

I also think we need flag_unsafe_math_optimizations or similar on this, this is
risky transformation and we may put iterative numerical algorithms in danger.

[Bug tree-optimization/79460] gcc fails to optimise out a trivial additive loop for seemingly arbitrary numbers of iterations

2017-02-14 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79460

--- Comment #11 from rguenther at suse dot de  ---
On Tue, 14 Feb 2017, amker at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79460
> 
> --- Comment #10 from amker at gcc dot gnu.org ---
> (In reply to Jakub Jelinek from comment #9)
> > (In reply to rguent...@suse.de from comment #8)
> > > On Tue, 14 Feb 2017, jakub at gcc dot gnu.org wrote:
> > > 
> > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79460
> > > > 
> > > > --- Comment #7 from Jakub Jelinek  ---
> > > > Shouldn't it (both in the vectorizer and in scev) be dependent not just 
> > > > on
> > > > flag_fp_contract_mode but also on some -ffast-math subflag?  Doing 
> > > > several
> > > > additions can e.g. raise different exceptions and have different 
> > > > roundings from
> > > > doing it as just one multiply.
> > > 
> > > Maybe, but I don't see which (apart from flag_unsafe_math_optimizations
> > > of course but we'd want to get rid of that...).  reassoc uses
> > > flag_unsafe_math_optimizations (and does this for SCALAR_FLOAT_TYPE_P
> > > only).  The docs only mention FMA for fp-contract...
> > 
> > fp-contract is I believe only about whether you are allowed to transform x =
> > a + b * c; into a FMA or not (or similar expressions), so I think isn't even
> > related to this.  Even if we want to get rid of
> > flag_unsafe_math_optimizations eventually, it is better to use that if we
> > don't know what exactly rather than emit wrong-code with -fno-fast-math,
> > we'll have to eventually add some new flags or figure it out.
> 
> I also think we need flag_unsafe_math_optimizations or similar on this, this 
> is
> risky transformation and we may put iterative numerical algorithms in danger.

And here's that unroll heuristic patch (cleaned up a bit), it makes us
consider all operations on constants (where SCEV can compute an evolution)
as constant (and actually also folds them in the poor-mans cprop code).
Only works up to 16 iterations of course (thus for the testcase for
an upper bound of 16*4).  Kind of unrelated to this PR of course.

Index: gcc/tree-ssa-loop-ivcanon.c
===
--- gcc/tree-ssa-loop-ivcanon.c (revision 245417)
+++ gcc/tree-ssa-loop-ivcanon.c (working copy)
@@ -157,8 +157,6 @@ struct loop_size
 static bool
 constant_after_peeling (tree op, gimple *stmt, struct loop *loop)
 {
-  affine_iv iv;
-
   if (is_gimple_min_invariant (op))
 return true;

@@ -189,11 +187,9 @@ constant_after_peeling (tree op, gimple
 }

   /* Induction variables are constants.  */
-  if (!simple_iv (loop, loop_containing_stmt (stmt), op, &iv, false))
-return false;
-  if (!is_gimple_min_invariant (iv.base))
-return false;
-  if (!is_gimple_min_invariant (iv.step))
+  tree ev = analyze_scalar_evolution (loop, op);
+  if (chrec_contains_undetermined (ev)
+  || chrec_contains_symbols (ev))
 return false;
   return true;
 }
@@ -1259,7 +1255,7 @@ propagate_constants_for_unrolling (basic

   if (! SSA_NAME_OCCURS_IN_ABNORMAL_PHI (result)
  && gimple_phi_num_args (phi) == 1
- && TREE_CODE (arg) == INTEGER_CST)
+ && CONSTANT_CLASS_P (arg))
{
  replace_uses_by (result, arg);
  gsi_remove (&gsi, true);
@@ -1276,7 +1272,7 @@ propagate_constants_for_unrolling (basic
   tree lhs;

   if (is_gimple_assign (stmt)
- && gimple_assign_rhs_code (stmt) == INTEGER_CST
+ && TREE_CODE_CLASS (gimple_assign_rhs_code (stmt)) == 
tcc_constant
  && (lhs = gimple_assign_lhs (stmt), TREE_CODE (lhs) == SSA_NAME)
  && !SSA_NAME_OCCURS_IN_ABNORMAL_PHI (lhs))
{

[Bug tree-optimization/79460] gcc fails to optimise out a trivial additive loop for seemingly arbitrary numbers of iterations

2017-02-14 Thread segher at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79460

--- Comment #12 from Segher Boessenkool  ---
(In reply to Jakub Jelinek from comment #7)
> Shouldn't it (both in the vectorizer and in scev) be dependent not just on
> flag_fp_contract_mode but also on some -ffast-math subflag?  Doing several
> additions can e.g. raise different exceptions and have different roundings
> from doing it as just one multiply.

This needs -fassociative-math I think?