[Bug tree-optimization/62173] [5.0 regression] 64bit Arch can't ivopt while 32bit Arch can

2015-02-04 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62173

--- Comment #33 from amker at gcc dot gnu.org ---
(In reply to Richard Biener from comment #32)
> "So I take the other way around by passing the IV's ssa_name into
> scev_probably_wraps_p along call sequence
> "idx_find_step->convert_affine_scev->scev_probably_wraps".  Since the IV's
> ssa_name is tagged with right range information, we can use it when proving
> there is no overflow/wrap in src scev."
> 
> I'm not sure that this is always correct - is the name we ask
> scev_probably_wraps
> on always equal to the IV?  Is it never offsetted?  I think that we need a
> soultion that involves tracking of range information all the way through
> SCEV analysis and thus have it on the CHREC itself.

Yes this is wanted, but I think it's another improvement.  In IVOPT, we don't
need to inherit range information from CHREC from SCEV.  The structure iv has
field ssa_name pointing to the ssa var.  We can use this var's range
information in IVOPT.  Well, the only problem is that field is reset in
record_use.

Moreover, for this PR, it isn't range information of IV's var that we want. 
What we want is the range information of IV's base var.  As in below dump:

  :
  # RANGE [0, 10] NONZERO 15
  # d_26 = PHI 
  # RANGE [0, 9] NONZERO 15
  d_13 = d_26 + -1;
  _14 = A[d_26];
  # RANGE [0, 255] NONZERO 255
  _15 = (int) _14;
  # USE = nonlocal
  # CLB = nonlocal
  foo (_15);
  if (d_13 != 0)
goto ;
  else
goto ;

  :
  goto ;

We really want to take advantage of i_6(D)'s range information, which isn't an
IV.  And it's possible the range information of i_6(D) may not hold in other
part of code other than this loop.
I had patches for last two issues, will send for review later.

> 
> 
> As for the other things - yes, there are multiple places where we lose sign
> information and STEP is just one example (I _think_ I made DR_STEP for
> data-ref analysis forced signed at some point).
> 
> Promoting the array index to a word_mode type sounds interesting, it of
> course
> only makes sense for previously signed types - otherwise you just get another
> conversion in the way.  Of course with GCC you will likely run into the issue
> that this promotion is cancelled by the forced conversion to sizetype, so
> you'll end up with unsigned word_mode again.  Btw - this was kind of my
> suggestion with making get_inner_reference return *POFFSET as ssizetype type,
> not sizetype (callers might not expect this, so for experimenting just
> wrap get_inner_reference in sth converting that back to sizetype for all
> callers but IVOPTs and tree-affine.c (and eventually
> tree-scalar-evolution.c)).

I now don't think this PR is caused by the different type precision issue. 
However it's related to type conversion and how SCEV handles it in GCC.  I need
more study to fully understand the issue, so shouldn't saying too much in case
it's wrong...


[Bug tree-optimization/62173] [5.0 regression] 64bit Arch can't ivopt while 32bit Arch can

2015-01-30 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62173

--- Comment #32 from Richard Biener  ---
"So I take the other way around by passing the IV's ssa_name into
scev_probably_wraps_p along call sequence
"idx_find_step->convert_affine_scev->scev_probably_wraps".  Since the IV's
ssa_name is tagged with right range information, we can use it when proving
there is no overflow/wrap in src scev."

I'm not sure that this is always correct - is the name we ask
scev_probably_wraps
on always equal to the IV?  Is it never offsetted?  I think that we need a
soultion that involves tracking of range information all the way through
SCEV analysis and thus have it on the CHREC itself.


As for the other things - yes, there are multiple places where we lose sign
information and STEP is just one example (I _think_ I made DR_STEP for
data-ref analysis forced signed at some point).

Promoting the array index to a word_mode type sounds interesting, it of course
only makes sense for previously signed types - otherwise you just get another
conversion in the way.  Of course with GCC you will likely run into the issue
that this promotion is cancelled by the forced conversion to sizetype, so
you'll end up with unsigned word_mode again.  Btw - this was kind of my
suggestion with making get_inner_reference return *POFFSET as ssizetype type,
not sizetype (callers might not expect this, so for experimenting just
wrap get_inner_reference in sth converting that back to sizetype for all
callers but IVOPTs and tree-affine.c (and eventually tree-scalar-evolution.c)).


[Bug tree-optimization/62173] [5.0 regression] 64bit Arch can't ivopt while 32bit Arch can

2015-01-29 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62173

--- Comment #31 from amker at gcc dot gnu.org ---
So cand_value_at (loop, cand, use->stmt, desc->niter, &bnd) with arguments as
below:

  cand->iv->base:
  (unsigned long) ((char *) &A + (sizetype) i_6(D))
  cand->iv->step:
  0x
  desc->niter:
  (unsigned int)(i_6(D) + -1)
  use->stmt:
  is after increment

The result calculated should like below:
  iv->base + iv->step * (unsigned long)niter + step
<=>
  (unsigned long) ((char *) &A + (sizetype) i_6(D))
+
  0x * ((unsigned long)(unsigned int)(i_6(D) + -1))
+
  0x

Even with range information [1, 10] for i_6(D), and we can prove that niter's
range is [0, 9].  We can't prove it equals to:
  (unsigned long)((char *) &A)

This is because we use unsigned type for step and lose the sign information?
And this can't be fixed even with the proper range information.  Is this
understanding correct?  Or anything I should do to achieve that?

Thanks very much.


[Bug tree-optimization/62173] [5.0 regression] 64bit Arch can't ivopt while 32bit Arch can

2015-01-28 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62173

--- Comment #30 from amker at gcc dot gnu.org ---
(In reply to Richard Biener from comment #17)
> I really wonder why IVOPTs calls convert_affine_scev with
> !use_overflow_semantics.
> 
> Note that for the original testcase 'i' may be negative or zero and thus 'd'
> may be zero.  We do a bad analysis here because IVOPTs follows complete
> peeling immediately...  but at least we have range information that looks
> useful:
> 
>   :
>   # RANGE [0, 10] NONZERO 15
>   # d_26 = PHI 
>   # RANGE [0, 9] NONZERO 15
>   d_13 = d_26 + -1;
>   _14 = A[d_26];
>   # RANGE [0, 255] NONZERO 255
>   _15 = (int) _14;
>   # USE = nonlocal
>   # CLB = nonlocal
>   foo (_15);
>   if (d_13 != 0)
> goto ;
>   else
> goto ;
> 
>   :
>   goto ;
> 
> but unfortunately we expand the initial value of the IV for d all the way
> to i_6(D) so we don't see that i_6(D) is constrained by the range for d_26.
> 
> So when we are in idx_find_step before we replace *idx with iv->base
> we could check range-information on whether it wrapped.  Hmm, I think
> we can't really compute this.  But we can transfer range information
> (temporarily) from d_26 to iv->base i_6(D) and make use of that in
> scev_probably_wraps_p.  There we currently compute whether
> (unsigned) i_6(D) + 2147483648 (??) > 9 using fold_binary but with
> range information [0, 10] it would compute as false (huh, so what is it
> actually testing?!).  I think the computation of 'delta' should instead
> be adjusted to use range information - max for negative step and min
> for positive step.  Like the following:
> 
> Index: gcc/tree-ssa-loop-niter.c
> ===
> --- gcc/tree-ssa-loop-niter.c   (revision 220038)
> +++ gcc/tree-ssa-loop-niter.c   (working copy)
> @@ -3863,12 +3863,17 @@ scev_probably_wraps_p (tree base, tree s
>   bound of the type, and verify that the loop is exited before this
>   occurs.  */
>unsigned_type = unsigned_type_for (type);
> -  base = fold_convert (unsigned_type, base);
> -
>if (tree_int_cst_sign_bit (step))
>  {
>tree extreme = fold_convert (unsigned_type,
>lower_bound_in_type (type, type));
> +  wide_int min, max;
> +  if (TREE_CODE (base) == SSA_NAME
> + && INTEGRAL_TYPE_P (TREE_TYPE (base))
> + && get_range_info (base, &min, &max) == VR_RANGE)
> +   base = wide_int_to_tree (unsigned_type, max);
> +  else
> +   base = fold_convert (unsigned_type, base);
>delta = fold_build2 (MINUS_EXPR, unsigned_type, base, extreme);
>step_abs = fold_build1 (NEGATE_EXPR, unsigned_type,
>   fold_convert (unsigned_type, step));
> @@ -3877,6 +3882,13 @@ scev_probably_wraps_p (tree base, tree s
>  {
>tree extreme = fold_convert (unsigned_type,
>upper_bound_in_type (type, type));
> +  wide_int min, max;
> +  if (TREE_CODE (base) == SSA_NAME
> + && INTEGRAL_TYPE_P (TREE_TYPE (base))
> + && get_range_info (base, &min, &max) == VR_RANGE)
> +   base = wide_int_to_tree (unsigned_type, min);
> +  else
> +   base = fold_convert (unsigned_type, base);
>delta = fold_build2 (MINUS_EXPR, unsigned_type, extreme, base);
>step_abs = fold_convert (unsigned_type, step);
>  }
> 
> doesn't really help this case unless i_6(D) gets range-information transfered
> temporarily as I said above, of course.

As you said, range information for i_6(D) actually is flow-sensitive
information, we can't simply propagate range_info(d) to i_6(D) generally.  Even
if we can, the code change is not natural since the related functions are
scattered across different functions (ivopt/chrec/niter).
So I take the other way around by passing the IV's ssa_name into
scev_probably_wraps_p along call sequence
"idx_find_step->convert_affine_scev->scev_probably_wraps".  Since the IV's
ssa_name is tagged with right range information, we can use it when proving
there is no overflow/wrap in src scev.

This mothed works and GCC now recognizes the address iv use in A[d].

BUT, the problem not only exists in address type iv's code, but also in compare
type iv's.  The IV dump file is now as below:


  :
  _4 = (sizetype) i_6(D);
  _3 = &A + _4;
  ivtmp.11_17 = (unsigned long) _3;
  _1 = (sizetype) i_6(D);
  _2 = (unsigned int) i_6(D);
  _22 = _2 + 4294967295;
  _21 = (sizetype) _22;
  _20 = _1 - _21;
  _29 = _20 + 18446744073709551615;
  _30 = &A + _29;
  _31 = (unsigned long) _30;

  :
  # ivtmp.11_18 = PHI 
  _5 = (void *) ivtmp.11_18;
  _14 = MEM[base: _5, offset: 0B];
  foo (_14);
  ivtmp.11_8 = ivtmp.11_18 - 1;
  if (ivtmp.11_8 != _31)
goto ;
  else
goto ;

  :
  goto ;

  :
  goto ;

The loop preheader is bloated by computing "cand_value_at (loop, cand,
use->stmt, desc->niter, &bnd);" in function "may_eliminate_iv" and we have:

(gdb) call debug_gener

[Bug tree-optimization/62173] [5.0 regression] 64bit Arch can't ivopt while 32bit Arch can

2015-01-28 Thread LpSolit at netscape dot net
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62173

--- Comment #29 from Jiong Wang  ---
(In reply to amker from comment #20)
> (In reply to Richard Biener from comment #18)
> > It's probably not correct to simply transfer range info from *idx to
> > iv->base.
> > Instead SCEV analysis needs to track the range of CHREC_LEFT when it 
> > analyzes
> > the SSA use-def chain.  That's of course a much bigger change :/
> > 
> > The patch may still help in some cases - I suppose the original testcase is
> > reduced from sth else?
> 
> Of course, take range information into consideration is always an
> improvement.

The RANGE info is a good idea, Bin, it's worth a quick exploration.


[Bug tree-optimization/62173] [5.0 regression] 64bit Arch can't ivopt while 32bit Arch can

2015-01-27 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62173

--- Comment #28 from rguenther at suse dot de  ---
On Tue, 27 Jan 2015, amker at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62173
> 
> --- Comment #26 from amker at gcc dot gnu.org ---
> (In reply to Richard Biener from comment #17)
> > I really wonder why IVOPTs calls convert_affine_scev with
> > !use_overflow_semantics.
> I don't understand below code in convert_affine_scev:
> 
>   enforce_overflow_semantics = (use_overflow_semantics
> && nowrap_type_p (type));
> According to comments, 
> 
>"USE_OVERFLOW_SEMANTICS is true if this function should assume that
>the rules for overflow of the given language apply (e.g., that signed
>arithmetics in C does not overflow) -- i.e., to use them to avoid
> unnecessary
>tests, but also to enforce that the result follows them."
> 
> Seems to me we need to enforce overflow check for result if we take 
> advantage of USE_OVERFLOW_SEMANTICS to prove there is no overflow for 
> src.  So shouldn't we set enforce_overflow_semantics according to 
> "nowrap_type_p (TREE_TYPE (*base))", rather than the result type.

Yes, I also wondered about this...

> Also it is noted at the end of function, that we can't use the fact 
> "signed variables do not overflow" when we are checking for result.
>
> But the function is used widespread in scev, there shouldn't be anything so
> wrong.

Heh - I wouldn't count on that.

> > Note that for the original testcase 'i' may be negative or zero and thus 'd'
> > may be zero.  We do a bad analysis here because IVOPTs follows complete
> > peeling immediately...  but at least we have range information that looks
> > useful:
> 
> The case also holds for O2, at this level gcc won't completely unroll 
> the first loop.
> 
> An irrelevant question.  Isn't cunroll too aggressive in GCC?  For cases 
> like this one, the code size is bloated and may hurt Icache performance, 
> while only saving several increment instruction.

Yeah - it was Honza enabling this aggressive peeling.  It makes sense
for a limited amount of code growth (like peeling two iterations) but
indeed using the same limit as for unrolling (where we know intermediate
exits are not taken) doesn't make too much sense...  I wonder if
the size estimates are correctly handling that fact...


[Bug tree-optimization/62173] [5.0 regression] 64bit Arch can't ivopt while 32bit Arch can

2015-01-26 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62173

--- Comment #27 from amker at gcc dot gnu.org ---
(In reply to Jiong Wang from comment #24)
> (In reply to amker from comment #23)
> 
> partially agree.
> 
> at least for the single use case given by Seb, I think tree ivopt should do
> it. (I verified clang do ivopt correctly for the case)

LLVM generates correct code, but I am not sure it's because of ivopt.  The dump
after ivopt for llvm is like below:

; Function Attrs: nounwind
define void @bar(i32 %d) #0 {
entry:
  %A = alloca [10 x i8], align 1
  %cmp2 = icmp sgt i32 %d, 0
  br i1 %cmp2, label %while.body.lr.ph, label %while.end

while.body.lr.ph: ; preds = %entry
  %0 = sext i32 %d to i64
  br label %while.body

while.body:   ; preds = %while.body.lr.ph,
%while.body
  %indvars.iv = phi i64 [ %0, %while.body.lr.ph ], [ %indvars.iv.next,
%while.body ]
  %indvars.iv.next = add nsw i64 %indvars.iv, -1
  %scevgep = getelementptr i8* %A4, i64 %indvars.iv
  %1 = load i8* %scevgep, align 1, !tbaa !1
  tail call void @foo(i8 %1) #2
  %2 = add i64 %indvars.iv.next, 1
  %cmp = icmp sgt i64 %2, 1
  br i1 %cmp, label %while.body, label %while.end.loopexit

The induction variable chosen is the original biv (d) actually, just like GCC.

So even if we fix the idx_find_step issue, GCC's ivopt still can generate below
codes:

Loop-preheader
  ...
Loop-body:
  iv = phi
  tmp = (POINTER_TYPE)&A;
  foo(MEM[base:tmp, index:iv]);

Without proper RTL optimization, very likely the issue in calculation of base
address of A still exists.

> 
> for the rtl re-associate, it's a little bit painful from my experiment
> experiences. as it's not always good to reassociate virtual_frame + offset,
> we can only benefit if it's in loop, because the re-associate will increase
> register pressure, there will be situations that more callee-saved regs
> used, and finally we run into unncessary push/pop in pro/epilogue... and I
> haven't found a good place where we can safely re-use existed rtl info and
> do the rtl re-association as I am afraid rebuild those rtl info will cause
> compile time penalty.


[Bug tree-optimization/62173] [5.0 regression] 64bit Arch can't ivopt while 32bit Arch can

2015-01-26 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62173

--- Comment #26 from amker at gcc dot gnu.org ---
(In reply to Richard Biener from comment #17)
> I really wonder why IVOPTs calls convert_affine_scev with
> !use_overflow_semantics.
I don't understand below code in convert_affine_scev:

  enforce_overflow_semantics = (use_overflow_semantics
&& nowrap_type_p (type));
According to comments, 

   "USE_OVERFLOW_SEMANTICS is true if this function should assume that
   the rules for overflow of the given language apply (e.g., that signed
   arithmetics in C does not overflow) -- i.e., to use them to avoid
unnecessary
   tests, but also to enforce that the result follows them."

Seems to me we need to enforce overflow check for result if we take advantage
of USE_OVERFLOW_SEMANTICS to prove there is no overflow for src.  So shouldn't
we set enforce_overflow_semantics according to "nowrap_type_p (TREE_TYPE
(*base))", rather than the result type.  Also it is noted at the end of
function, that we can't use the fact "signed variables do not overflow" when we
are checking for result.

But the function is used widespread in scev, there shouldn't be anything so
wrong.


> 
> Note that for the original testcase 'i' may be negative or zero and thus 'd'
> may be zero.  We do a bad analysis here because IVOPTs follows complete
> peeling immediately...  but at least we have range information that looks
> useful:

The case also holds for O2, at this level gcc won't completely unroll the first
loop.

An irrelevant question.  Isn't cunroll too aggressive in GCC?  For cases like
this one, the code size is bloated and may hurt Icache performance, while only
saving several increment instruction.


[Bug tree-optimization/62173] [5.0 regression] 64bit Arch can't ivopt while 32bit Arch can

2015-01-26 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62173

--- Comment #25 from amker at gcc dot gnu.org ---
(In reply to Jiong Wang from comment #24)
> (In reply to amker from comment #23)
> 
> partially agree.
> 
> at least for the single use case given by Seb, I think tree ivopt should do
> it. (I verified clang do ivopt correctly for the case)
> 
> for the rtl re-associate, it's a little bit painful from my experiment
> experiences. as it's not always good to reassociate virtual_frame + offset,
> we can only benefit if it's in loop, because the re-associate will increase
> register pressure, there will be situations that more callee-saved regs
> used, and finally we run into unncessary push/pop in pro/epilogue... and I
> haven't found a good place where we can safely re-use existed rtl info and
> do the rtl re-association as I am afraid rebuild those rtl info will cause
> compile time penalty.

Yes, The ivopt's issue that it doesn't treat "A[d]" as an address type iv use
should be fixed anyway.


[Bug tree-optimization/62173] [5.0 regression] 64bit Arch can't ivopt while 32bit Arch can

2015-01-26 Thread jiwang at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62173

--- Comment #24 from Jiong Wang  ---
(In reply to amker from comment #23)

partially agree.

at least for the single use case given by Seb, I think tree ivopt should do it.
(I verified clang do ivopt correctly for the case)

for the rtl re-associate, it's a little bit painful from my experiment
experiences. as it's not always good to reassociate virtual_frame + offset, we
can only benefit if it's in loop, because the re-associate will increase
register pressure, there will be situations that more callee-saved regs used,
and finally we run into unncessary push/pop in pro/epilogue... and I haven't
found a good place where we can safely re-use existed rtl info and do the rtl
re-association as I am afraid rebuild those rtl info will cause compile time
penalty.


[Bug tree-optimization/62173] [5.0 regression] 64bit Arch can't ivopt while 32bit Arch can

2015-01-26 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62173

--- Comment #23 from amker at gcc dot gnu.org ---
Now I am less convinced that it's a tree ivopt issue.  Tree optimizer has no
knowledge about stack frame information for local array variables.  With the
original test, on 32-bits targets, tree ivopts happens to choose the IV of base
address like below:

  :
  # ivtmp.11_82 = PHI 
  _87 = (void *) ivtmp.11_82;
  _13 = MEM[base: _87, offset: 0B];
  ivtmp.11_83 = ivtmp.11_82 - 1;
  _14 = (int) _13;
  foo (_14);
  if (ivtmp.11_83 != _88)
goto ;
  else
goto ;

  :
  goto ;

IMHO, we shouldn't depend on this.  It's much clearer to consider the exactly
multiple-use example in previous comment, but this time in loop context:

void bar(int i) {
  char A[10];
  char B[10];
  char C[10];
  int d = 0;
  while (i > 0)
A[d++] = i--;

  while (d > 0)
  {
foo(A[d]);
foo(B[d]);
foo(C[d]);
d--;
  }
}

Tree ivopt has no knowledge that references of A/B/C in loop share common
sub-expression.  Most likely (as suggested by previous comments), GCC will
generates below IR:

  :
  _93 = (sizetype) i_6(D);
  _94 = &A + _93;
  ivtmp.11_92 = (unsigned int) _94;
  _98 = (sizetype) i_6(D);
  _99 = _98 + 1;
  _100 = &B + _99;
  ivtmp.15_97 = (unsigned int) _100;
  _104 = (sizetype) i_6(D);
  _105 = _104 + 1;
  _106 = &C + _105;
  ivtmp.20_103 = (unsigned int) _106;
  _110 = (unsigned int) &A;

  :
  # ivtmp.11_90 = PHI 
  # ivtmp.15_95 = PHI 
  # ivtmp.20_101 = PHI 
  _107 = (void *) ivtmp.11_90;
  _13 = MEM[base: _107, offset: 0B];
  ivtmp.11_91 = ivtmp.11_90 - 1;
  _14 = (int) _13;
  foo (_14);
  ivtmp.15_96 = ivtmp.15_95 - 1;
  _108 = (void *) ivtmp.15_96;
  _16 = MEM[base: _108, offset: 0B];
  _17 = (int) _16;
  foo (_17);
  ivtmp.20_102 = ivtmp.20_101 - 1;
  _109 = (void *) ivtmp.20_102;
  _19 = MEM[base: _109, offset: 0B];
  _20 = (int) _19;
  foo (_20);
  if (ivtmp.11_91 != _110)
goto ;
  else
goto ;

  :
  goto ;

It not good at least on targets without auto-increment addressing modes.  Even
on ARM/AARCH64 with auto-increment addressing modes, it's not always practical
because of bloated loop-setup, or register pressure issue caused by duplicated
inducation variables.

In this case, we should associate "virtual_frame + offset1" and hoist it out of
loop, while in loop, we should choose only one inducation variable (the biv),
then refer to A/B/C using offset addressing mode like below:

  
base_1 = virtual_frame + off1

  :
  # d_26 = PHI 
base_2 = base_1 + d_26
d_13 = d_26 - 1;
foo (MEM[base_2, off_A])
foo (MEM[base_2, off_B])
foo (MEM[base_2, off_C])
goto 


So maybe this is a RTL issue, we firstly should do reassociation like
"virtual_frame + reg + offset" to "(virtual_frame + offset) + reg".  As for
missed CSE opportunities in address expressions, maybe it can be fixed by an
additional strength reduction pass on RTL, just like gimple-strength-reduction
pass.  The rtl pass is necessary simply because the gimple one has no knowledge
of stack frame information, just like tree IVOPT.


[Bug tree-optimization/62173] [5.0 regression] 64bit Arch can't ivopt while 32bit Arch can

2015-01-26 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62173

--- Comment #22 from rguenther at suse dot de  ---
On Mon, 26 Jan 2015, amker at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62173
> 
> --- Comment #20 from amker at gcc dot gnu.org ---
> (In reply to Richard Biener from comment #18)
> > It's probably not correct to simply transfer range info from *idx to
> > iv->base.
> > Instead SCEV analysis needs to track the range of CHREC_LEFT when it 
> > analyzes
> > the SSA use-def chain.  That's of course a much bigger change :/
> > 
> > The patch may still help in some cases - I suppose the original testcase is
> > reduced from sth else?
> 
> I see it's a tricky problem, and I have to admit that I don't understand it
> very well yet.  The question is, is relax of POINTER_PLUS_EXPR constraint the
> right way fixing this?  I do remember some other PRs (other than this one or
> bug 52563) are caused by this constraint according to your comments.

Well, it's not sure that relaxing POINTER_PLUS_EXPR will help in the 
end...

> Of course, take range information into consideration is always an 
> improvement. 
> Actually I have another possible example in iv elimination.  Curently GCC
> simply rejects elimination of an iv use with a candidate if the cand is of
> smaller type, but as long as we can prove value of iv use can be represented 
> by
> the smaller candidate, elimination is actually safe.  But seems to me, fixing
> this issue with value information is like a side-effect?

Might be - but it's a matter of fixing the observable regression ;)


[Bug tree-optimization/62173] [5.0 regression] 64bit Arch can't ivopt while 32bit Arch can

2015-01-26 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62173

--- Comment #21 from rguenther at suse dot de  ---
On Mon, 26 Jan 2015, ramana at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62173
> 
> --- Comment #19 from Ramana Radhakrishnan  ---
> (In reply to Richard Biener from comment #18)
> > It's probably not correct to simply transfer range info from *idx to
> > iv->base.
> > Instead SCEV analysis needs to track the range of CHREC_LEFT when it 
> > analyzes
> > the SSA use-def chain.  That's of course a much bigger change :/
> > 
> > The patch may still help in some cases - I suppose the original testcase is
> > reduced from sth else?
> 
> Not sure if this is related to comment #c2 where the reference is to a 15%
> regression in bzip2 compress at O3. Sebastian could probably confirm.
> 
> I don't think we can ignore the regression though, can we ?

We should try not to.


[Bug tree-optimization/62173] [5.0 regression] 64bit Arch can't ivopt while 32bit Arch can

2015-01-26 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62173

--- Comment #20 from amker at gcc dot gnu.org ---
(In reply to Richard Biener from comment #18)
> It's probably not correct to simply transfer range info from *idx to
> iv->base.
> Instead SCEV analysis needs to track the range of CHREC_LEFT when it analyzes
> the SSA use-def chain.  That's of course a much bigger change :/
> 
> The patch may still help in some cases - I suppose the original testcase is
> reduced from sth else?

I see it's a tricky problem, and I have to admit that I don't understand it
very well yet.  The question is, is relax of POINTER_PLUS_EXPR constraint the
right way fixing this?  I do remember some other PRs (other than this one or
bug 52563) are caused by this constraint according to your comments.

Of course, take range information into consideration is always an improvement. 
Actually I have another possible example in iv elimination.  Curently GCC
simply rejects elimination of an iv use with a candidate if the cand is of
smaller type, but as long as we can prove value of iv use can be represented by
the smaller candidate, elimination is actually safe.  But seems to me, fixing
this issue with value information is like a side-effect?


[Bug tree-optimization/62173] [5.0 regression] 64bit Arch can't ivopt while 32bit Arch can

2015-01-26 Thread ramana at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62173

--- Comment #19 from Ramana Radhakrishnan  ---
(In reply to Richard Biener from comment #18)
> It's probably not correct to simply transfer range info from *idx to
> iv->base.
> Instead SCEV analysis needs to track the range of CHREC_LEFT when it analyzes
> the SSA use-def chain.  That's of course a much bigger change :/
> 
> The patch may still help in some cases - I suppose the original testcase is
> reduced from sth else?

Not sure if this is related to comment #c2 where the reference is to a 15%
regression in bzip2 compress at O3. Sebastian could probably confirm.

I don't think we can ignore the regression though, can we ?

regards
Ramana


[Bug tree-optimization/62173] [5.0 regression] 64bit Arch can't ivopt while 32bit Arch can

2015-01-26 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62173

--- Comment #18 from Richard Biener  ---
It's probably not correct to simply transfer range info from *idx to iv->base.
Instead SCEV analysis needs to track the range of CHREC_LEFT when it analyzes
the SSA use-def chain.  That's of course a much bigger change :/

The patch may still help in some cases - I suppose the original testcase is
reduced from sth else?


[Bug tree-optimization/62173] [5.0 regression] 64bit Arch can't ivopt while 32bit Arch can

2015-01-26 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62173

--- Comment #17 from Richard Biener  ---
I really wonder why IVOPTs calls convert_affine_scev with
!use_overflow_semantics.

Note that for the original testcase 'i' may be negative or zero and thus 'd'
may be zero.  We do a bad analysis here because IVOPTs follows complete
peeling immediately...  but at least we have range information that looks
useful:

  :
  # RANGE [0, 10] NONZERO 15
  # d_26 = PHI 
  # RANGE [0, 9] NONZERO 15
  d_13 = d_26 + -1;
  _14 = A[d_26];
  # RANGE [0, 255] NONZERO 255
  _15 = (int) _14;
  # USE = nonlocal
  # CLB = nonlocal
  foo (_15);
  if (d_13 != 0)
goto ;
  else
goto ;

  :
  goto ;

but unfortunately we expand the initial value of the IV for d all the way
to i_6(D) so we don't see that i_6(D) is constrained by the range for d_26.

So when we are in idx_find_step before we replace *idx with iv->base
we could check range-information on whether it wrapped.  Hmm, I think
we can't really compute this.  But we can transfer range information
(temporarily) from d_26 to iv->base i_6(D) and make use of that in
scev_probably_wraps_p.  There we currently compute whether
(unsigned) i_6(D) + 2147483648 (??) > 9 using fold_binary but with
range information [0, 10] it would compute as false (huh, so what is it
actually testing?!).  I think the computation of 'delta' should instead
be adjusted to use range information - max for negative step and min
for positive step.  Like the following:

Index: gcc/tree-ssa-loop-niter.c
===
--- gcc/tree-ssa-loop-niter.c   (revision 220038)
+++ gcc/tree-ssa-loop-niter.c   (working copy)
@@ -3863,12 +3863,17 @@ scev_probably_wraps_p (tree base, tree s
  bound of the type, and verify that the loop is exited before this
  occurs.  */
   unsigned_type = unsigned_type_for (type);
-  base = fold_convert (unsigned_type, base);
-
   if (tree_int_cst_sign_bit (step))
 {
   tree extreme = fold_convert (unsigned_type,
   lower_bound_in_type (type, type));
+  wide_int min, max;
+  if (TREE_CODE (base) == SSA_NAME
+ && INTEGRAL_TYPE_P (TREE_TYPE (base))
+ && get_range_info (base, &min, &max) == VR_RANGE)
+   base = wide_int_to_tree (unsigned_type, max);
+  else
+   base = fold_convert (unsigned_type, base);
   delta = fold_build2 (MINUS_EXPR, unsigned_type, base, extreme);
   step_abs = fold_build1 (NEGATE_EXPR, unsigned_type,
  fold_convert (unsigned_type, step));
@@ -3877,6 +3882,13 @@ scev_probably_wraps_p (tree base, tree s
 {
   tree extreme = fold_convert (unsigned_type,
   upper_bound_in_type (type, type));
+  wide_int min, max;
+  if (TREE_CODE (base) == SSA_NAME
+ && INTEGRAL_TYPE_P (TREE_TYPE (base))
+ && get_range_info (base, &min, &max) == VR_RANGE)
+   base = wide_int_to_tree (unsigned_type, min);
+  else
+   base = fold_convert (unsigned_type, base);
   delta = fold_build2 (MINUS_EXPR, unsigned_type, extreme, base);
   step_abs = fold_convert (unsigned_type, step);
 }

doesn't really help this case unless i_6(D) gets range-information transfered
temporarily as I said above, of course.


[Bug tree-optimization/62173] [5.0 regression] 64bit Arch can't ivopt while 32bit Arch can

2015-01-23 Thread jiwang at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62173

--- Comment #16 from Jiong Wang  ---
After some work on this issue, things have gone beyond my expectations. To
summarize my understanding of this issue and what I have got:

Reason of regression
==
  the testcase contains a loop which is hotcode, and there is one critical 
loop invariant, actually base address of local array, if it can't be ivopted
then there will be big performance regression on this testcase.

  because of one gcc tree level ivopt issue, this variable will not be 
ivopted on 64bit machine, mips, powerpc, aarch64 etc.

  mostly because there some type precision issues in gcc tree level code that 
gcc is doing too conservative decision on overflow checking and thus abort the
ivopt in early stage. For details, please see above discussion.

  Then why there is regression on AArch64 since

 r213488 "[AArch64] Improve TARGET_LEGITIMIZE_ADDRESS_P hook" ?

  it's because except tree level ivopt pass, gcc also do extra ivopt on
rtl level by the rtl pre pass. previously, we have inefficient implementation
on aarch64 TARGET_LEGITIMIZE_ADDRESS_P while this inefficient implementation
happen to generated some instruction pattern that rtl level ivopt could catch 
and that critical variable finally ivopted by rtl pass.

  r213488 fixed the inefficiency of TARGET_LEGITIMIZE_ADDRESS_P, and as a
side-effect, we will not generate that specific instruction pattern again, thus
no ivopt on rtl pre, and thus regression happen.

  before r213488, ivopt happen on AArch64 only, while after r213488, AArch64
act the same as all other 64bit targets, mips64, powerpc64, sparc64 etc.

To Fix
==
  * the idea way is fix this issue on tree-ssa-loop-ivopt pass.
We need to correct type precision & overflow checking issue, then gcc could
ivopt the critical variable as early as in tree level for all 64bit
targets.

  * gcc rtl level ivopt could be improved also. we could re-shuffle rtl
pattern, then rtl pre pass could catch more ivopt opportunities and
act as a supplement to tree ivopt.

What I have tried
=
  I have tried to fix this issue on rtl level. We need two patches.

* one patch to correct aarch64 TARGET_LEGITIMIZE_ADDRESS hook to generate
  more efficient rtl pattern for array indexed address.

* one patch to re-shuffle rtl pattern to let rtl pre pass catch more ivopt
  opportunities.

  For targets except AArch64, patch 2 alone is enough to let ivopt happen in
rtl
pre pass.

  While for AArch64, we need to correct TARGET_LEGITIMIZE_ADDRESS first, to   
don't split valid "offset" into "magic-base + offset" which make the rtl
sequences unnecessarily complexer and cause troubles for later rtl pass.

  but this fix on AArch64 TARGET_LEGITIMIZE_ADDRESS hook then exposed another
issue on tree-ssa-loop-ivopt. When running benchmarks, I found some post-index
addressing opportunities are missing after we correct
TARGET_LEGITIMIZE_ADDRESS.
The correct of TARGET_LEGITIMIZE_ADDRESS will actually correct address cost
which tree-ssa-loop-ivopt will query, and looks like there are glitches in
tree-ssa-loop-ivopt cost model that gcc ivopt some simple post-index
addressable
variable, and thus generated sub-optimal code.

  And the patch for re-shuffle rtl pattern alone,
https://gcc.gnu.org/ml/gcc-patches/2014-12/msg00355.html, has critical problem.
It's unsafe to re-use those REG_NOTE info in those places. So the patch needs
re-work, or we need to
investigate a better solution to fix this on rtl level.

Future work
===
  I hope someone could have a looks at the tree level fix which is the ideal 
fix.

  For RTL level fix, now the issue dependent chain for my solution becomes:

fix-on-rtl-level
  \
   v
  re-shuffle insn for rtl pre
\
 v
fix AArch64 TARGET_LEGITIMIZE_ADDRESS (B)
  \
   v
  fix tree-ssa-loop-ivopt cost model to
  make better decision between ivopt
  and post-index. (A)

I think task A and B are important as they are quite general improvements to
AArch64 and gcc generic code besides this testcase issue.

I'd prepare a testcase and create a seperate ticket for task A first.

And, I don't think we can fix this issue in 5.0 release, so move the target to
the next release?

Thanks.


[Bug tree-optimization/62173] [5.0 regression] 64bit Arch can't ivopt while 32bit Arch can

2014-11-27 Thread jiwang at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62173

--- Comment #15 from Jiong Wang  ---
(In reply to rguent...@suse.de from comment #14)
> I'd be happy to approve such a change - maybe you can check if
> it helps you and whether it passes testing and enough benchmarks?
> (there is still the related code in pointer_int_sum which should
> be adjusted accordingly)

thanks for these explanation and information. Need some time to digest :)

Sure, I will test it and check benchmarks.


[Bug tree-optimization/62173] [5.0 regression] 64bit Arch can't ivopt while 32bit Arch can

2014-11-27 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62173

--- Comment #14 from rguenther at suse dot de  ---
On Thu, 27 Nov 2014, rguenther at suse dot de wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62173
> 
> --- Comment #13 from rguenther at suse dot de  ---
> On Thu, 27 Nov 2014, jiwang at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62173
> > 
> > Jiong Wang  changed:
> > 
> >What|Removed |Added
> > 
> >  CC||rguenth at gcc dot gnu.org
> >  Depends on||52563
> > 
> > --- Comment #12 from Jiong Wang  ---
> > the root cause why it's not ivopted on AArch64 is because
> > 
> >   must_check_src_overflow = TYPE_PRECISION (ct) < TYPE_PRECISION (type);
> > 
> > code above in convert_affine_scev, the input type is sizetype, so DI for 
> > 64bit
> > arch, SI for 32bit arch, ct is SI, thus, must_check_src_overflow set to true
> > for 64bit arch, then failed later scev_probably_wraps_p check.
> > 
> > And I found similar issue reported back in 2012, at bug 52563.
> > 
> > I verified this bug exist on other 64 archs, like mips64, ppc64, x86-64
> > 
> > Richard, on 52563, I see you was working on this, do you have any thoughts 
> > on
> > this?
> 
> See comment #5 of that bug.  For 4.8(?) I started on work to relax
> the type requirements of the offset parameter of POINTER_PLUS_EXPR
> by abstracting stuff but I didn't get to continue on that work.
> 
> Basically that we force the offset to 'sizetype' has both correctness
> issues (for targets where sizetype precision doesn't match Pmode
> precision) and optimization issues as we lose for example sign
> information and overflow knowledge in the computation of the offset.
> The last thing is also because we have transforms in fold which
> push typecasts of expressions down to operands - thus
> from (sizetype) (4 * i) we get 4 * (sizetype)i which may now be
> an unsigned multiplication with wrapping overflow.  Note that
> it is the frontends who start the conversion thing and apply some
> "tricks" for code-gen (see pointer_int_sum in c-common.c).
> It's also not clear whether if you write p[i] with i of type int
> the multiplication by sizeof (*p) invokes undefined behavior if
> it wraps (that is, the C standard does not define the type the
> multiplication is performed in but just defines things in terms
> of array elements).
> 
> Ideally we'd use a widening multiplication here but optimizers
> have little knowledge of that so it probably would cause quite
> some regressions.  We could also keep the multiplication signed
> (but using ssizetype), but then fold will come along and
> undo that trick IIRC.
> 
> That said, both the POINTER_PLUS_EXPR constraints on the offset
> type _and_ the C language issue with int * sizeof (element)
> overflowing for 64bit pointer sizes prevent us from optimizing this.
> 
> It's a very tricky area ;)

A related part of code to pointer_int_sum is get_inner_reference.
If you do

Index: gcc/expr.c
===
--- gcc/expr.c  (revision 218121)
+++ gcc/expr.c  (working copy)
@@ -6852,9 +6852,10 @@ get_inner_reference (tree exp, HOST_WIDE
   index, low_bound);

offset = size_binop (PLUS_EXPR, offset,
+fold_convert (sizetype,
 size_binop (MULT_EXPR,
-fold_convert (sizetype, 
index),
-unit_size));
+fold_convert (ssizetype, 
index),
+fold_convert (ssizetype, 
unit_size;
  }
  break;


thus compute index * size multiplication in a signed type and
then only convert the result back to unsigned then it for example
improves the testcase in PR52563 comment #4 in the following way:

 .L5:
-   movq%rdx, %rcx
-   movq%rax, %r8
-   movl$100, (%rax)
-   addq%rdi, %rdx
-   addq%rsi, %rax
-   cmpq$999, %rcx
+   movl$100, a(,%rax,4)
+   addq%rdi, %rax
+   movq%rdx, %rsi
+   addq%rcx, %rdx
+   cmpq$999, %rax
jle .L5

not sure if that's really faster in the end though, but IVOPTs
does its job in that case.

Now you need to argue that doing that is safe - the change does
two things.

 1) we interpret 'index' as signed
 2) we say the multiplication does not overflow

that would break for example

 unsigned long i = 0x8000UL;
 int a[1];
 a[i] = 0;

as i * 4 overflows to 0 and thus the access is valid.

Of course accessing the element 0x8000UL of an array
of size 1 invokes undefined behavior.  Now you get to prove
that there is no case that breaks where it wouldn't be

[Bug tree-optimization/62173] [5.0 regression] 64bit Arch can't ivopt while 32bit Arch can

2014-11-27 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62173

--- Comment #13 from rguenther at suse dot de  ---
On Thu, 27 Nov 2014, jiwang at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62173
> 
> Jiong Wang  changed:
> 
>What|Removed |Added
> 
>  CC||rguenth at gcc dot gnu.org
>  Depends on||52563
> 
> --- Comment #12 from Jiong Wang  ---
> the root cause why it's not ivopted on AArch64 is because
> 
>   must_check_src_overflow = TYPE_PRECISION (ct) < TYPE_PRECISION (type);
> 
> code above in convert_affine_scev, the input type is sizetype, so DI for 64bit
> arch, SI for 32bit arch, ct is SI, thus, must_check_src_overflow set to true
> for 64bit arch, then failed later scev_probably_wraps_p check.
> 
> And I found similar issue reported back in 2012, at bug 52563.
> 
> I verified this bug exist on other 64 archs, like mips64, ppc64, x86-64
> 
> Richard, on 52563, I see you was working on this, do you have any thoughts on
> this?

See comment #5 of that bug.  For 4.8(?) I started on work to relax
the type requirements of the offset parameter of POINTER_PLUS_EXPR
by abstracting stuff but I didn't get to continue on that work.

Basically that we force the offset to 'sizetype' has both correctness
issues (for targets where sizetype precision doesn't match Pmode
precision) and optimization issues as we lose for example sign
information and overflow knowledge in the computation of the offset.
The last thing is also because we have transforms in fold which
push typecasts of expressions down to operands - thus
from (sizetype) (4 * i) we get 4 * (sizetype)i which may now be
an unsigned multiplication with wrapping overflow.  Note that
it is the frontends who start the conversion thing and apply some
"tricks" for code-gen (see pointer_int_sum in c-common.c).
It's also not clear whether if you write p[i] with i of type int
the multiplication by sizeof (*p) invokes undefined behavior if
it wraps (that is, the C standard does not define the type the
multiplication is performed in but just defines things in terms
of array elements).

Ideally we'd use a widening multiplication here but optimizers
have little knowledge of that so it probably would cause quite
some regressions.  We could also keep the multiplication signed
(but using ssizetype), but then fold will come along and
undo that trick IIRC.

That said, both the POINTER_PLUS_EXPR constraints on the offset
type _and_ the C language issue with int * sizeof (element)
overflowing for 64bit pointer sizes prevent us from optimizing this.

It's a very tricky area ;)