[Bug tree-optimization/92980] [miss optimization]redundant load missed by fre.

2020-01-16 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92980

--- Comment #12 from Richard Biener  ---
Ah, sorry I failed to see the entry from BB2 isn't fallthru.  I agree the
difference is spurious in this particular case but in general it's quite hard
to do better without excessively rotating most multi-exit loops.  Note in
the end you also have to satisfy should_duplicate_loop_header_p for each
block to duplicate.

So what is suprious here is that we do not consider

 do
   {
   }
 while (i++ < n);

to be a do-while loop but we do for

 do
   {
   }
 while (++i < n);

due to the IV update in the latch.  IMHO technically that's correct.

The loop in question does not look like a do-while loop to us because
the loop header doesn't contain "most" of the loop body.  But CFG wise
it perfectly satisfies the predicate.

I think the new FAILs are somewhat spurious and one would need to investigate
on a case-by-case basis whether they really are a regression.  Often
they are testcases for specific input IL (which you changed) rather than
a specific C testcase.

[Bug tree-optimization/92980] [miss optimization]redundant load missed by fre.

2020-01-16 Thread wwwhhhyyy333 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92980

--- Comment #11 from Hongyu Wang  ---
(In reply to rguent...@suse.de from comment #10)
> 
> It has two exits which makes it difficult
> Or impossible to make it truly do-while.
> But it's close enough and further rotating the loop doesn't make it better.

Thanks for the explanation. But the original loop without eliminating the
parser code is:

  [local count: 114863532]:
 goto ; [100.00%]

  [local count: 1014686025]:
 i.0_1 = (unsigned int) i_10;
 _2 = i.0_1 * 4;
 _3 = a_13(D) + _2;
 _4 = *_3;
 _5 = i.0_1 + 1;
 _6 = _5 * 4;
 _7 = a_13(D) + _6;
 _8 = *_7;
 if (_4 > _8)
   goto ; [5.50%]
 else
   goto ; [94.50%]

  [local count: 958878293]:
 i_15 = i_10 + 1;

  [local count: 1073741824]:
 # i_10 = PHI <0(2), i_15(4)>
 _9 = n_12(D) + -1;
 if (_9 > i_10)
   goto ; [94.50%]
 else
   goto ; [5.50%]

  [local count: 114863532]:
 # _11 = PHI <0(3), 1(5)>
 return _11;

This is considered as not a do-while loop since the latch (bb 4) is not empty,
and it is successfully rotated. I don't think there is much difference except
i_15 = i_10 + 1 which can be optimized by fre after the parser change. 

So I wonder if the original one is rotated, why the one with empty latch can't.

[Bug tree-optimization/92980] [miss optimization]redundant load missed by fre.

2020-01-15 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92980

--- Comment #10 from rguenther at suse dot de  ---
On January 16, 2020 3:55:18 AM GMT+01:00, wwwhhhyyy333 at gmail dot com
 wrote:
>https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92980
>
>--- Comment #9 from Hongyu Wang  ---
>(In reply to Hongtao.liu from comment #6)
>> New fail by removal
>
>> unix/-m32: gcc.dg/tree-ssa/copy-headers-5.c scan-tree-dump ch2 "is
>now
>> do-while loop"
>> unix/-m32: gcc.dg/tree-ssa/copy-headers-5.c scan-tree-dump-times ch2
>"  if "
>> 3
>> unix/-m32: gcc.dg/tree-ssa/copy-headers-7.c scan-tree-dump ch2 "is
>now
>> do-while loop"
>> unix/-m32: gcc.dg/tree-ssa/copy-headers-7.c scan-tree-dump-times ch2
>"Will
>> duplicate bb" 3
>> unix/-m32: gcc.dg/tree-ssa/pr81744.c scan-tree-dump-times pcom
>"Store-stores
>> chain" 2
>> unix/-m64: gcc.dg/tree-ssa/copy-headers-5.c scan-tree-dump ch2 "is
>now
>> do-while loop"
>> unix/-m64: gcc.dg/tree-ssa/copy-headers-5.c scan-tree-dump-times ch2
>"  if "
>> 3
>> unix/-m64: gcc.dg/tree-ssa/copy-headers-7.c scan-tree-dump ch2 "is
>now
>> do-while loop"
>> unix/-m64: gcc.dg/tree-ssa/copy-headers-7.c scan-tree-dump-times ch2
>"Will
>> duplicate bb" 3
>
>I looked into this and found it is because of current judgement for
>do-while
>loop:
>
>--
>  [local count: 114863532]:  
> goto ; [100.00%]
>
>  [local count: 1014686025]: 
> i.0_1 = (unsigned int) i_11;  
> _2 = i.0_1 * 4;   
> _3 = a_14(D) + _2;
> _4 = *_3; 
> _5 = i_11 + 1;
> _6 = (unsigned int) _5;   
> _7 = _6 * 4;  
> _8 = a_14(D) + _7;
> _9 = *_8; 
> if (_4 > _9)  
>   goto ; [5.50%]
> else  
>   goto ; [94.50%]   
>
>  [local count: 958878294]:  
>
>  [local count: 1073741824]: 
> # i_11 = PHI <0(2), _5(6)>
> _10 = n_13(D) + -1;   
> if (_10 > i_11)   
>   goto ; [94.50%]   
> else  
>   goto ; [5.50%]
>
>  [local count: 114863532]:  
> # _12 = PHI <0(3), 1(4)>  
> return _12;   
>
>This is regarded as a do-while loop because the it satisfies all the
>condition
>for do-while loop judgement:
>
>1) Loop latch is empty.
>2) The latch has single predecessor.
>3) The latch predecessor will exit loop.
>
>But I don't think the loop above is a do-while loop, so does the
>conditions
>need to be modified?

It has two exits which makes it difficult
Or impossible to make it truly do-while.
But it's close enough and further rotating the loop doesn't make it better.

[Bug tree-optimization/92980] [miss optimization]redundant load missed by fre.

2020-01-15 Thread wwwhhhyyy333 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92980

--- Comment #9 from Hongyu Wang  ---
(In reply to Hongtao.liu from comment #6)
> New fail by removal

> unix/-m32: gcc.dg/tree-ssa/copy-headers-5.c scan-tree-dump ch2 "is now
> do-while loop"
> unix/-m32: gcc.dg/tree-ssa/copy-headers-5.c scan-tree-dump-times ch2 "  if "
> 3
> unix/-m32: gcc.dg/tree-ssa/copy-headers-7.c scan-tree-dump ch2 "is now
> do-while loop"
> unix/-m32: gcc.dg/tree-ssa/copy-headers-7.c scan-tree-dump-times ch2 "Will
> duplicate bb" 3
> unix/-m32: gcc.dg/tree-ssa/pr81744.c scan-tree-dump-times pcom "Store-stores
> chain" 2
> unix/-m64: gcc.dg/tree-ssa/copy-headers-5.c scan-tree-dump ch2 "is now
> do-while loop"
> unix/-m64: gcc.dg/tree-ssa/copy-headers-5.c scan-tree-dump-times ch2 "  if "
> 3
> unix/-m64: gcc.dg/tree-ssa/copy-headers-7.c scan-tree-dump ch2 "is now
> do-while loop"
> unix/-m64: gcc.dg/tree-ssa/copy-headers-7.c scan-tree-dump-times ch2 "Will
> duplicate bb" 3

I looked into this and found it is because of current judgement for do-while
loop:

--
  [local count: 114863532]:  
 goto ; [100.00%]

  [local count: 1014686025]: 
 i.0_1 = (unsigned int) i_11;  
 _2 = i.0_1 * 4;   
 _3 = a_14(D) + _2;
 _4 = *_3; 
 _5 = i_11 + 1;
 _6 = (unsigned int) _5;   
 _7 = _6 * 4;  
 _8 = a_14(D) + _7;
 _9 = *_8; 
 if (_4 > _9)  
   goto ; [5.50%]
 else  
   goto ; [94.50%]   

  [local count: 958878294]:  

  [local count: 1073741824]: 
 # i_11 = PHI <0(2), _5(6)>
 _10 = n_13(D) + -1;   
 if (_10 > i_11)   
   goto ; [94.50%]   
 else  
   goto ; [5.50%]

  [local count: 114863532]:  
 # _12 = PHI <0(3), 1(4)>  
 return _12;   

This is regarded as a do-while loop because the it satisfies all the condition
for do-while loop judgement:

1) Loop latch is empty.
2) The latch has single predecessor.
3) The latch predecessor will exit loop.

But I don't think the loop above is a do-while loop, so does the conditions
need to be modified?

[Bug tree-optimization/92980] [miss optimization]redundant load missed by fre.

2019-12-19 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92980

--- Comment #8 from Hongtao.liu  ---
(In reply to Andrew Pinski from comment #4)

> But that is not true any more.  So I think this optimization can be removed
> as it is too early.  Just double check the above testcase and the C++
> testcase (g++.dg/opt/ptrintsum1.C) to make sure they still work and post
> that removal.  This optimization is most likely causing other missed
> optimizations already too.  So I would compile SPEC to see if there is any
> differences; my bet you might find some.

No big impact for SPEC2017, more or less like noise.

500.perlbench_r 0.21%
502.gcc_r   0.14%
505.mcf_r   -0.40%
520.omnetpp_r   -0.47%
523.xalancbmk_r -1.20%
525.x264_r  -1.26%
531.deepsjeng_r -0.05%
541.leela_r -0.39%
548.exchange2_r -0.09%
557.xz_r-0.16%
geomean for intrate -0.37%
503.bwaves_r-0.19%
507.cactuBSSN_r 0.23%
508.namd_r  -0.12%
510.parest_r0.18%
511.povray_r-0.30%
519.lbm_r   BuildSame   #VALUE!
521.wrf_r   -0.01%
526.blender_r   -0.44%
527.cam4_r  -0.17%
538.imagick_r   0.47%
544.nab_r   -1.00%
549.fotonik3d_r 0.09%
554.roms_r  0.28%
geomean for fprate  -0.08%
geomean -0.21%

[Bug tree-optimization/92980] [miss optimization]redundant load missed by fre.

2019-12-19 Thread pinskia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92980

--- Comment #7 from Andrew Pinski  ---
(In reply to Hongtao.liu from comment #6)
> New fail by removal
> 
> unix/-m32: c-c++-common/restrict-2.c  -Wc++-compat   scan-tree-dump-times
> lim2 "Moving statement" 11

> unix/-m32: c-c++-common/restrict-2.c  -std=gnu++14  scan-tree-dump-times
> lim2 "Moving statement" 11
> unix/-m32: c-c++-common/restrict-2.c  -std=gnu++17  scan-tree-dump-times
> lim2 "Moving statement" 11
> unix/-m32: c-c++-common/restrict-2.c  -std=gnu++2a  scan-tree-dump-times
> lim2 "Moving statement" 11
> unix/-m32: c-c++-common/restrict-2.c  -std=gnu++98  scan-tree-dump-times
> lim2 "Moving statement" 11

I think this is just there are a different number of statement moved out of the
loop.


The rest I don't know what is the issue but you should look into each one to
figure out what the difference is.

[Bug tree-optimization/92980] [miss optimization]redundant load missed by fre.

2019-12-19 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92980

--- Comment #6 from Hongtao.liu  ---
New fail by removal

unix/-m32: c-c++-common/restrict-2.c  -Wc++-compat   scan-tree-dump-times lim2
"Moving statement" 11
unix/-m32: c-c++-common/restrict-2.c  -std=gnu++14  scan-tree-dump-times lim2
"Moving statement" 11
unix/-m32: c-c++-common/restrict-2.c  -std=gnu++17  scan-tree-dump-times lim2
"Moving statement" 11
unix/-m32: c-c++-common/restrict-2.c  -std=gnu++2a  scan-tree-dump-times lim2
"Moving statement" 11
unix/-m32: c-c++-common/restrict-2.c  -std=gnu++98  scan-tree-dump-times lim2
"Moving statement" 11
unix/-m32: gcc.dg/tree-ssa/copy-headers-5.c scan-tree-dump ch2 "is now do-while
loop"
unix/-m32: gcc.dg/tree-ssa/copy-headers-5.c scan-tree-dump-times ch2 "  if " 3
unix/-m32: gcc.dg/tree-ssa/copy-headers-7.c scan-tree-dump ch2 "is now do-while
loop"
unix/-m32: gcc.dg/tree-ssa/copy-headers-7.c scan-tree-dump-times ch2 "Will
duplicate bb" 3
unix/-m32: gcc.dg/tree-ssa/pr81744.c scan-tree-dump-times pcom "Store-stores
chain" 2
unix/-m32: gcc.dg/vect/pr57558-2.c -flto -ffat-lto-objects  scan-tree-dump vect
"vectorized 1 loops"
unix/-m32: gcc.dg/vect/pr57558-2.c scan-tree-dump vect "vectorized 1 loops"
unix/-m64: c-c++-common/builtins.c  -Wc++-compat  (test for excess errors)
unix/-m64: c-c++-common/restrict-2.c  -Wc++-compat   scan-tree-dump-times lim2
"Moving statement" 11
unix/-m64: c-c++-common/restrict-2.c  -std=gnu++14  scan-tree-dump-times lim2
"Moving statement" 11
unix/-m64: c-c++-common/restrict-2.c  -std=gnu++17  scan-tree-dump-times lim2
"Moving statement" 11
unix/-m64: c-c++-common/restrict-2.c  -std=gnu++2a  scan-tree-dump-times lim2
"Moving statement" 11
unix/-m64: c-c++-common/restrict-2.c  -std=gnu++98  scan-tree-dump-times lim2
"Moving statement" 11
unix/-m64: gcc.dg/tree-ssa/copy-headers-5.c scan-tree-dump ch2 "is now do-while
loop"
unix/-m64: gcc.dg/tree-ssa/copy-headers-5.c scan-tree-dump-times ch2 "  if " 3
unix/-m64: gcc.dg/tree-ssa/copy-headers-7.c scan-tree-dump ch2 "is now do-while
loop"
unix/-m64: gcc.dg/tree-ssa/copy-headers-7.c scan-tree-dump-times ch2 "Will
duplicate bb" 3
unix/-m64: gcc.dg/tree-ssa/pr81744.c scan-tree-dump-times pcom "Store-stores
chain" 2
unix/-m64: gcc.dg/vect/pr57558-2.c -flto -ffat-lto-objects  scan-tree-dump vect
"vectorized 1 loops"
unix/-m64: gcc.dg/vect/pr57558-2.c scan-tree-dump vect "vectorized 1 loops"

[Bug tree-optimization/92980] [miss optimization]redundant load missed by fre.

2019-12-18 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92980

--- Comment #5 from Hongtao.liu  ---
(In reply to Andrew Pinski from comment #4)

> But that is not true any more.  So I think this optimization can be removed
> as it is too early.  Just double check the above testcase and the C++
> testcase (g++.dg/opt/ptrintsum1.C) to make sure they still work and post
> that removal.

Removal works fines with those 2 testcases.

[Bug tree-optimization/92980] [miss optimization]redundant load missed by fre.

2019-12-18 Thread pinskia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92980

--- Comment #4 from Andrew Pinski  ---
(In reply to Hongtao.liu from comment #3)
> 
> ptrop ---> src1 + 18446744073709551612;
> intop ---> j
> 
> It seems on purpose???

Kinda.
What needs to happen is a sign extend rather than a zero extend which is what
this code is doing really in a weird way.

Take (This is a modified testcase from PR4401 which was added in 2002 because
of a C++ bug report and when the code was moved from being C specific to being
C family specific):

char *a;
char b[] = "";

void foo (void)
{
  unsigned int i, j;

  i = 2;
  j = 3;
  a[i + 1 - j] += i;
}

int main (void)
{
  a = b;
  foo ();
  if (b[0] != 'A' + 2)
__builtin_abort ();
  __builtin_exit (0);
}

--- CUT ---
NOTE this code dates before the GCC code was moved into source control (1992).  
I suspect it has not changed because it is "working".  Also note from that time
period to now, there has been much more optimizations going on so this might be
something which should change too.

The comment says this:
  /* If what we are about to multiply by the size of the elements
 contains a constant term, apply distributive law
 and multiply that constant term separately.
 This helps produce common subexpressions.  */


But that is not true any more.  So I think this optimization can be removed as
it is too early.  Just double check the above testcase and the C++ testcase
(g++.dg/opt/ptrintsum1.C) to make sure they still work and post that removal. 
This optimization is most likely causing other missed optimizations already
too.  So I would compile SPEC to see if there is any differences; my bet you
might find some.

[Bug tree-optimization/92980] [miss optimization]redundant load missed by fre.

2019-12-18 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92980

--- Comment #3 from Hongtao.liu  ---
(In reply to Andrew Pinski from comment #2)
> I think the problem is we are folding the right side of the array (with the
> conversion to size_t) too early.
> That is:
> src1[j-1]
> 
> Is being folded too early to have (j-1)*4
> 
> Fixing this up in match.pd is wrong.
> 
> This gets us the best code without any patch to match.pd:
> int foo(unsigned int *__restrict src1, int i, int k, int n)
> {
>   int j = k + n;
>   int sum = src1[j];
>   int jj = j-1;
>   sum += src1[jj];
>   if (i <= k)
> {
>   j+=2;
> int ii = j-3;
>   sum += src1[ii];
> }
>   return sum + j;
> }
> 
> See how j-1 and j-3 are not folded early and that fixes the issue.

It's done by parser, cut from c-common.c
--
3182   if (TREE_CODE (intop) == MINUS_EXPR)
3183 subcode = (subcode == PLUS_EXPR ? MINUS_EXPR : PLUS_EXPR);
3184   /* Convert both subexpression types to the type of intop,
3185  because weird cases involving pointer arithmetic
3186  can result in a sum or difference with different type args.  */
3187   ptrop = build_binary_op (EXPR_LOCATION (TREE_OPERAND (intop, 1)),
3188subcode, ptrop,
3189convert (int_type, TREE_OPERAND (intop,
1)),
3190true);
3191   intop = convert (int_type, TREE_OPERAND (intop, 0));
3192 }

--

ptrop ---> src1 + 18446744073709551612;
intop ---> j

It seems on purpose???

[Bug tree-optimization/92980] [miss optimization]redundant load missed by fre.

2019-12-17 Thread pinskia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92980

Andrew Pinski  changed:

   What|Removed |Added

   Keywords||missed-optimization
 Status|UNCONFIRMED |NEW
   Last reconfirmed||2019-12-18
 Ever confirmed|0   |1
   Severity|normal  |enhancement

--- Comment #2 from Andrew Pinski  ---
I think the problem is we are folding the right side of the array (with the
conversion to size_t) too early.
That is:
src1[j-1]

Is being folded too early to have (j-1)*4

Fixing this up in match.pd is wrong.

This gets us the best code without any patch to match.pd:
int foo(unsigned int *__restrict src1, int i, int k, int n)
{
  int j = k + n;
  int sum = src1[j];
  int jj = j-1;
  sum += src1[jj];
  if (i <= k)
{
  j+=2;
int ii = j-3;
  sum += src1[ii];
}
  return sum + j;
}

See how j-1 and j-3 are not folded early and that fixes the issue.

[Bug tree-optimization/92980] [miss optimization]redundant load missed by fre.

2019-12-17 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92980

--- Comment #1 from Hongtao.liu  ---
test.c.033.fre1
foo (unsigned int * restrict src1, int i, int k, int n)
{
  int sum;
  int j;
  long unsigned int _1;
  long unsigned int _2;
  unsigned int * _3;
  unsigned int _4;
  sizetype _7;
  unsigned int * _8;
  unsigned int _9;
  unsigned int _11;
  long unsigned int _12;
  long unsigned int _13;
  sizetype _14;
  unsigned int * _15;
  unsigned int _16;
  unsigned int _18;
  int _31;

   :
  j_23 = k_21(D) + n_22(D);
  _1 = (long unsigned int) j_23;
  _2 = _1 * 4;
  _3 = src1_24(D) + _2;
  _4 = *_3;
  sum_26 = (int) _4;
  _7 = _2 + 18446744073709551612;
  _8 = src1_24(D) + _7;
  _9 = *_8;
  _11 = _4 + _9;
  sum_27 = (int) _11;
  if (k_21(D) >= i_28(D))
goto ; [INV]
  else
goto ; [INV]

   :
  j_29 = j_23 + 2;
  _12 = (long unsigned int) j_29;
  _13 = _12 * 4;
  _14 = _13 + 18446744073709551604; --- it shoule be simplified to _7
  _15 = src1_24(D) + _14;
  _16 = *_15;
  _18 = _11 + _16;
  sum_30 = (int) _18;

   :
  # j_19 = PHI 
  # sum_20 = PHI 
  _31 = j_19 + sum_20;
  return _31;
}