[Bug tree-optimization/94092] Code size and performance degradations after -ftree-loop-distribute-patterns was enabled at -O[2s]+

2021-05-04 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94092

Richard Biener  changed:

   What|Removed |Added

 Status|ASSIGNED|NEW

[Bug tree-optimization/94092] Code size and performance degradations after -ftree-loop-distribute-patterns was enabled at -O[2s]+

2021-05-03 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94092

--- Comment #15 from rguenther at suse dot de  ---
On Fri, 30 Apr 2021, law at gcc dot gnu.org wrote:

> --- Comment #14 from Jeffrey A. Law  ---
> WRT Jim's comment about alignments in c#6.
> 
> Right now a pointer's alignment is really only used to eliminate unnecessary
> masking -- we don't propagate a pointer's known alignment to improve the known
> alignment of memory operations involving that pointer.

Hmm, but we do - in set_mem_attributes_minus_bitpos.  But maybe the
patch below is for RTL generated much later.

> [rest of comment #14, including the emit-rtl.c patch, snipped; quoted
> in full below]

[Bug tree-optimization/94092] Code size and performance degradations after -ftree-loop-distribute-patterns was enabled at -O[2s]+

2021-04-30 Thread law at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94092

Jeffrey A. Law  changed:

   What|Removed |Added

 CC||law at gcc dot gnu.org

--- Comment #14 from Jeffrey A. Law  ---
WRT Jim's comment about alignments in c#6.

Right now a pointer's alignment is really only used to eliminate unnecessary
masking -- we don't propagate a pointer's known alignment to improve the known
alignment of memory operations involving that pointer.

Here is something I'd cobbled together for a closely related issue; it
tries to increase the known alignment of a MEM by using the alignment of
a pointer to that MEM.

We've gone a slightly different (and more effective) route for that internal
issue, but this may still be worth polishing a bit and submitting.

diff --git a/gcc/emit-rtl.c b/gcc/emit-rtl.c
index 972512e81..be9ff76b5 100644
--- a/gcc/emit-rtl.c
+++ b/gcc/emit-rtl.c
@@ -859,6 +859,28 @@ gen_rtx_MEM (machine_mode mode, rtx addr)
      we clear it here.  */
   MEM_ATTRS (rt) = 0;
 
+  /* If we can deduce a higher alignment for the memory access
+     based on the pointer, then it's advantageous to do so.  */
+  unsigned int align = 0;
+  if (REG_P (addr)
+      && REG_POINTER (addr))
+    align = REGNO_POINTER_ALIGN (REGNO (addr));
+  else if (GET_CODE (addr) == PLUS
+           && REG_P (XEXP (addr, 0))
+           && REG_POINTER (XEXP (addr, 0))
+           && REGNO_POINTER_ALIGN (REGNO (XEXP (addr, 0)))
+           && GET_CODE (XEXP (addr, 1)) == CONST_INT)
+    {
+      unsigned int tmp = 1 << (ffs_hwi (INTVAL (XEXP (addr, 1))) - 1);
+      /* ALIGN is in bits.  */
+      tmp <<= 3;
+      align = REGNO_POINTER_ALIGN (REGNO (XEXP (addr, 0)));
+      align = (align > tmp) ? tmp : align;
+    }
+
+  if (align > mode_mem_attrs[(int) mode]->align)
+    set_mem_align (rt, align);
+
   return rt;
 }
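
To illustrate what the patch computes (a standalone sketch, not GCC
internals, with made-up helper names): the pointer's known alignment, in
bits, gets capped by the lowest set bit of the constant offset.

#include <stdio.h>

/* Sketch of the derivation above: ptr_align_bits plays the role of
   REGNO_POINTER_ALIGN; a nonzero byte offset caps it at the offset's
   lowest set bit, converted to bits (the tmp <<= 3 in the patch).  */
static unsigned int
mem_align_bits (unsigned int ptr_align_bits, long offset)
{
  if (offset == 0)
    return ptr_align_bits;
  unsigned int off_align_bits = (unsigned int) (offset & -offset) * 8;
  return off_align_bits < ptr_align_bits ? off_align_bits : ptr_align_bits;
}

int
main (void)
{
  /* A 128-bit (16-byte) aligned pointer plus offset 4 yields a 32-bit
     (4-byte) aligned access.  */
  printf ("%u\n", mem_align_bits (128, 4));   /* prints 32 */
  return 0;
}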

[Bug tree-optimization/94092] Code size and performance degradations after -ftree-loop-distribute-patterns was enabled at -O[2s]+

2021-04-28 Thread aoliva at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94092

Alexandre Oliva  changed:

   What|Removed |Added

   Assignee|aoliva at gcc dot gnu.org  |unassigned at gcc dot gnu.org
        URL|https://gcc.gnu.org/pipermail/gcc-patches/2021-February/565558.html |

--- Comment #13 from Alexandre Oliva  ---
Since Jim Wilson says
https://gcc.gnu.org/pipermail/gcc-patches/2021-February/565558.html does
not, after all, address the problem in this bug, I'm removing myself as
assignee.

[Bug tree-optimization/94092] Code size and performance degradations after -ftree-loop-distribute-patterns was enabled at -O[2s]+

2021-03-03 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94092

--- Comment #12 from rguenther at suse dot de  ---
On Wed, 3 Mar 2021, bina2374 at gmail dot com wrote:

> [quoted discussion snipped; see comments #8-#10 below]
> 
> The test case gcc.dg/tree-ssa/ldist-33.c fails after I added DSE.
> 
> [ldist-33.c testcase snipped; quoted in full in comment #11 below]
> 
> It is similar to the example I showed earlier. DSE eliminated
> 'c[i][j] = 0.0', so no loop will be split. My question is how to handle
> this test case: add -fno-tree-dse to dg-options, modify the test case,
> delete it, or something else?

I'd say disable store-motion with a comment (-fno-tree-loop-im; we might
also finally add a separate switch to control the store-motion part of
GIMPLE invariant motion).  I think the testcase should show we can
recognize the memset in a non-innermost loop (IIRC the above mimics what
bwaves does).
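
A sketch of how the testcase header might then look (assuming
-fno-tree-loop-im is the chosen switch; the exact options are for the
patch author to settle):

/* { dg-do compile { target size32plus } } */
/* Disable store motion so DSE does not elide the c[i][j] = 0.0 store
   before ldist can recognize the memset in the non-innermost loop.  */
/* { dg-options "-O2 -ftree-loop-distribution -ftree-loop-distribute-patterns -fno-tree-loop-im -fdump-tree-ldist-details" } */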

[Bug tree-optimization/94092] Code size and performance degradations after -ftree-loop-distribute-patterns was enabled at -O[2s]+

2021-03-02 Thread bina2374 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94092

--- Comment #11 from Mel Chen  ---
(In reply to Mel Chen from comment #10)
> [comments #8-#10 snipped; quoted in full below]

The test case gcc.dg/tree-ssa/ldist-33.c fails after I added DSE.

/* { dg-do compile { target size32plus } } */
/* { dg-options "-O2 -ftree-loop-distribution -ftree-loop-distribute-patterns
-fdump-tree-ldist-details" } */

#define N (1024)
double a[N][N], b[N][N], c[N][N];

void
foo (void)
{
  unsigned i, j, k;

  for (i = 0; i < N; ++i)
    for (j = 0; j < N; ++j)
      {
        c[i][j] = 0.0;
        for (k = 0; k < N; ++k)
          c[i][j] += a[i][k] * b[k][j];
      }
}

/* { dg-final { scan-tree-dump "Loop nest . distributed: split to 1 loops and 1 library" "ldist" } } */

It is similar to the example I showed earlier. DSE eliminated
'c[i][j] = 0.0', so no loop will be split. My question is how to handle
this test case: add -fno-tree-dse to dg-options, modify the test case,
delete it, or something else?

[Bug tree-optimization/94092] Code size and performance degradations after -ftree-loop-distribute-patterns was enabled at -O[2s]+

2021-02-25 Thread bina2374 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94092

--- Comment #10 from Mel Chen  ---
(In reply to Richard Biener from comment #9)
> [comment #9, including the passes.def patch, snipped; quoted in full
> below]

Yes, doing DSE before ldist is a simple and effective approach.
This patch has been verified to work on coremark: it improved not only
performance but also code size.

[Bug tree-optimization/94092] Code size and performance degradations after -ftree-loop-distribute-patterns was enabled at -O[2s]+

2021-02-24 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94092

--- Comment #9 from Richard Biener  ---
(In reply to Mel Chen from comment #8)
> [comment #8 snipped; quoted in full below]

Sure, it shows a cost-modeling issue, given that usually loop
distribution merges partitions which touch the same memory stream (but
IIRC maybe only for loads).  But more to the point, we fail to eliminate
the dead store, which should be apparent at least after PRE: LIM2
applied store motion, but only PRE elides the resulting load of C[i].
Usually DCE and DSE come in pairs, but after PRE we have DCE and CDDCE
without an accompanying DSE; the next DSE only happens after loop
distribution.

Which means we should eventually do

diff --git a/gcc/passes.def b/gcc/passes.def
index e9ed3c7bc57..be3a9becde0 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -254,6 +254,7 @@ along with GCC; see the file COPYING3.  If not see
       NEXT_PASS (pass_sancov);
       NEXT_PASS (pass_asan);
       NEXT_PASS (pass_tsan);
+      NEXT_PASS (pass_dse);
       NEXT_PASS (pass_dce);
       /* Pass group that runs when 1) enabled, 2) there are loops
          in the function.  Make sure to run pass_fix_loops before

[Bug tree-optimization/94092] Code size and performance degradations after -ftree-loop-distribute-patterns was enabled at -O[2s]+

2021-02-24 Thread bina2374 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94092

--- Comment #8 from Mel Chen  ---
Sorry for using a bad example to describe the problem I am facing.  Let
me clarify my question with a more precise example.

void array_mul(int N, int *C, short *A, short *B) {
  int i, j;
  for (i = 0; i < N; i++) {
    C[i] = 0; // Will be transformed to __builtin_memset
    for (j = 0; j < N; j++) {
      C[i] += (int)A[i * N + j] * (int)B[j];
    }
  }
}

If I compile the case with -O2 -fno-tree-loop-distribute-patterns, the
store 'C[i] = 0' can be eliminated by dead store elimination (dse3).  But
without -fno-tree-loop-distribute-patterns, it is transformed to a memset
by loop distribution (ldist), because ldist runs before dse3; the memset
is then never eliminated.
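
Roughly, the distributed form looks like this (a hand-written sketch of
the ldist transform, not actual compiler output):

void array_mul_ldist(int N, int *C, short *A, short *B) {
  int i, j;
  /* Partition split out by ldist and recognized as a memset.  */
  if (N > 0)
    __builtin_memset(C, 0, (unsigned)N * sizeof(int));
  /* Remaining partition; the accumulation now reloads C[i].  */
  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
      C[i] += (int)A[i * N + j] * (int)B[j];
}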

Another point: if there are other operations at the same loop level as
the store, is it really beneficial to distribute the loop and then
convert the split-out partition to a builtin call?

[Bug tree-optimization/94092] Code size and performance degradations after -ftree-loop-distribute-patterns was enabled at -O[2s]+

2021-02-23 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94092

--- Comment #7 from Richard Biener  ---
Yes, it was expected that the patch cannot handle all cases, since the
ldist transform most definitely loses information on the access that is
otherwise used to improve alignment info.  I suggested adding
__builtin_mem{set,cpy,move} variants with an extra argument specifying
the (common, for cpy/move) 'alignment size'.

Note that for unknown loop bounds we want to dispatch to libc memset even
when we know much about alignment, since libc is expected to have optimal
code sequences for a variety of alignment/size combinations.  As said
elsewhere, I believe we have code that can dynamically dispatch between a
short inline sequence and a libcall depending on the actual length, but I
don't remember whether this is/was in generic or in target-specific code
(I think it is done only when value-profile data is available).
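
As a purely hypothetical illustration of such a variant (the name
__builtin_memset_align and its signature are made up here; no such
builtin exists in GCC):

/* What ldist emits today for the split-out store partition: */
void clear_row(int *c, unsigned n) {
  __builtin_memset(c, 0, n * sizeof(int));
  /* A hypothetical alignment-carrying variant could instead be
       __builtin_memset_align(c, 0, n * sizeof(int), 4);
     letting RTL expansion keep the 4-byte alignment known from the
     original int stores.  */
}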

[Bug tree-optimization/94092] Code size and performance degradations after -ftree-loop-distribute-patterns was enabled at -O[2s]+

2021-02-23 Thread wilson at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94092

--- Comment #6 from Jim Wilson  ---
I tried Alex's patch; it doesn't work on this testcase.

One problem is that the loop bound is unknown, so the memset size is
estimated as at most 0x1fffc bytes.  The code calls
default_use_by_pieces_infrastructure_p, which wants the total number of
instructions to be less than SET_RATIO, which is 16.  The total number
of instructions is computed as 0x1fffc, so the attempt fails.

There is a contributing problem here: the alignment of the dest was
computed as 8 bits, even though we have a pointer to int, which has
32-bit alignment on a strict-alignment system.  That happens in
expand_builtin_memset_args, which calls get_pointer_alignment on DEST.
DEST is an SSA_NAME, and the SSA_NAME pointer info says its alignment is
0, so it defaults to 8-bit byte alignment.  I haven't tried to figure
out where that went wrong.
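
A source-level workaround sketch (it assumes the caller really does pass
an int-aligned pointer): seed the pointer alignment explicitly so
get_pointer_alignment sees more than the 8-bit default.

void array_clear(int N, int *C) {
  /* Promise 4-byte alignment; the memset that ldist creates is then
     expanded with 32-bit rather than 8-bit destination alignment.  */
  int *c = __builtin_assume_aligned(C, 4);
  for (int i = 0; i < N; i++)
    c[i] = 0; /* becomes __builtin_memset via ldist */
}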

The patch does work as advertised on libgcc soft-float code.  For an rv32i
build I see the memset calls in libgcc.a reduced from 31 to 10.

[Bug tree-optimization/94092] Code size and performance degradations after -ftree-loop-distribute-patterns was enabled at -O[2s]+

2021-02-19 Thread aoliva at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94092

Alexandre Oliva  changed:

   What|Removed |Added

 Status|WAITING |ASSIGNED
 CC||aoliva at gcc dot gnu.org
   Assignee|unassigned at gcc dot gnu.org  |aoliva at gcc dot gnu.org

--- Comment #5 from Alexandre Oliva  ---
Mine

[Bug tree-optimization/94092] Code size and performance degradations after -ftree-loop-distribute-patterns was enabled at -O[2s]+

2020-03-09 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94092

--- Comment #4 from Richard Biener  ---
With profile feedback we (target or middle-end) can produce specialized
RTL expansion doing small copies inline and larger ones out of line.  The
idea of GIMPLE-level pattern detection is that even for small sizes the
target usually knows how to expand the copy optimally, while the user may
have written a byte-copying loop.

Of course that requires targets to pay attention.
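
For example, a user-written byte-copying loop like the following is the
kind of code -ftree-loop-distribute-patterns rewrites into a library
call that the target can then expand optimally:

/* ldist recognizes this pattern and replaces the loop with a call to
   memcpy (or memmove when the pointers may overlap).  */
void copy_bytes(char *dst, const char *src, unsigned n) {
  unsigned i;
  for (i = 0; i < n; i++)
    dst[i] = src[i];
}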

Note most compiler optimization involves some heuristics, and clearly
heuristics can be off.  I wonder if you can obtain better coremark
results by using link-time optimization, if you're only after benchmark
numbers...

[Bug tree-optimization/94092] Code size and performance degradations after -ftree-loop-distribute-patterns was enabled at -O[2s]+

2020-03-09 Thread glisse at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94092

--- Comment #3 from Marc Glisse  ---
Does profile feedback (so we have an idea of the loop count) make any
difference?  It seems clear that for a loop that in practice just copies
one long, having to arrange the arguments, make a function call, test
for alignment, etc., is a lot of overhead.  What to do about it,
though...

[Bug tree-optimization/94092] Code size and performance degradations after -ftree-loop-distribute-patterns was enabled at -O[2s]+

2020-03-09 Thread pinskia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94092

Andrew Pinski  changed:

   What|Removed |Added

 Ever confirmed|0   |1
 Status|UNCONFIRMED |WAITING
   Last reconfirmed||2020-03-09

[Bug tree-optimization/94092] Code size and performance degradations after -ftree-loop-distribute-patterns was enabled at -O[2s]+

2020-03-09 Thread pinskia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94092

--- Comment #2 from Andrew Pinski  ---
(In reply to Andrew Pinski from comment #1)
> >For coremark, this is not only harmful to performance, but also code size.
> 
> 
> Bad, very bad benchmark 
Coremark only handles very, very small data sets by default.  I think
you should run your real code through this and see whether it improves
or not.

Also, -O2 does not care much about code size; only -Os does.

Can you provide a test case, besides coremark, where performance
decreases and size increases?

[Bug tree-optimization/94092] Code size and performance degradations after -ftree-loop-distribute-patterns was enabled at -O[2s]+

2020-03-09 Thread pinskia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94092

--- Comment #1 from Andrew Pinski  ---
>For coremark, this is not only harmful to performance, but also code size.


Bad, very bad benchmark.

SPEC CPU is closer to real code, but these are still bad benchmarks.