[Bug tree-optimization/94092] Code size and performance degradations after -ftree-loop-distribute-patterns was enabled at -O[2s]+
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94092

Richard Biener changed:

           What    |Removed     |Added
 ----------------------------------------
         Status    |ASSIGNED    |NEW
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94092

--- Comment #15 from rguenther at suse dot de ---
On Fri, 30 Apr 2021, law at gcc dot gnu.org wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94092
>
> Jeffrey A. Law changed:
>
>            What    |Removed    |Added
>  ----------------------------------------
>                CC  |           |law at gcc dot gnu.org
>
> --- Comment #14 from Jeffrey A. Law ---
> WRT Jim's comment about alignments in c#6.
>
> Right now a pointer's alignment is really only used to eliminate
> unnecessary masking -- we don't propagate a pointer's known alignment
> to improve the known alignment of memory operations involving that
> pointer.

Hmm, but we do - in set_mem_attributes_minus_bitpos.  But maybe the
below is for way later RTL generation.

> This is something I'd cobbled together for a closely related issue
> which will try to increase the known alignment of a MEM by using the
> alignment of a pointer to that MEM.
>
> We've gone a slightly different (and more effective) route for that
> internal issue, but this may still be worth polishing a bit and
> submitting.
>
> diff --git a/gcc/emit-rtl.c b/gcc/emit-rtl.c
> index 972512e81..be9ff76b5 100644
> --- a/gcc/emit-rtl.c
> +++ b/gcc/emit-rtl.c
> @@ -859,6 +859,28 @@ gen_rtx_MEM (machine_mode mode, rtx addr)
>       we clear it here.  */
>    MEM_ATTRS (rt) = 0;
>
> +  /* If we can deduce a higher alignment for the memory access
> +     based on the pointer, then it's advantageous to do so.  */
> +  unsigned int align = 0;
> +  if (REG_P (addr)
> +      && REG_POINTER (addr))
> +    align = REGNO_POINTER_ALIGN (REGNO (addr));
> +  else if (GET_CODE (addr) == PLUS
> +           && REG_P (XEXP (addr, 0))
> +           && REG_POINTER (XEXP (addr, 0))
> +           && REGNO_POINTER_ALIGN (REGNO (XEXP (addr, 0)))
> +           && GET_CODE (XEXP (addr, 1)) == CONST_INT)
> +    {
> +      unsigned int tmp = 1 << (ffs_hwi (INTVAL (XEXP (addr, 1))) - 1);
> +      /* ALIGN is in bits.  */
> +      tmp <<= 3;
> +      align = REGNO_POINTER_ALIGN (REGNO (XEXP (addr, 0)));
> +      align = (align > tmp) ? tmp : align;
> +    }
> +
> +  if (align > mode_mem_attrs[(int) mode]->align)
> +    set_mem_align (rt, align);
> +
>    return rt;
>  }
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94092

Jeffrey A. Law changed:

           What    |Removed    |Added
 ----------------------------------------
               CC  |           |law at gcc dot gnu.org

--- Comment #14 from Jeffrey A. Law ---
WRT Jim's comment about alignments in c#6.

Right now a pointer's alignment is really only used to eliminate
unnecessary masking -- we don't propagate a pointer's known alignment to
improve the known alignment of memory operations involving that pointer.

This is something I'd cobbled together for a closely related issue which
will try to increase the known alignment of a MEM by using the alignment
of a pointer to that MEM.

We've gone a slightly different (and more effective) route for that
internal issue, but this may still be worth polishing a bit and
submitting.

diff --git a/gcc/emit-rtl.c b/gcc/emit-rtl.c
index 972512e81..be9ff76b5 100644
--- a/gcc/emit-rtl.c
+++ b/gcc/emit-rtl.c
@@ -859,6 +859,28 @@ gen_rtx_MEM (machine_mode mode, rtx addr)
      we clear it here.  */
   MEM_ATTRS (rt) = 0;

+  /* If we can deduce a higher alignment for the memory access
+     based on the pointer, then it's advantageous to do so.  */
+  unsigned int align = 0;
+  if (REG_P (addr)
+      && REG_POINTER (addr))
+    align = REGNO_POINTER_ALIGN (REGNO (addr));
+  else if (GET_CODE (addr) == PLUS
+           && REG_P (XEXP (addr, 0))
+           && REG_POINTER (XEXP (addr, 0))
+           && REGNO_POINTER_ALIGN (REGNO (XEXP (addr, 0)))
+           && GET_CODE (XEXP (addr, 1)) == CONST_INT)
+    {
+      unsigned int tmp = 1 << (ffs_hwi (INTVAL (XEXP (addr, 1))) - 1);
+      /* ALIGN is in bits.  */
+      tmp <<= 3;
+      align = REGNO_POINTER_ALIGN (REGNO (XEXP (addr, 0)));
+      align = (align > tmp) ? tmp : align;
+    }
+
+  if (align > mode_mem_attrs[(int) mode]->align)
+    set_mem_align (rt, align);
+
   return rt;
 }
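The offset handling in the patch above can be illustrated standalone. This is a hedged sketch in plain C, not GCC internals; the function names `offset_align_bits` and `mem_align_bits` are made up for illustration. The alignment implied by a nonzero constant byte offset is its lowest set bit, scaled from bytes to bits, and the resulting MEM alignment is capped at the base pointer's known alignment:

```c
#include <assert.h>

/* Alignment (in bits) implied by a nonzero constant byte offset:
   the lowest set bit, in bytes, scaled to bits (mirrors the
   ffs_hwi/tmp <<= 3 computation in the patch).  */
static unsigned int offset_align_bits (long offset)
{
  unsigned long bytes = (unsigned long) (offset & -offset); /* lowest set bit */
  return (unsigned int) (bytes << 3);                       /* bytes -> bits */
}

/* Combined alignment of reg + offset: the minimum of the pointer's
   known alignment and the alignment the offset preserves.  */
static unsigned int mem_align_bits (unsigned int ptr_align_bits, long offset)
{
  unsigned int off_align = offset_align_bits (offset);
  return ptr_align_bits < off_align ? ptr_align_bits : off_align;
}
```

For example, a 64-bit-aligned pointer plus an offset of 2 bytes can only be assumed 16-bit aligned, while an offset of 8 keeps the full 64-bit alignment.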
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94092

Alexandre Oliva changed:

           What    |Removed                      |Added
 ------------------------------------------------------------------------
         Assignee  |aoliva at gcc dot gnu.org    |unassigned at gcc dot gnu.org
              URL  |https://gcc.gnu.org/pipermail/gcc-patches/2021-February/565558.html |

--- Comment #13 from Alexandre Oliva ---
Since Jim Wilson says
https://gcc.gnu.org/pipermail/gcc-patches/2021-February/565558.html does
not address the problem in this bug after all, I'm removing myself as
assignee.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94092

--- Comment #12 from rguenther at suse dot de ---
On Wed, 3 Mar 2021, bina2374 at gmail dot com wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94092
>
> --- Comment #11 from Mel Chen ---
> (In reply to Mel Chen from comment #10)
> > (In reply to Richard Biener from comment #9)
> > > (In reply to Mel Chen from comment #8)
> > > > Sorry for using the bad example to describe the problem I am
> > > > facing.  Let me clarify my question with a more precise example.
> > > >
> > > > void array_mul(int N, int *C, short *A, short *B) {
> > > >   int i, j;
> > > >   for (i = 0; i < N; i++) {
> > > >     C[i] = 0; // Will be transformed to __builtin_memset
> > > >     for (j = 0; j < N; j++) {
> > > >       C[i] += (int)A[i * N + j] * (int)B[j];
> > > >     }
> > > >   }
> > > > }
> > > >
> > > > If I compile the case with -O2 -fno-tree-loop-distribute-patterns,
> > > > the store operation 'C[i] = 0' can be eliminated by dead store
> > > > elimination (dse3).  But without -fno-tree-loop-distribute-patterns,
> > > > it will be transformed to memset by loop distribution (ldist)
> > > > because ldist executes before dse3.  Finally the memset will not
> > > > be eliminated.
> > > >
> > > > Another point is if there are other operations in the same level
> > > > loop as the store operation, is it really beneficial to do loop
> > > > distribution and then convert to a builtin function?
> > >
> > > Sure, it shows a cost modeling issue given that usually loop
> > > distribution merges partitions which touch the same memory stream
> > > (but IIRC maybe only for loads).  But more to the point we're
> > > missing to eliminate the dead store, which should be apparent at
> > > least after PRE - LIM2 applied store motion but only PRE elides the
> > > resulting load of C[i].  Usually DCE and DSE come in pairs but
> > > after PRE we have DCE, CDDCE w/o an accompanying DSE, with the next
> > > DSE only happening after loop distribution.
> > >
> > > Which means we should eventually do
> > >
> > > diff --git a/gcc/passes.def b/gcc/passes.def
> > > index e9ed3c7bc57..be3a9becde0 100644
> > > --- a/gcc/passes.def
> > > +++ b/gcc/passes.def
> > > @@ -254,6 +254,7 @@ along with GCC; see the file COPYING3.  If not see
> > >        NEXT_PASS (pass_sancov);
> > >        NEXT_PASS (pass_asan);
> > >        NEXT_PASS (pass_tsan);
> > > +      NEXT_PASS (pass_dse);
> > >        NEXT_PASS (pass_dce);
> > >        /* Pass group that runs when 1) enabled, 2) there are loops
> > >   in the function.  Make sure to run pass_fix_loops before
> >
> > Yes, doing DSE before ldist is a simple and effective way.
> > This patch has been verified to work on coremark.  Not only improved
> > performance, but also code size.
>
> The test case gcc.dg/tree-ssa/ldist-33.c fails after I added DSE.
>
> /* { dg-do compile { target size32plus } } */
> /* { dg-options "-O2 -ftree-loop-distribution -ftree-loop-distribute-patterns
> -fdump-tree-ldist-details" } */
>
> #define N (1024)
> double a[N][N], b[N][N], c[N][N];
>
> void
> foo (void)
> {
>   unsigned i, j, k;
>
>   for (i = 0; i < N; ++i)
>     for (j = 0; j < N; ++j)
>       {
>         c[i][j] = 0.0;
>         for (k = 0; k < N; ++k)
>           c[i][j] += a[i][k] * b[k][j];
>       }
> }
>
> /* { dg-final { scan-tree-dump "Loop nest . distributed: split to 1 loops
> and 1 library" "ldist" } } */
>
> It is similar to the example I showed earlier.  DSE eliminated
> 'c[i][j] = 0.0' so no loop will be split.  My question is how to handle
> this test case?  Add -fno-tree-dse into dg-options, modify this test
> case, delete this test case, or something else?

I'd say disable store-motion with a comment (-fno-tree-loop-im, we might
also finally add a separate switch to control the store-motion part of
GIMPLE invariant motion).  I think the testcase should show we can
recognize the memset in a non-innermost loop (IIRC the above mimics what
bwaves does).
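Concretely, the suggestion would amount to something like the following change to the ldist-33.c directives (a sketch only, assuming -fno-tree-loop-im is sufficient to suppress the store motion in question):

```c
/* { dg-do compile { target size32plus } } */
/* Disable store motion (part of GIMPLE invariant motion) so that DSE
   cannot elide the 'c[i][j] = 0.0' store before loop distribution.  */
/* { dg-options "-O2 -ftree-loop-distribution -ftree-loop-distribute-patterns -fno-tree-loop-im -fdump-tree-ldist-details" } */
```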
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94092

--- Comment #11 from Mel Chen ---
(In reply to Mel Chen from comment #10)
> (In reply to Richard Biener from comment #9)
> > (In reply to Mel Chen from comment #8)
> > > Sorry for using the bad example to describe the problem I am
> > > facing.  Let me clarify my question with a more precise example.
> > >
> > > void array_mul(int N, int *C, short *A, short *B) {
> > >   int i, j;
> > >   for (i = 0; i < N; i++) {
> > >     C[i] = 0; // Will be transformed to __builtin_memset
> > >     for (j = 0; j < N; j++) {
> > >       C[i] += (int)A[i * N + j] * (int)B[j];
> > >     }
> > >   }
> > > }
> > >
> > > If I compile the case with -O2 -fno-tree-loop-distribute-patterns,
> > > the store operation 'C[i] = 0' can be eliminated by dead store
> > > elimination (dse3).  But without -fno-tree-loop-distribute-patterns,
> > > it will be transformed to memset by loop distribution (ldist)
> > > because ldist executes before dse3.  Finally the memset will not be
> > > eliminated.
> > >
> > > Another point is if there are other operations in the same level
> > > loop as the store operation, is it really beneficial to do loop
> > > distribution and then convert to a builtin function?
> >
> > Sure, it shows a cost modeling issue given that usually loop
> > distribution merges partitions which touch the same memory stream
> > (but IIRC maybe only for loads).  But more to the point we're missing
> > to eliminate the dead store, which should be apparent at least after
> > PRE - LIM2 applied store motion but only PRE elides the resulting
> > load of C[i].  Usually DCE and DSE come in pairs but after PRE we
> > have DCE, CDDCE w/o an accompanying DSE, with the next DSE only
> > happening after loop distribution.
> >
> > Which means we should eventually do
> >
> > diff --git a/gcc/passes.def b/gcc/passes.def
> > index e9ed3c7bc57..be3a9becde0 100644
> > --- a/gcc/passes.def
> > +++ b/gcc/passes.def
> > @@ -254,6 +254,7 @@ along with GCC; see the file COPYING3.  If not see
> >        NEXT_PASS (pass_sancov);
> >        NEXT_PASS (pass_asan);
> >        NEXT_PASS (pass_tsan);
> > +      NEXT_PASS (pass_dse);
> >        NEXT_PASS (pass_dce);
> >        /* Pass group that runs when 1) enabled, 2) there are loops
> >   in the function.  Make sure to run pass_fix_loops before
>
> Yes, doing DSE before ldist is a simple and effective way.
> This patch has been verified to work on coremark.  Not only improved
> performance, but also code size.

The test case gcc.dg/tree-ssa/ldist-33.c fails after I added DSE.

/* { dg-do compile { target size32plus } } */
/* { dg-options "-O2 -ftree-loop-distribution -ftree-loop-distribute-patterns
-fdump-tree-ldist-details" } */

#define N (1024)
double a[N][N], b[N][N], c[N][N];

void
foo (void)
{
  unsigned i, j, k;

  for (i = 0; i < N; ++i)
    for (j = 0; j < N; ++j)
      {
        c[i][j] = 0.0;
        for (k = 0; k < N; ++k)
          c[i][j] += a[i][k] * b[k][j];
      }
}

/* { dg-final { scan-tree-dump "Loop nest . distributed: split to 1 loops
and 1 library" "ldist" } } */

It is similar to the example I showed earlier.  DSE eliminated
'c[i][j] = 0.0' so no loop will be split.  My question is how to handle
this test case?  Add -fno-tree-dse into dg-options, modify this test
case, delete this test case, or something else?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94092

--- Comment #10 from Mel Chen ---
(In reply to Richard Biener from comment #9)
> (In reply to Mel Chen from comment #8)
> > Sorry for using the bad example to describe the problem I am facing.
> > Let me clarify my question with a more precise example.
> >
> > void array_mul(int N, int *C, short *A, short *B) {
> >   int i, j;
> >   for (i = 0; i < N; i++) {
> >     C[i] = 0; // Will be transformed to __builtin_memset
> >     for (j = 0; j < N; j++) {
> >       C[i] += (int)A[i * N + j] * (int)B[j];
> >     }
> >   }
> > }
> >
> > If I compile the case with -O2 -fno-tree-loop-distribute-patterns,
> > the store operation 'C[i] = 0' can be eliminated by dead store
> > elimination (dse3).  But without -fno-tree-loop-distribute-patterns,
> > it will be transformed to memset by loop distribution (ldist) because
> > ldist executes before dse3.  Finally the memset will not be
> > eliminated.
> >
> > Another point is if there are other operations in the same level loop
> > as the store operation, is it really beneficial to do loop
> > distribution and then convert to a builtin function?
>
> Sure, it shows a cost modeling issue given that usually loop
> distribution merges partitions which touch the same memory stream (but
> IIRC maybe only for loads).  But more to the point we're missing to
> eliminate the dead store, which should be apparent at least after PRE -
> LIM2 applied store motion but only PRE elides the resulting load of
> C[i].  Usually DCE and DSE come in pairs but after PRE we have DCE,
> CDDCE w/o an accompanying DSE, with the next DSE only happening after
> loop distribution.
>
> Which means we should eventually do
>
> diff --git a/gcc/passes.def b/gcc/passes.def
> index e9ed3c7bc57..be3a9becde0 100644
> --- a/gcc/passes.def
> +++ b/gcc/passes.def
> @@ -254,6 +254,7 @@ along with GCC; see the file COPYING3.  If not see
>        NEXT_PASS (pass_sancov);
>        NEXT_PASS (pass_asan);
>        NEXT_PASS (pass_tsan);
> +      NEXT_PASS (pass_dse);
>        NEXT_PASS (pass_dce);
>        /* Pass group that runs when 1) enabled, 2) there are loops
>   in the function.  Make sure to run pass_fix_loops before

Yes, doing DSE before ldist is a simple and effective way.  This patch
has been verified to work on coremark.  Not only improved performance,
but also code size.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94092

--- Comment #9 from Richard Biener ---
(In reply to Mel Chen from comment #8)
> Sorry for using the bad example to describe the problem I am facing.
> Let me clarify my question with a more precise example.
>
> void array_mul(int N, int *C, short *A, short *B) {
>   int i, j;
>   for (i = 0; i < N; i++) {
>     C[i] = 0; // Will be transformed to __builtin_memset
>     for (j = 0; j < N; j++) {
>       C[i] += (int)A[i * N + j] * (int)B[j];
>     }
>   }
> }
>
> If I compile the case with -O2 -fno-tree-loop-distribute-patterns, the
> store operation 'C[i] = 0' can be eliminated by dead store elimination
> (dse3).  But without -fno-tree-loop-distribute-patterns, it will be
> transformed to memset by loop distribution (ldist) because ldist
> executes before dse3.  Finally the memset will not be eliminated.
>
> Another point is if there are other operations in the same level loop
> as the store operation, is it really beneficial to do loop distribution
> and then convert to a builtin function?

Sure, it shows a cost modeling issue given that usually loop
distribution merges partitions which touch the same memory stream (but
IIRC maybe only for loads).  But more to the point we're missing to
eliminate the dead store, which should be apparent at least after PRE -
LIM2 applied store motion but only PRE elides the resulting load of
C[i].  Usually DCE and DSE come in pairs but after PRE we have DCE,
CDDCE w/o an accompanying DSE, with the next DSE only happening after
loop distribution.

Which means we should eventually do

diff --git a/gcc/passes.def b/gcc/passes.def
index e9ed3c7bc57..be3a9becde0 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -254,6 +254,7 @@ along with GCC; see the file COPYING3.  If not see
       NEXT_PASS (pass_sancov);
       NEXT_PASS (pass_asan);
       NEXT_PASS (pass_tsan);
+      NEXT_PASS (pass_dse);
       NEXT_PASS (pass_dce);
       /* Pass group that runs when 1) enabled, 2) there are loops
   in the function.  Make sure to run pass_fix_loops before
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94092

--- Comment #8 from Mel Chen ---
Sorry for using the bad example to describe the problem I am facing.
Let me clarify my question with a more precise example.

void array_mul(int N, int *C, short *A, short *B) {
  int i, j;
  for (i = 0; i < N; i++) {
    C[i] = 0; // Will be transformed to __builtin_memset
    for (j = 0; j < N; j++) {
      C[i] += (int)A[i * N + j] * (int)B[j];
    }
  }
}

If I compile the case with -O2 -fno-tree-loop-distribute-patterns, the
store operation 'C[i] = 0' can be eliminated by dead store elimination
(dse3).  But without -fno-tree-loop-distribute-patterns, it will be
transformed to memset by loop distribution (ldist) because ldist
executes before dse3.  Finally the memset will not be eliminated.

Another point is if there are other operations in the same level loop as
the store operation, is it really beneficial to do loop distribution and
then convert to a builtin function?
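For reference, the ldist transform under discussion effectively splits the store partition off into its own loop and recognizes it as a memset over the whole array. A hedged C sketch of the resulting shape (not actual GCC output; the function name is made up for illustration):

```c
#include <string.h>

void array_mul_distributed(int N, int *C, short *A, short *B) {
  int i, j;
  if (N <= 0)
    return;
  /* Distributed store partition: the per-iteration C[i] = 0 stores
     become a single memset over the output array.  */
  memset(C, 0, (size_t)N * sizeof(int));
  /* The reduction stays behind in its own loop nest; the memset above
     is now dead work, which is exactly the complaint in this bug.  */
  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
      C[i] += (int)A[i * N + j] * (int)B[j];
}
```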
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94092

--- Comment #7 from Richard Biener ---
Yes, it was expected that the patch cannot handle all cases since most
definitely the ldist transform loses information on the access that is
otherwise used to improve alignment info.  I suggested adding
__builtin_mem{set,cpy,move} variants with an extra argument specifying
the (common, for cpy/move) 'alignment size'.

Note that for unknown loop bounds we want to dispatch to libc memset
even when we know much about alignment, since libc is expected to have
optimal code sequences for a variety of alignment/size combinations.
As said elsewhere, I believe we have code that can dynamically dispatch
between a short inline sequence and a libcall dependent on the actual
length, but I don't remember whether this is/was in generic code or in
target-specific code (but I think it is done only when value-profile
data is available).
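The dynamic-dispatch idea mentioned above can be sketched in plain C. This is a hedged illustration only: the threshold and the function name are invented, and GCC would perform such a dispatch during RTL expansion rather than in source code:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical cutoff below which an inline byte loop is assumed
   cheaper than the call-setup/alignment-test overhead of a libcall.  */
#define INLINE_MEMSET_THRESHOLD 16

static void memset_dispatch(void *dst, int c, size_t n) {
  if (n <= INLINE_MEMSET_THRESHOLD) {
    /* Short inline sequence for small lengths.  */
    unsigned char *p = dst;
    for (size_t i = 0; i < n; i++)
      p[i] = (unsigned char)c;
  } else {
    /* Large or unknown lengths: libc is expected to have optimal
       sequences for the various alignment/size combinations.  */
    memset(dst, c, n);
  }
}
```

The length test here is the runtime analogue of what value-profile data would decide statically: with profile feedback, the compiler can pick one of the two branches (or emit both) per call site.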
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94092

--- Comment #6 from Jim Wilson ---
Trying Alex's patch, it doesn't work on this testcase.  One problem is
that the loop bound is unknown, so the memset size is estimated as max
0x1fffc bytes.  The code calls default_use_by_pieces_infrastructure_p
which wants the total number of instructions to be less than SET_RATIO
which is 16.  The total number of instructions is computed as 0x1fffc.
So the attempt fails.

There is a contributing problem here that the alignment of the dest was
computed as 8 bits, even though we have a pointer to int which has
32-bit alignment on a strict-alignment system.  That problem happens in
expand_builtin_memset_args which calls get_pointer_alignment on dest.
DEST is an SSA_NAME, and the SSA_NAME pointer info says align is 0, so
it defaults to 8-bit byte alignment.  I haven't tried to figure out
where that went wrong.

The patch does work as advertised on libgcc soft-float code.  For an
rv32i build I see the memset calls in libgcc.a reduced from 31 to 10.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94092

Alexandre Oliva changed:

           What    |Removed                        |Added
 ------------------------------------------------------------------------
           Status  |WAITING                        |ASSIGNED
               CC  |                               |aoliva at gcc dot gnu.org
         Assignee  |unassigned at gcc dot gnu.org  |aoliva at gcc dot gnu.org

--- Comment #5 from Alexandre Oliva ---
Mine
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94092

--- Comment #4 from Richard Biener ---
With profile feedback we (target or middle-end) can produce specialized
RTL expansion doing small copies inline and larger ones out of line.

The idea of GIMPLE-level pattern detection is that even for small sizes
the target usually knows how to expand the copy optimally, while the
user may have written a byte-copying loop.  Of course that requires
targets to pay attention.

Note most compiler optimization involves some heuristics, and clearly
heuristics can be off.  I wonder if you can obtain better coremark
results by using link-time optimization.  If you're only after benchmark
numbers...
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94092 --- Comment #3 from Marc Glisse --- Does profile feedback (so we have an idea on the loop count) make any difference? It seems clear that for a loop that in practice just copies one long, having to arrange the arguments, make a function call, test for alignment, etc, is a lot of overhead. What to do about it though...
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94092

Andrew Pinski changed:

           What        |Removed      |Added
 ----------------------------------------------
     Ever confirmed    |0            |1
             Status    |UNCONFIRMED  |WAITING
   Last reconfirmed    |             |2020-03-09
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94092

--- Comment #2 from Andrew Pinski ---
(In reply to Andrew Pinski from comment #1)
> > For coremark, this is not only harmful to performance, but also code
> > size.
>
> Bad, very bad benchmark

Coremark only handles very very small data sets by default.  I think you
should run your real code over this and see if it improves or not.  Also
-O2 does not care exactly about code size; that's -Os only.

Can you provide a testcase, besides coremark, which decreases in
performance and increases in size?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94092

--- Comment #1 from Andrew Pinski ---
> For coremark, this is not only harmful to performance, but also code
> size.

Bad, very bad benchmark.  SPEC CPU is closer to real code but still a
bad benchmark.