[Bug tree-optimization/68906] [6 Regression] ICE at -O3 on x86_64-linux-gnu: verify_ssa failed
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68906
--- Comment #3 from Yuri Rumyantsev ---
I've prepared a simple fix which cures the ICE. I will send it for review tomorrow.
2015-12-15 12:50 GMT+03:00 jakub at gcc dot gnu.org:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68906
>
> Jakub Jelinek changed:
>
>            What    |Removed                     |Added
>                  CC|                            |jakub at gcc dot gnu.org
>
> --- Comment #2 from Jakub Jelinek ---
> This doesn't look to me like a mere omission to invalidate debug stmts after
> some stmt move that (correctly) has not considered debug stmts when
> determining if they should be moved or not, but it looks to me like
> wrong-code transformation.
> Before unswitch, if c is non-zero, we have endless loop, but during
> unswitching it is wrongly changed to branch to the bb that returns instead.
> Say if you compile with -O3 (no -g):
> int a;
> volatile int b;
> short c, d;
> int
> fn1 ()
> {
>   int e;
>   for (;;)
>     {
>       a = 3;
>       if (c)
>         continue;
>       e = 0;
>       for (; e > -30; e--)
>         if (b)
>           {
>             int f = e;
>             return d;
>           }
>     }
> }
>
> int
> main ()
> {
>   c = 1;
>   asm volatile ("" : : "m" (c) : "memory");
>   fn1 ();
>   __builtin_abort ();
> }
>
> then before the change this would just hang (expected), now it aborts instead.
>
> --
> You are receiving this mail because:
> You are on the CC list for the bug.
[Bug rtl-optimization/68920] [6 Regression] Undesirable if-conversion for a rarely taken branch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68920 --- Comment #4 from Yuri Rumyantsev --- You are quite right: the cost model is very poor. We did a simple experiment and set the branch cost to 1, but noticed performance regressions on other benchmarks. When we set it to 2 we did not see any difference, likely because branch deletion is preferred for equal costs. Is there any tuning option in the if-converter to revert this decision? Secondly, we must enhance the cost model by adding the cost of a conditional move for all targets, but this is for GCC 7.
[Bug tree-optimization/68894] New: Recognition min/max pattern with multiple arguments.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68894
Bug ID: 68894
Summary: Recognition min/max pattern with multiple arguments.
Product: gcc
Version: 6.0
Status: UNCONFIRMED
Severity: enhancement
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: ysrumyan at gmail dot com
Target Milestone: ---
Analyzing one important benchmark (RGB to CMYK conversion), we found that the MIN pattern is not recognized for more than 2 arguments. I attached a simple reproducer which exhibits the issue - explicit use of multiple
[Bug tree-optimization/68894] Recognition min/max pattern with multiple arguments.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68894 --- Comment #1 from Yuri Rumyantsev --- Created attachment 37026 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=37026&action=edit test-case to reproduce It is sufficient to compile it with -O3 option to see the difference in produced assembly.
[Bug rtl-optimization/68898] ICE if rtl if-conversion is off.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68898 --- Comment #2 from Yuri Rumyantsev --- Forgot to add stack trace:
Error: dominator of 6 status unknown
t2.f:41:0: internal compiler error: Segmentation fault
0xb4e62f crash_signal
	/export/users/gnutester/stability/svn/trunk/gcc/toplev.c:334
0x376583567f ???
	/home/glibctest/rpmbuild/BUILD/glibc-2.17-c758a686/signal/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0
0x7e7d0d verify_dominators(cdi_direction)
	/export/users/gnutester/stability/svn/trunk/gcc/dominance.c:1033
0x7e7fa7 checking_verify_dominators
	/export/users/gnutester/stability/svn/trunk/gcc/dominance.h:71
0x7e7fa7 calculate_dominance_info(cdi_direction)
	/export/users/gnutester/stability/svn/trunk/gcc/dominance.c:664
0x9a178e ira
	/export/users/gnutester/stability/svn/trunk/gcc/ira.c:5155
0x9a178e execute
	/export/users/gnutester/stability/svn/trunk/gcc/ira.c:5511
[Bug rtl-optimization/68898] ICE if rtl if-conversion is off.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68898 --- Comment #1 from Yuri Rumyantsev --- Created attachment 37028 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=37028&action=edit test-case to reproduce Need to compile with -O2 -m32 -ffast-math options to reproduce. Note that 32-bit and -ffast-math flags are essential.
[Bug rtl-optimization/68898] New: ICE if rtl if-conversion is off.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68898
Bug ID: 68898
Summary: ICE if rtl if-conversion is off.
Product: gcc
Version: 6.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: ysrumyan at gmail dot com
Target Milestone: ---
I tried to play with the if-conversion flag and got an ICE on all benchspec2 from the spec2000 suite. I attach a simple Fortran reproducer. Note that the "-fno-if-conversion2" option does not lead to CF.
[Bug tree-optimization/68522] [6 Regression] SPEC CPU2006 435.gromacs miscomparison
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68522
Yuri Rumyantsev changed:
           What    |Removed                     |Added
                 CC|                            |ysrumyan at gmail dot com
--- Comment #5 from Yuri Rumyantsev ---
I did a deeper investigation of the 435.gromacs miscomparison and found that:
1. It is caused by precision loss, i.e. this is not a bug in the split-paths phase.
2. It is caused by fmadd-sub instructions only (reproduced on avx2 with fma support), i.e. with the -fno-fma option the benchmark passes.
3. I found the first guilty routine for which split-paths leads to the miscomparison: fsettle, which is an ordinary fp routine with a big exit bb which is replicated.
I assume that a restriction on the size of the exit bb to be duplicated must be introduced to avoid useless code size growth. So you can close this bug after adding the corresponding parameter limit.
[Bug rtl-optimization/67145] [6 Regression] associativity from pseudo-reg ordering
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67145 --- Comment #3 from Yuri Rumyantsev --- Created attachment 37120 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=37120&action=edit non-tested patch
[Bug rtl-optimization/67145] [6 Regression] associativity from pseudo-reg ordering
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67145 --- Comment #4 from Yuri Rumyantsev --- I attached a simple untested patch which restores performance on x86. This change is not perfect, but with it I noticed a 2%-6% speed-up on the 32-bit x86 platform. The idea of the patch is very simple: we do not bail out if nothing changed, but re-materialize all PLUS rtx instructions with a register operand. This is important since the order of the operands in ops is different, i.e. if we have x + y + z on function entry, ops is {x,z,y} if REG(x) < REG(z) < REG(y).
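The operand ordering described above can be illustrated with a minimal sketch. It is not GCC code: `struct operand_model`, `cmp_regno` and `order_by_regno` are hypothetical names, and the sketch only assumes that operands end up sorted by ascending pseudo-register number.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a register operand: a source-level name
   plus the pseudo-register number it was assigned.  */
struct operand_model
{
  char name;
  int regno;
};

/* qsort comparator: ascending pseudo-register number.  */
static int
cmp_regno (const void *a, const void *b)
{
  const struct operand_model *pa = a;
  const struct operand_model *pb = b;
  return (pa->regno > pb->regno) - (pa->regno < pb->regno);
}

/* Reorder the operands of a sum by register number, which is the
   ordering assumed in this sketch.  */
void
order_by_regno (struct operand_model *ops, size_t n)
{
  qsort (ops, n, sizeof *ops, cmp_regno);
}
```

With x, y, z given register numbers such that REG(x) < REG(z) < REG(y), sorting the source order {x, y, z} yields {x, z, y}, so a sum rebuilt from the sorted list associates differently from the one written in the source.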
[Bug rtl-optimization/69052] New: [6 Regression] Performance regression after r229402.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69052
Bug ID: 69052
Summary: [6 Regression] Performance regression after r229402.
Product: gcc
Version: 6.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: ysrumyan at gmail dot com
Target Milestone: ---
In the loop_invariant phase an additional function, inv_can_prop_to_addr_use, was added. It tries to determine whether forward propagation of a cheap address is possible by calling verify_changes, which is very weak in comparison with the combine phase. For example, for the attached test-case it tries
(gdb) call debug_rtx(def_insn)
(insn 69 67 70 9 (set (reg/f:SI 149)
        (plus:SI (reg:SI 87)
            (const:SI (unspec:SI [
                        (symbol_ref:SI ("ind") [flags 0x2] )
                    ] UNSPEC_GOTOFF t1.c:40 212 {*leasi}
     (expr_list:REG_DEAD (reg:SI 87)
        (nil)))
(gdb) call debug_rtx(use_insn)
(insn 70 69 71 9 (set (reg:SI 150)
        (mem/u:SI (plus:SI (mult:SI (reg/v:SI 90 [ k ])
                    (const_int 4 [0x4]))
                (reg/f:SI 149)) [1 ind S4 A32])) t1.c:40 86 {*movsi_internal}
     (expr_list:REG_DEAD (reg/f:SI 149)
        (nil)))
and determines that propagation is not possible:
(gdb) p ok
$1 = false
but combine can do such a substitution. This leads to undesired code motion and performance loss for the stmt out[ind[k]] = result:
before r229402
	movl	ind@GOTOFF(%ebx,%esi,4), %eax
	movl	12(%esp), %edi
	movl	%ebp, (%edi,%eax,4)
after r229402
	movl	28(%esp), %eax
	movl	24(%esp), %ebx
	movl	(%eax,%esi,4), %eax
	movl	%edi, (%ebx,%eax,4)
A redundant fill has been generated by LRA. Since emulating the combine phase is not so simple, I assume that an additional hook should be added to turn off such a transformation for x86 in PIE mode.
[Bug rtl-optimization/69052] [6 Regression] Performance regression after r229402.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69052 --- Comment #1 from Yuri Rumyantsev --- Created attachment 37133 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=37133&action=edit test-case to reproduce It should be compiled with -O2 -m32 options to reproduce.
[Bug middle-end/67438] [6 Regression] ~X op ~Y pattern relocation causes loop performance degradation on 32bit x86
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67438 --- Comment #11 from Yuri Rumyantsev ---
In fact, the problem is quite different, although it is caused by the non-profitable pattern matching ~X CMP ~Y -> Y CMP X. In general this pattern may be helpful if we can delete the not operation, e.g.
  x1 = ~x;
  y1 = ~y;
  if (x1 CMP y1) ...
and there are no other uses of x1 and y1, i.e. x1 and y1 each have a single use. But if this is not true, we will increase register pressure since we cannot use the same register for x,x1 and y,y1.
Richard proposed to use the same simplification for min/max operations, but in the original test-case the nested min/max operation (min(x,min(y,z))) or multi-operand min/max (min(x,y,z)) is not recognized by gcc (note that icc does such a transformation), and so this won't help since we have the same register pressure issue:
  c = ~r;
  m = ~g;
  y = ~b;
  k = min(c, m, y);
  *out++ = c - k;
  *out++ = m - k;
  *out++ = y - k;
  *out++ = k;
We can see that the value of 'c' is used both in the min computation and in the resulting store, so if we use a comparison on r and g we will increase the live range of the r, g, b variables, and additional registers will be required for them (until the comparison).
Note also that there exists another issue with path-splitting (aka tail duplication), which duplicates the loop back edge and in fact moves the tail block into the hammock. This transformation does not look useful (at least at the given stage of design), but this is another topic for discussion.
I'd like to propose introducing a new predicate for pattern matching which tells us how many uses the left-hand side of ~x has.
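The kernel discussed above can be written out as a small compilable sketch. The names `min3` and `rgb_to_cmyk` and the `unsigned char` pixel type are assumptions made for illustration; they are not taken from the benchmark or the attached test-case.

```c
#include <assert.h>

/* Hypothetical helper: a three-operand minimum written as nested
   two-operand minimums, i.e. min(x, min(y, z)).  */
unsigned char
min3 (unsigned char x, unsigned char y, unsigned char z)
{
  unsigned char t = y < z ? y : z;
  return x < t ? x : t;
}

/* Sketch of the kernel: c, m and y are each used twice, once in the
   minimum and once in a store.  Rewriting the comparisons on the
   inverted values into comparisons on r, g, b therefore keeps both
   the original and the inverted values live.  */
void
rgb_to_cmyk (unsigned char r, unsigned char g, unsigned char b,
             unsigned char *out)
{
  unsigned char c = (unsigned char) ~r;
  unsigned char m = (unsigned char) ~g;
  unsigned char y = (unsigned char) ~b;
  unsigned char k = min3 (c, m, y);

  *out++ = (unsigned char) (c - k);
  *out++ = (unsigned char) (m - k);
  *out++ = (unsigned char) (y - k);
  *out++ = k;
}
```

Because the inverted values feed the stores as well as the comparisons, the single-use condition mentioned above does not hold here, which is the case the proposed predicate is meant to detect.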
[Bug middle-end/68542] [6 Regression] 10% 481.wrf performance regression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68542 --- Comment #3 from Yuri Rumyantsev --- I enhanced the patch that moves masked stores under a zero-mask guard so that it also moves all possible producers of the stored value, and the performance degradation disappeared. The patch will be redesigned and sent for review next week.
[Bug rtl-optimization/68435] [6 Regression] Missed if-conversion optimization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68435 --- Comment #6 from Yuri Rumyantsev --- It turned out that fresh gcc performs tail duplication (aka path splitting) preventing if-conversion. So I post a dump for 20150929 compiler which reproduces the issue.
[Bug rtl-optimization/68435] [6 Regression] Missed if-conversion optimization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68435 --- Comment #7 from Yuri Rumyantsev --- Created attachment 36780 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36780&action=edit rtl-ce1 dump file The dump is for 20150929 compiler
[Bug rtl-optimization/68435] [6 Regression] Missed if-conversion optimization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68435 --- Comment #4 from Yuri Rumyantsev --- Created attachment 36774 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36774&action=edit tar file tar file contains good and bad ce1-rtl dumps showing the problem
[Bug rtl-optimization/68435] [6 Regression] Missed if-conversion optimization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68435 --- Comment #2 from Yuri Rumyantsev --- I will post 2 rtl dumps for the ce1 phase produced with -O2 -m32 options on ix86. You can see that file t21.c.203r.ce1 produced by the 20110927 compiler contains
  3 possible IF blocks searched.
  1 IF blocks converted.
  2 true changes made.
but file t21.c.209r.ce1 produced by the 20151119 compiler does not:
  1 possible IF blocks searched.
  0 IF blocks converted.
  0 true changes made.
[Bug tree-optimization/70729] Loop marked with omp simd pragma is not vectorized
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70729 --- Comment #23 from Yuri Rumyantsev ---
OK. I will try to prepare the second part of the patch.
A few comments about the vect-simd-clone-5.c test failure.
1. This loop is marked with safelen=MAX_INT.
2. It contains the following stmts:
  D.3301 = foo.simdclone.1 (vect_vec_iv_.25_12, 123, _17);
  # VUSE <.MEM_39>
  _22 = MEM[(vector(2) long long int[2] *)];
  # VUSE <.MEM_39>
  _23 = MEM[(vector(2) long long int[2] *) + 16B];
  # .MEM_40 = VDEF <.MEM_39>
  D.3301 ={v} {CLOBBER};
  vect__3.28_24 = VEC_PACK_TRUNC_EXPR <_22, _23>;
and function ref_indep_loop_p_1 checks that references
  MEM[(vector(2) long long int[2] *)]
and
  MEM[(vector(2) long long int[2] *) + 16B]
are independent.
We can avoid such bad behavior of the safelen check in two ways: (1) add a restriction that the loop does not contain non-analyzed references; (2) add an additional check that the reference does not have operands defined inside the loop (D.3301 in our case).
Which approach do you prefer?
[Bug rtl-optimization/71453] Spills to vector registers are sub-optimal.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71453 --- Comment #2 from Yuri Rumyantsev --- Forgot to mention that the number of instructions is about 10% higher (632 vs 702) for spills into vector registers.
[Bug rtl-optimization/71453] New: Spills to vector registers are sub-optimal.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71453
Bug ID: 71453
Summary: Spills to vector registers are sub-optimal.
Product: gcc
Version: 7.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: ysrumyan at gmail dot com
Target Milestone: ---
We noticed a significant performance regression on one important benchmark after r235523. Note that the fix is not responsible for it. The problem is related to spill/fill to/from vector registers (aka xmm registers). For example, for the attached test-case we can see a number of redundant "vector register spills" and movements between them:
	vmovd	%ecx, %xmm5
	vmovd	%xmm5, %ecx
	vmovd	%xmm5, 40(%esp)  !! It will be more profitable to save %ecx on stack.
	vmovdqa	%xmm3, %xmm5     !! this is completely redundant.
	...
There is also another issue with spills to vector registers: we must estimate the profitability of such a spill in comparison with a spill on stack. For example, such a spill may not be profitable if a fill to a register is not required:
	movl	%eax, 44(%esp)   !! spill
	...
	andl	44(%esp), %eax   !! fill is not required.
[Bug rtl-optimization/71453] Spills to vector registers are sub-optimal.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71453 --- Comment #1 from Yuri Rumyantsev --- Created attachment 38659 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=38659&action=edit test-case to reproduce Must be compiled with -O2 -march=core-avx2 -m32 options.
[Bug tree-optimization/71437] [7 regression] Performance regression after r235817
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71437 --- Comment #1 from Yuri Rumyantsev --- Created attachment 38652 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=38652&action=edit test-case to reproduce Needs to be compiled with -O3 -m32 -ffast-math on x86-64.
[Bug tree-optimization/71437] New: [7 regression] Performance regression after r235817
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71437
Bug ID: 71437
Summary: [7 regression] Performance regression after r235817
Product: gcc
Version: 7.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: ysrumyan at gmail dot com
Target Milestone: ---
We noticed a ~10% slowdown on one important benchmark used for Silvermont testing. I can reproduce this performance gap using the attached test-case on SandyBridge:
before r235817
  time ./good.exe
  W[100]=10
  real 0m0.761s
r235817
  W[100]=10
  real 0m0.863s
There exists another optimization opportunity, which can be illustrated by the following test fragment:
  if( i == ( I - 1 ) ) L = pL[i] ;
  LD = (float)( L - pL[i] ) / (float)( pL[i + 1] - pL[i] ) ;
It is clear that the LD value is 0 if L == pL[i], i.e. we can move the second statement inside the hammock and perform simplification.
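The simplification suggested above can be sketched in C as follows. The function names and the `int` element type are assumptions for illustration, since the fragment does not show its declarations. The sketch also assumes that `pL[i + 1] != pL[i]`, so that the quotient is a plain zero rather than a NaN; under -ffast-math the compiler is allowed to assume no NaNs.

```c
#include <assert.h>

/* Original shape: the division is executed on every path, even when
   the numerator is known to be zero.  */
float
ld_original (const int *pL, int i, int I, int L)
{
  if (i == I - 1)
    L = pL[i];
  return (float) (L - pL[i]) / (float) (pL[i + 1] - pL[i]);
}

/* Shape after moving the statement into the hammock: on the path
   where L was just set to pL[i] the result folds to zero, so the
   division and the pL[i + 1] load are skipped there.  */
float
ld_simplified (const int *pL, int i, int I, int L)
{
  if (i == I - 1)
    return 0.0f;
  return (float) (L - pL[i]) / (float) (pL[i + 1] - pL[i]);
}
```

Both functions return the same value whenever the denominator is non-zero; the second one only makes the zero result on the taken branch explicit.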
[Bug tree-optimization/70729] Loop marked with omp simd pragma is not vectorized
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70729 --- Comment #21 from Yuri Rumyantsev --- Richard! Are you planning to prepare the second part of the patch (zeroing safelen and testing it in loop invariant motion phase as you proposed)? Thanks.
[Bug tree-optimization/70729] Loop marked with omp simd pragma is not vectorized
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70729 --- Comment #25 from Yuri Rumyantsev ---
Richard! I prepared the second part of patch and checked that it does not produce any new failures. What is your opinion - could I send it to GCC community for review?
ChangeLog:
2016-06-10  Yuri Rumyantsev  <ysrum...@gmail.com>
	PR tree-optimization/70729
	* tree-ssa-loop-im.c (gather_mem_refs_stmt): Mark loop as having
	unanalyzed memory references.
	(ref_indep_loop_p_1): Consider memory reference as independent in
	loops having positive safelen value and not having unanalyzed
	memory references.
	(tree_ssa_lim_finalize): Clear-up aux field of loops.
	* tree-vect-loop.c (vect_transform_loop): Clear-up safelen value
	since it may be not valid after vectorization.
gcc/testsuite/ChangeLog
	* g++.dg/vect/pr70729.cc: New test.
2016-06-09 9:42 GMT+03:00 rguenther at suse dot de <gcc-bugzi...@gcc.gnu.org>:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70729
>
> --- Comment #24 from rguenther at suse dot de ---
> On Wed, 8 Jun 2016, ysrumyan at gmail dot com wrote:
>
>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70729
>>
>> --- Comment #23 from Yuri Rumyantsev ---
>> OK. I will try to prepare the second part of patch.
>> Few comments about vect-simd-clone-5.c test failure.
>> 1. This loop is marked with safelen=MAX_INT.
>> 2. It contains the following stmt's:
>>   D.3301 = foo.simdclone.1 (vect_vec_iv_.25_12, 123, _17);
>>   # VUSE <.MEM_39>
>>   _22 = MEM[(vector(2) long long int[2] *)];
>>   # VUSE <.MEM_39>
>>   _23 = MEM[(vector(2) long long int[2] *) + 16B];
>>   # .MEM_40 = VDEF <.MEM_39>
>>   D.3301 ={v} {CLOBBER};
>>   vect__3.28_24 = VEC_PACK_TRUNC_EXPR <_22, _23>;
>> and fuction ref_indep_Loop_p_1 checks that references
>>   MEM[(vector(2) long long int[2] *)]
>> and
>>   MEM[(vector(2) long long int[2] *) + 16B]
>> are independent.
>> We can avoid such bad behavior of safelen-check (1) put restriction that loop
>> does not contain non-analyzed references; (2) add additional check that
>> reference does not have operands defined inside loop (D.3301 in our case).
>>
>> What approach is more profitable for you?
>
> I think that we cannot use safelen() to disregard dependences
> against "non-analyzed" references.  This is because of exactly
> such case.  In future we might want to make less references
> "non-analyzed" and use the general alias oracle on them
> (the LIM dependence analysis predates that).
>
> So - simply put the safelen() check after the check for non-analyzed
> reference in the disambiguator.
[Bug rtl-optimization/71275] [7 regression] Performance drop after r235660 on x86-64 in 32-bit mode.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71275 --- Comment #1 from Yuri Rumyantsev --- Created attachment 38564 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=38564&action=edit test-case to reproduce Must be compiled with -O2 -m32 -march=slm options.
[Bug tree-optimization/71347] [7 regression] Performance drop after r235513 on x86-64 in 32-bit mode.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71347 --- Comment #1 from Yuri Rumyantsev --- Created attachment 38600 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=38600&action=edit test-case to reproduce Needs to be compiled with -O2 -m32 -march=slm -ffast-math options on x86-64.
[Bug tree-optimization/71347] New: [7 regression] Performance drop after r235513 on x86-64 in 32-bit mode.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71347
Bug ID: 71347
Summary: [7 regression] Performance drop after r235513 on x86-64 in 32-bit mode.
Product: gcc
Version: 7.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: ysrumyan at gmail dot com
Target Milestone: ---
We noticed a significant regression (more than 10%) after this revision, which can be illustrated by the following simple test-case (attached): one additional instruction in the innermost loop (scalar replacement is not recognized):
    before r235513                     r235513
.L6:
    movsd  X.1861+8, %xmm2             movsd  -8(%eax), %xmm2
    addl   $8, %eax                    movsd  X.1861+8, %xmm1
.L3:
    mulsd  %xmm2, %xmm0                mulsd  %xmm1, %xmm2
    cmpl   $X.1861+64, %eax            addl   $8, %eax
    movsd  %xmm0, (%eax)               movsd  %xmm2, -8(%eax)
    jne    .L6                         cmpl   $X.1861+72, %eax
                                       jne    .L6
[Bug tree-optimization/69297] [6 Regression] Performance regression after r230020
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69297 --- Comment #1 from Yuri Rumyantsev --- Created attachment 37356 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=37356&action=edit test-case to reproduce To reproduce, compile with -Ofast -march=core-avx2 options.
[Bug tree-optimization/69297] New: [6 Regression] Performance regression after r230020
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69297
Bug ID: 69297
Summary: [6 Regression] Performance regression after r230020
Product: gcc
Version: 6.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: ysrumyan at gmail dot com
Target Milestone: ---
This regression was found on spec2006/464.h264ref. The problem is related to SLP vectorization of BBs and is caused by the wrong calculation of the scalar cost, e.g. for the attached test-case:
Cost model analysis:
  Vector inside of basic block cost: 188
  Vector prologue cost: 0
  Vector epilogue cost: 0
  Scalar cost of basic block: 512
although the basic block contains only 96 statements. I found out that vect_bb_slp_scalar_cost takes the same stmt into account several times, which results in non-profitable SLP vectorization.
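The effect of counting the same statement several times can be shown with a toy model. This is not the code of vect_bb_slp_scalar_cost: `struct slp_node_model`, `scalar_cost_naive` and `scalar_cost_dedup` are hypothetical names, and the model only assumes that several SLP nodes may refer to the same scalar statement.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of an SLP node: the ids of the scalar
   statements it covers.  */
struct slp_node_model
{
  const int *stmts;
  size_t n;
};

/* Sums the cost of every statement reference, so a statement shared
   by several nodes is counted once per reference.  */
unsigned
scalar_cost_naive (const struct slp_node_model *nodes, size_t n_nodes,
                   unsigned cost_per_stmt)
{
  unsigned cost = 0;
  for (size_t i = 0; i < n_nodes; i++)
    cost += (unsigned) nodes[i].n * cost_per_stmt;
  return cost;
}

/* Same walk, but each statement id is charged only once.  VISITED
   must have one zero-initialized entry per statement id.  */
unsigned
scalar_cost_dedup (const struct slp_node_model *nodes, size_t n_nodes,
                   unsigned cost_per_stmt, bool *visited)
{
  unsigned cost = 0;
  for (size_t i = 0; i < n_nodes; i++)
    for (size_t j = 0; j < nodes[i].n; j++)
      {
        int id = nodes[i].stmts[j];
        if (visited[id])
          continue;
        visited[id] = true;
        cost += cost_per_stmt;
      }
  return cost;
}
```

An inflated scalar cost makes the vector version look cheaper than it is, which is how a block with 96 statements could be reported with a scalar cost of 512.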
[Bug rtl-optimization/67145] [6 Regression] associativity from pseudo-reg ordering
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67145 --- Comment #6 from Yuri Rumyantsev --- We checked that the proposed patch does not introduce new performance regressions. I will prepare it for review once bootstrapping and regression testing complete, likely tomorrow.
[Bug rtl-optimization/69274] New: [6 Regression] Performance regression after r231814 on x86 Haswell.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69274
Bug ID: 69274
Summary: [6 Regression] Performance regression after r231814 on x86 Haswell.
Product: gcc
Version: 6.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: ysrumyan at gmail dot com
Target Milestone: ---
After this simple fix we got a huge regression (> 16%) for spec2006/435.gromacs on Haswell with "-O2 -ffast-math" options. Preliminary investigation has shown that the size of the hottest loop in the benchmark (fsettle) became 10 instructions shorter (less spill/fill), but performance regressed significantly. Note that adding the first scheduler with "-fschedule-insns --param sched-pressure-algorithm=2 -fsched-pressure" gave us a +24% speed-up (but only for this particular benchmark).
[Bug tree-optimization/69297] [6 Regression] Performance regression after r230020
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69297 --- Comment #4 from Yuri Rumyantsev ---
Yes, this loop was added for avoiding dce phase.
Thanks.
Yuri.
2016-01-18 13:33 GMT+03:00 rguenth at gcc dot gnu.org:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69297
>
> --- Comment #3 from Richard Biener ---
> With a fix:
>
> t.c:76:10: note: Cost model analysis:
>   Vector inside of basic block cost: 376
>   Vector prologue cost: 0
>   Vector epilogue cost: 0
>   Scalar cost of basic block: 96
> t.c:76:10: note: not vectorized: vectorization is not profitable.
>
> Note the reduction loop is still vectorized:
>
> t.c:74:5: note: Cost model analysis:
>   Vector inside of loop cost: 3
>   Vector prologue cost: 1
>   Vector epilogue cost: 7
>   Scalar iteration cost: 3
>   Scalar outside cost: 0
>   Vector outside cost: 8
>   prologue iterations: 0
>   epilogue iterations: 0
>   Calculated minimum iters for profitability: 4
>
> but likely this isn't profitable either?
[Bug rtl-optimization/69052] [6 Regression] Performance regression after r229402.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69052 --- Comment #13 from Yuri Rumyantsev ---
I checked that performance is back for the whole benchmark.
Thanks a lot.
Yuri.
2016-02-09 14:17 GMT+03:00 amker at gcc dot gnu.org:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69052
>
> --- Comment #12 from amker at gcc dot gnu.org ---
> Patch sent for review at
> https://gcc.gnu.org/ml/gcc-patches/2016-02/msg00612.html
> It works for the reduced test case, could you please help me to check if it
> works for you original case?
> Thanks,
> bin
[Bug tree-optimization/69652] [6 Regression] [ICE] verify_ssa fail w/ -O2 -ffast-math -ftree-vectorize
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69652 --- Comment #4 from Yuri Rumyantsev ---
Jakub,
Thanks a lot for your detailed comments!
I've just sent a patch for review to gcc-patches. Could you please take a look at it?
Best regards.
Yuri.
2016-02-03 20:22 GMT+03:00 jakub at gcc dot gnu.org:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69652
>
> Jakub Jelinek changed:
>
>            What    |Removed                     |Added
>                  CC|jakub at redhat dot com      |
>
> --- Comment #3 from Jakub Jelinek ---
> Clearly a bug in optimize_mask_stores.
> At the start of that function we have:
> ...
>   mask__46.14_129 = vect__14.9_121 != vect__21.12_127;
>   _46 = _14 != _21;
>   mask__ifc__47.15_130 = mask__46.14_129;
>   _ifc__47 = _46;
>   MASK_STORE (vectp.16_132, 8B, mask__ifc__47.15_130, vect__22.13_128);
>   vect__24.20_140 = MEM[(double *)vectp.18_138];
>   _24 = *_13;
>   vect__25.21_141 = vect__21.12_127 + vect__24.20_140;
>   _25 = _21 + _24;
>   MASK_STORE (vectp.22_145, 8B, mask__ifc__47.15_130, vect__25.21_141);
>   k_27 = k_28 + 1;
> ...
> Now, the MASK_STORE calls are processed from last to first, which is fine, we
> first move the second MASK_STORE and the vector stmts that feed it:
>   vect__24.20_140 = MEM[(double *)vectp.18_138];
>   vect__25.21_141 = vect__21.12_127 + vect__24.20_140;
>   MASK_STORE (vectp.22_145, 8B, mask__ifc__47.15_130, vect__25.21_141);
> but then continue trying to move the second MASK_STORE into the same
> conditional block, because it has the same mask.  In this case it is wrong,
> because there is the scalar load in between (_24 = *_13) that just waits for
> DCE, but generally there could be arbitrary code.
>   /* Put other masked stores with the same mask to STORE_BB.  */
>   if (worklist.is_empty ()
>       || gimple_call_arg (worklist.last (), 2) != mask
>       || worklist.last () != stmt1)
>     break;
> has a simplistic check (doesn't consider other MASK_STORE unless the walking
> walked up to that stmt), but of course it doesn't work too well if some scalar
> stmts were skipped.
>
> I see various issues in that function:
> 1) wrong formatting:
>       gsi_to = gsi_start_bb (store_bb);
>       if (dump_enabled_p ())
>         {
>           dump_printf_loc (MSG_NOTE, vect_location,
>                            "Move stmt to created bb\n");
>           dump_gimple_stmt (MSG_NOTE, TDF_SLIM, last, 0);
>         }
>         /* Move all stored value producers if possible.  */
>         while (!gsi_end_p (gsi))
>           {
> The Move all stored value and everything below up to corresponding closing }
> should be moved two columns to the left
> 2) IMHO stmt1 should be set to NULL before that while (!gsi_end_p (gsi)),
> as the function is prepared to handle multiple bbs
> 3) next to gimple_vdef non-NULL break IMHO should be also
> gimple_has_volatile_ops -> break check, just for safety, we don't want to
> mishandle say volatile reads etc.
> 4) you have to skip over debug stmts if there are any, otherwise we have a
> -fcompare-debug issue
> 5) IMHO you should give up also for !is_gimple_assign, say trying to move an
> elemental function call into the conditional is just wrong
> 6) the
>   /* Skip scalar statements.  */
>   if (!VECTOR_TYPE_P (TREE_TYPE (lhs)))
>     continue;
> should be reconsidered.  IMHO if you have scalar stmts that feed just the
> stmts in the STORE_BB, there is no reason not to move them too, if you have
> scalar stmts that feed other stmts too, IMHO you should give up on them if
> they have a vuse; what code did you have in mind when adding the
> !VECTOR_TYPE_P check?
> 7) FOR_EACH_IMM_USE_FAST loop should ignore debug stmts, at least for
> decisions when to stop in some stmt; but the debug stmts if there are any
> need to be reset if we move the def stmt that they are using, otherwise we
> risk -fcompare-debug issues
> 8) the worklist.last () != stmt1 check need to be -fcompare-debug friendly
> too, so if there are debug stmts in between the last moved stmt and the
> previous MASK_STORE, we need to handle it as if there aren't any debug stmts
> in between
[Bug tree-optimization/69783] New: [6 Regression] Loop is not vectorized after r233212
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69783
Bug ID: 69783
Summary: [6 Regression] Loop is not vectorized after r233212
Product: gcc
Version: 6.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: ysrumyan at gmail dot com
Target Milestone: ---
After the changes in vect_prune_runtime_alias_test_list() the number of merged ranges decreased significantly:
Before fix
  improved number of alias checks from 50 to 3
After fix
  improved number of alias checks from 50 to 22
and the loop is not vectorized since the number of versioning-for-alias run-time tests exceeds 10.
[Bug tree-optimization/69783] [6 Regression] Loop is not vectorized after r233212
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69783 --- Comment #1 from Yuri Rumyantsev --- Created attachment 37671 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=37671&action=edit test-case to reproduce It needs to be compiled with -Ofast -funroll-loops on x86-64
[Bug tree-optimization/69652] [6 Regression] [ICE] verify_ssa fail w/ -O2 -ffast-math -ftree-vectorize
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69652 --- Comment #5 from Yuri Rumyantsev ---
Jakub,
I'd like to clarify one of your remarks:
  5) IMHO you should give up also for !is_gimple_assign, say trying to move an elemental function call into the conditional is just wrong
What's wrong with call motion? Note that masked stores and loads are also represented as calls. I assume that it is most likely simd clone function calls that must not be moved.
Thanks in advance.
Yuri.
P.S. It means that my patch is not correct and should be fixed.
>> 7) FOR_EACH_IMM_USE_FAST loop should ignore debug stmts, at least for >> decisions >> when to stop in some stmt; bet the debug stmts if there are any need to be >> reset >> if we move the def stmt that they are using, otherwise we risk >> -fcompare-debug >> issues >> 8) the worklist.last () != stmt1 check need to be -fcompare-debug friendly >> too, >> so if there are debug stmts in between the last moved stmt and the previous >> MASK_STORE, we need to handle it as if there aren't any debug stmts in >> between >> >> -- >> You are receiving
[Bug rtl-optimization/69633] [6 Regression] Redundant move is generated after r228097
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69633 --- Comment #1 from Yuri Rumyantsev --- Created attachment 37559 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=37559&action=edit test-case to reproduce The test case needs to be compiled with -O2 -m32 -pie -fPIE. I assume that -march=slm is not needed.
[Bug rtl-optimization/69633] New: [6 Regression] Redundant move is generated after r228097
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69633
Bug ID: 69633
Summary: [6 Regression] Redundant move is generated after r228097
Product: gcc
Version: 6.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: ysrumyan at gmail dot com
Target Milestone: ---

Sorry that we noticed this regression only now and not in September. After Makarov's fix for 61578 (and s390 regression) we noticed that, for the attached simple test case extracted from a real benchmark, one more redundant move instruction is generated (up to the 20160202 compiler build):

before fix (postreload dump)
   86: NOTE_INSN_BASIC_BLOCK 4
   40: dx:QI=[si:SI]
   41: ax:QI=[si:SI+0x1]
   42: {si:SI=si:SI+0x3;clobber flags:CC;}
   43: dx:SI=zero_extend(dx:QI)
   44: ax:SI=zero_extend(ax:QI)
   45: cx:SI=zero_extend([si:SI-0x1])
   46: {di:SI=dx:SI*0x4c8b;clobber flags:CC;}
   47: {bx:SI=ax:SI*0x9646;clobber flags:CC;}
   48: {bx:SI=bx:SI+di:SI;clobber flags:CC;}
   49: {di:SI=cx:SI*0x1d2f;clobber flags:CC;}
   50: NOTE_INSN_DELETED
   51: bx:SI=bx:SI+di:SI+0x8000
   52: {bx:SI=bx:SI>>0x10;clobber flags:CC;}
   53: [bp:SI]=bx:QI
   96: bx:SI=dx:SI
   55: {bx:SI=bx:SI<<0xf;clobber flags:CC;}
   57: {bx:SI=bx:SI-dx:SI;clobber flags:CC;}

after fix
   86: NOTE_INSN_BASIC_BLOCK 4
   40: dx:QI=[si:SI]
   41: ax:QI=[si:SI+0x1]
   42: {si:SI=si:SI+0x3;clobber flags:CC;}
   43: dx:SI=zero_extend(dx:QI)
   44: ax:SI=zero_extend(ax:QI)
   45: cx:SI=zero_extend([si:SI-0x1])
   46: {di:SI=dx:SI*0x4c8b;clobber flags:CC;}
   47: {bx:SI=ax:SI*0x9646;clobber flags:CC;}
   48: {bx:SI=bx:SI+di:SI;clobber flags:CC;}
   49: {di:SI=cx:SI*0x1d2f;clobber flags:CC;}
   50: NOTE_INSN_DELETED
   51: bx:SI=bx:SI+di:SI+0x8000
   52: {bx:SI=bx:SI>>0x10;clobber flags:CC;}
   53: [bp:SI]=bx:QI
   96: bx:SI=dx:SI
   55: {bx:SI=bx:SI<<0xf;clobber flags:CC;}
   98: di:SI=bx:SI          !! redundant move
   57: {di:SI=di:SI-dx:SI;clobber flags:CC;}

As a result, we got a >3% slowdown on Silvermont in PIE and 32-bit mode.
[Bug tree-optimization/69652] [6 Regression] [ICE] verify_ssa fail w/ -O2 -ffast-math -ftree-vectorize
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69652 --- Comment #2 from Yuri Rumyantsev --- This is my fault: I forgot to fix the vuse for scalar statements that are crossed by masked stores during code motion. The fix is being tested and will be sent for review tomorrow.
[Bug rtl-optimization/69942] gcc.dg/ifcvt-5.c FAILs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69942 --- Comment #2 from Yuri Rumyantsev --- I attached a patch which resolves the failure.
[Bug rtl-optimization/69942] gcc.dg/ifcvt-5.c FAILs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69942 --- Comment #3 from Yuri Rumyantsev --- Created attachment 37822 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=37822&action=edit proposed patch Patch to resolve the ifcvt-5.c failure.
[Bug rtl-optimization/69942] gcc.dg/ifcvt-5.c FAILs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69942 --- Comment #1 from Yuri Rumyantsev ---
The cause of the issue is that, before the ce1 phase, a PDE (or PRE) transformation has been done to remove partially redundant moves to the variables i and j, i.e. the code

    int i = x;
    int j = y;
    if (x > y)
      {
        i = a;
        j = i;
      }

has been transformed to

    int i, j;
    if (x > y)
      {
        i = a;
        j = i;
      }
    else
      {
        i = x;
        j = y;
      }

The ifcvt phase then speculatively moves the else-part before the if-part, which restores the original code. This transformation is counted as a real change, so the test fails. I assume that the test must also accept '6 basic blocks,' in order to pass.
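As a rough illustration of why the two forms are interchangeable, the following C sketch places them side by side. The function names original_form and after_pre and the struct pair type are made up for this example; the bodies follow the snippets in the comment, with j = y assumed in the else-branch.

```c
/* Hypothetical helper type so both variants can return i and j. */
struct pair { int i; int j; };

/* Form in the test source: unconditional initialisation, then override. */
static struct pair original_form(int x, int y, int a)
{
    int i = x;
    int j = y;
    if (x > y) {
        i = a;
        j = i;
    }
    return (struct pair){ i, j };
}

/* Form after the partially redundant moves are removed:
   every path assigns i and j exactly once. */
static struct pair after_pre(int x, int y, int a)
{
    int i, j;
    if (x > y) {
        i = a;
        j = i;
    } else {
        i = x;
        j = y;
    }
    return (struct pair){ i, j };
}
```

Both functions return the same pair for any input, which is why if-conversion can legitimately turn the second form back into the first.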
[Bug tree-optimization/69467] New: [6 Regression] Pattern X * C1 CMP 0 to X CMP 0 causes performance drop on 32-bit x86.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69467
Bug ID: 69467
Summary: [6 Regression] Pattern X * C1 CMP 0 to X CMP 0 causes performance drop on 32-bit x86.
Product: gcc
Version: 6.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: ysrumyan at gmail dot com
Target Milestone: ---

This is caused by the same revision as 67438:
http://gcc.gnu.org/viewcvs/gcc?view=revision&revision=225248

The issue can be reproduced with the attached test case. After this transformation is applied to the loop upper bound:

    for ( count = ((*(ptr)) & 0xf) * 2; count > 0; count--, addr++ )

two redundant instructions are generated:

    after                           before
    movl    48(%esp), %ebx          movl    48(%esp), %ecx
    movzbl  (%ebx), %eax            movzbl  (%ecx), %edx
    andl    $15, %eax               andl    $15, %edx
    movzbl  %al, %ecx               addl    %edx, %edx
    addl    %ecx, %ecx
    testb   %al, %al
    je      .L12                    je      .L12

This can be significant if the loop has a low trip count.
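A minimal C sketch of the situation, assuming the loop guard is evaluated once before the loop: the function names guard_on_product and guard_on_operand are invented for this illustration and are not taken from the attached test case. Both forms give the same answer, but in the second one the masked value and its product must both be kept, which is where the extra instructions come from.

```c
/* Guard as written in the source: the comparison uses the product,
   which is also the loop counter. */
static int guard_on_product(const unsigned char *ptr, int *count)
{
    *count = (*ptr & 0xf) * 2;
    return *count > 0;
}

/* Guard after rewriting X * C1 CMP 0 into X CMP 0: the comparison uses X,
   but the product is still needed as the counter, so both stay live. */
static int guard_on_operand(const unsigned char *ptr, int *count)
{
    int x = *ptr & 0xf;
    *count = x * 2;
    return x > 0;
}
```

The rewrite only pays off when the multiplication has no other use, which is what a single-use check on the product would capture.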
[Bug tree-optimization/69467] [6 Regression] Pattern X * C1 CMP 0 to X CMP 0 causes performance drop on 32-bit x86.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69467 --- Comment #1 from Yuri Rumyantsev --- Created attachment 37462 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=37462&action=edit test-case to reproduce Compile with -m32 and either -O2 or -O3 -funroll-loops. The assembly cited in the description was produced with -O3 -funroll-loops.
[Bug tree-optimization/69467] [6 Regression] Pattern X * C1 CMP 0 to X CMP 0 causes performance drop on 32-bit x86.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69467 --- Comment #3 from Yuri Rumyantsev --- Richard, I checked that performance is back with your patch. Thanks. 2016-01-25 17:50 GMT+03:00 rguenth at gcc dot gnu.org: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69467 > > Richard Biener changed: > >What|Removed |Added > > Status|UNCONFIRMED |ASSIGNED >Last reconfirmed||2016-01-25 >Assignee|unassigned at gcc dot gnu.org |rguenth at gcc dot > gnu.org >Target Milestone|--- |6.0 > Ever confirmed|0 |1 > > --- Comment #2 from Richard Biener --- > To restore the state before the move from fold to match.pd we'd need to mark > any such pattern involving compares as the outermost expr (and thus match > on GIMPLE_CONDs) with an explicit && single_use () check. Fix for this one: > > Index: gcc/match.pd > === > --- gcc/match.pd(revision 232792) > +++ gcc/match.pd(working copy) > @@ -1821,12 +1821,13 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT) > (for cmp (simple_comparison) > scmp (swapped_simple_comparison) > (simplify > - (cmp (mult @0 INTEGER_CST@1) integer_zerop@2) > + (cmp (mult@3 @0 INTEGER_CST@1) integer_zerop@2) >/* Handle unfolded multiplication by zero. */ >(if (integer_zerop (@1)) > (cmp @1 @2) > (if (ANY_INTEGRAL_TYPE_P (TREE_TYPE (@0)) > - && TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (@0))) > + && TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (@0)) > + && single_use (@3)) > /* If @1 is negative we swap the sense of the comparison. */ > (if (tree_int_cst_sgn (@1) < 0) > (scmp @0 @2) > > -- > You are receiving this mail because: > You reported the bug.
[Bug rtl-optimization/69633] [6 Regression] Redundant move is generated after r228097
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69633 --- Comment #3 from Yuri Rumyantsev --- Sorry for the confusion. The bug should be closed as a user mistake. 2016-03-07 19:18 GMT+03:00 bernds at gcc dot gnu.org: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69633 > > Bernd Schmidt changed: > >What|Removed |Added > > CC||bernds at gcc dot gnu.org > > --- Comment #2 from Bernd Schmidt --- > Doesn't seem to happen over here. Can you still reproduce this with trunk? > Please post exact arguments to cc1 if it does. > > -- > You are receiving this mail because: > You reported the bug.
[Bug tree-optimization/66142] Loop is not vectorized because not sufficient support for GOMP_SIMD_LANE
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66142 --- Comment #26 from Yuri Rumyantsev ---
If we replace the structure copies with copies of the individual structure fields, the test is vectorized and all mentions of GOMP_SIMD_LANE are deleted. But if we slightly modify the test by introducing a new function vdot and inserting a call to it:

    b = r.x * ray->dir.x + r.y * ray->dir.y;
        |
        v
    b = vdot (r, ray->dir);

the test is not vectorized:

    test2.cpp:70:9: note: not vectorized: not suitable for scatter store D.6062[_9].org.x = 1.0e+0;

test2.cpp is attached.
[Bug tree-optimization/66142] Loop is not vectorized because not sufficient support for GOMP_SIMD_LANE
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66142 --- Comment #27 from Yuri Rumyantsev --- Created attachment 37940 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=37940&action=edit test-case to reproduce Needs to be compiled with -Ofast -mavx2 -fopenmp.
[Bug target/70482] Opimization opportunity to vectorize basic block for -mavx target.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70482 --- Comment #2 from Yuri Rumyantsev ---
Richard,

The problem is in pattern matching:

  /* Pattern detected.  */
  if (dump_enabled_p ())
    dump_printf_loc (MSG_NOTE, vect_location,
                     "vect_recog_widen_mult_pattern: detected:\n");
  /* Check target support  */
  vectype = get_vectype_for_scalar_type (half_type0);
  vecitype = get_vectype_for_scalar_type (itype);
  if (!vectype
      || !vecitype
      || !supportable_widening_operation (WIDEN_MULT_EXPR, last_stmt,
                                          vecitype, vectype,
                                          &dummy_code, &dummy_code,
                                          &dummy_int, &dummy_vec))
    return NULL;

We found the pattern, but it is not supported for the 256-bit vectype, so we need to retry with a 128-bit one.
[Bug tree-optimization/70482] New: Opimization opportunity to vectorize basic block for -mavx target.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70482
Bug ID: 70482
Summary: Opimization opportunity to vectorize basic block for -mavx target.
Product: gcc
Version: 6.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: ysrumyan at gmail dot com
Target Milestone: ---

If we compile bb-slp-pattern-1.c from the gcc.dg/vect suite with -mavx, pattern vectorization does not happen because AVX has very poor support for 256-bit integer arithmetic. In particular, the widen-mult pattern is recognized but is not supported for 256-bit vectors. The test fails for a native compiler build on an AVX machine. The simplest solution is to use the same scheme as loop vectorization and decrease the vector size from 256-bit to 128-bit.
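For readers unfamiliar with the pattern, here is a small C sketch of a widening multiplication of the kind the recognizer looks for. It is not the content of bb-slp-pattern-1.c; the function name widen_mult and the size N are assumptions for this example, and it is written as a loop only for brevity.

```c
#include <stdint.h>

#define N 8  /* illustrative element count, not taken from the test */

/* Two 16-bit inputs are multiplied into a 32-bit result. A vectorizer can
   map this onto a widening multiply if the target supports one for the
   chosen vector size. */
static void widen_mult(const uint16_t *x, const uint16_t *y, uint32_t *out)
{
    for (int i = 0; i < N; i++)
        out[i] = (uint32_t)x[i] * (uint32_t)y[i];
}
```

With eight 32-bit results the natural choice is one 256-bit vector, which is the case plain AVX cannot handle for integers; two 128-bit vectors would still work.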
[Bug rtl-optimization/70873] New: [GCC7 Regressio] 20% performance regression at 482.sphinx3 after r235442 with -O2 -m32 on Haswell.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70873
Bug ID: 70873
Summary: [GCC7 Regressio] 20% performance regression at 482.sphinx3 after r235442 with -O2 -m32 on Haswell.
Product: gcc
Version: 7.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: ysrumyan at gmail dot com
Target Milestone: ---

This degradation is caused by a known issue with partial register dependency:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57954
and can be reproduced with the attached simple test case:

before fix
    vxorpd      %xmm4, %xmm4, %xmm4
    vcvtss2sd   (%esi,%eax,4), %xmm4, %xmm4

after fix
    vxorpd      %xmm6, %xmm6, %xmm6
    vcvtss2sd   (%esi,%eax,4), %xmm6, %xmm7

I assume that register renaming must not split such a register live range but simply treat it as one.
[Bug rtl-optimization/70873] [GCC7 Regressio] 20% performance regression at 482.sphinx3 after r235442 with -O2 -m32 on Haswell.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70873 --- Comment #1 from Yuri Rumyantsev --- Created attachment 38375 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=38375&action=edit test-case to reproduce Must be compiled with -O2 -mavx2 -m32 options.
[Bug tree-optimization/70849] Loop can be vectorized through gathers on AVX2 platforms.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70849 --- Comment #1 from Yuri Rumyantsev --- Created attachment 38365 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=38365&action=edit test-case to reproduce Must be compiled with -O3 -mavx2 options.
[Bug tree-optimization/70849] New: Loop can be vectorized through gathers on AVX2 platforms.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70849
Bug ID: 70849
Summary: Loop can be vectorized through gathers on AVX2 platforms.
Product: gcc
Version: 7.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: ysrumyan at gmail dot com
Target Milestone: ---

The simple test which will be attached is not vectorized because it is considered not profitable:

test.c:11:5: note: cost model: the vector iteration cost = 2061 divided by the scalar iteration cost = 9 is greater or equal to the vectorization factor = 8.
test.c:11:5: note: not vectorized: vectorization not profitable.
test.c:11:5: note: not vectorized: vector version will never be profitable.

but it can be vectorized using gathers, as icc does:

LOOP BEGIN at test.c(11,5)
   remark #15388: vectorization support: reference c1[j] has aligned access [ test.c(12,7) ]
   remark #15388: vectorization support: reference c2[j] has aligned access [ test.c(13,7) ]
   remark #15388: vectorization support: reference c1[j] has aligned access [ test.c(12,7) ]
   remark #15388: vectorization support: reference c2[j] has aligned access [ test.c(13,7) ]
   remark #15415: vectorization support: gather was generated for the variable <f[j+base]>, strided by 256 [ test.c(12,16) ]
   remark #15415: vectorization support: gather was generated for the variable <f[j+base+1]>, strided by 256 [ test.c(13,16) ]
   remark #15415: vectorization support: gather was generated for the variable <f[j+base]>, strided by 256 [ test.c(12,16) ]
   remark #15415: vectorization support: gather was generated for the variable <f[j+base+1]>, strided by 256 [ test.c(13,16) ]
   remark #15305: vectorization support: vector length 8
   remark #15300: LOOP WAS VECTORIZED
   remark #15449: unmasked aligned unit stride stores: 4
   remark #15460: masked strided loads: 4
   remark #15475: --- begin vector loop cost summary ---
   remark #15476: scalar loop cost: 18
   remark #15477: vector loop cost: 12.000
   remark #15478: estimated potential speedup: 1.500
   remark #15488: --- end vector loop cost summary ---
LOOP END
[Bug tree-optimization/70729] Loop marked with omp simd pragma is not vectorized
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70729 --- Comment #11 from Yuri Rumyantsev ---
Richard,

I slightly modified the patch you proposed:
1. The loop->safelen check is applied only if lim is invoked before loop vectorization, since its value could be incorrect (I simply added a bool parameter for this).
2. The check is not applied if the loop contains unanalyzed memory references (e.g. calls, clobbers etc.).

With these changes all regressions related to omp simd support disappeared, and the following failures remain (because the order of transformations changed):

FAIL: gcc.dg/autopar/outer-6.c scan-tree-dump-times parloops2 "parallelizing inner loop" 0
FAIL: gcc.dg/pr41783.c scan-tree-dump pre "pretmp[^\\n]* = a_global_var;"
FAIL: gcc.dg/tree-ssa/loadpre10.c scan-tree-dump-times pre "Eliminated: 1" 1
FAIL: gcc.dg/tree-ssa/loadpre23.c scan-tree-dump-times pre "Eliminated: 1" 1
FAIL: gcc.dg/tree-ssa/loadpre24.c scan-tree-dump-times pre "Eliminated: 1" 1
FAIL: gcc.dg/tree-ssa/loadpre25.c scan-tree-dump-times pre "Eliminated: 1" 1
FAIL: gcc.dg/tree-ssa/loadpre4.c scan-tree-dump-times pre "Eliminated: 1" 1
FAIL: gcc.dg/tree-ssa/loadpre8.c scan-tree-dump-times pre "Eliminated: 1" 1
FAIL: gcc.dg/tree-ssa/ssa-pre-16.c scan-tree-dump-times pre "Eliminated: 1" 1
FAIL: gcc.dg/tree-ssa/ssa-pre-18.c scan-tree-dump pre "Replaced foo \\(f.y\\)"
FAIL: gcc.dg/tree-ssa/ssa-pre-20.c scan-tree-dump pre "New PHIs: 2"
FAIL: gcc.dg/tree-ssa/ssa-pre-3.c scan-tree-dump-times pre "Eliminated: 2" 1
FAIL: gfortran.dg/pr42108.f90 -O scan-tree-dump pre "in all uses of countm1[^\n]* / "
FAIL: gfortran.dg/vect/fast-math-vect-8.f90 -O scan-tree-dump-times vect "vectorized 1 loops" 1

What is your opinion?
[Bug tree-optimization/70729] Loop marked with omp simd pragma is not vectorized
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70729 --- Comment #12 from Yuri Rumyantsev --- Created attachment 38367 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=38367&action=edit modified patch
[Bug debug/70935] [6/7 Regression] ICE: verify_ssa failed (error: definition in block 9 does not dominate use in block 12) w/ -O3 -g
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70935 --- Comment #3 from Yuri Rumyantsev ---
Jakub,

Here is a simple fix: do not consider edges whose destination is the loop latch block, i.e. the loop is endless:

diff --git a/gcc/tree-ssa-loop-unswitch.c b/gcc/tree-ssa-loop-unswitch.c
index dd6fd01..7de5fba 100644
--- a/gcc/tree-ssa-loop-unswitch.c
+++ b/gcc/tree-ssa-loop-unswitch.c
@@ -532,6 +532,12 @@ find_loop_guard (struct loop *loop)
                 guard_edge->src->index, guard_edge->dest->index);
       return NULL;
     }
+  if (guard_edge->dest == loop->latch)
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+        fprintf(dump_file,"Guard edge destination is loop latch!\n");
+      return NULL;
+    }
   if (dump_file && (dump_flags & TDF_DETAILS))
     fprintf (dump_file,

Is it OK for you?
[Bug tree-optimization/70729] Loop marked with omp simd pragma is not vectorized
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70729 --- Comment #6 from Yuri Rumyantsev ---
Richard,

I made the change you proposed, but it still does not help since we have a loop-carried dependency through this_4(D)->S_n:

:
  _5 = this_4(D)->S_n;
  ...
:
  pretmp_54 = this_4(D)->C2;
  pretmp_57 = this_4(D)->C1;
  pretmp_60 = MEM[(int * *)this_4(D) + 56B];
  _20 = this_4(D)->S_n;
:   Loop header
  # i_33 = PHI <0(4), i_28(6)>
  # prephitmp_56 = PHI <_5(4), _20(6)>   Recurrent phi.
  ...

test.cpp:66:25: note: vect_is_simple_use: operand prephitmp_56
test.cpp:66:25: note: def_stmt: prephitmp_56 = PHI <_5(4), _20(6)>
test.cpp:66:25: note: type of def: unknown
test.cpp:66:25: note: Unsupported pattern.
test.cpp:66:25: note: not vectorized: unsupported use in stmt.
[Bug tree-optimization/70729] Loop marked with omp simd pragma is not vectorized
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70729 --- Comment #1 from Yuri Rumyantsev --- Created attachment 38309 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=38309&action=edit test-case to reproduce Must be compiled with -Ofast -mavx2 -fopenmp options on an x86 machine.
[Bug tree-optimization/70729] New: Loop marked with omp simd pragma is not vectorized
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70729
Bug ID: 70729
Summary: Loop marked with omp simd pragma is not vectorized
Product: gcc
Version: 7.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: ysrumyan at gmail dot com
Target Milestone: ---

While analyzing the performance of one important benchmark we found that one of the hot loops is not vectorized because a loop-invariant load of a class member has not been hoisted out of the loop, although the loop was marked with the omp simd pragma. A test case to reproduce is attached.
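To make the missed optimization concrete, here is a hedged C sketch, not the attached C++ test case. The struct filter type and both function names are invented for illustration. In the first form the member is re-read on every iteration unless the compiler can show that the stores to out[] do not modify *s; the second form does the hoisting by hand.

```c
/* Hypothetical stand-in for the class whose member is loaded in the loop. */
struct filter { int n; float c1; };

/* The member load s->c1 sits inside the loop. Without alias information
   (or a safelen guarantee from omp simd) the compiler has to assume that
   a store to out[i] may change s->c1. */
static void scale_reload(const struct filter *s, const float *in, float *out)
{
    for (int i = 0; i < s->n; i++)
        out[i] = in[i] * s->c1;
}

/* Same computation with the invariant loads hoisted manually. */
static void scale_hoisted(const struct filter *s, const float *in, float *out)
{
    const int n = s->n;
    const float c1 = s->c1;
    for (int i = 0; i < n; i++)
        out[i] = in[i] * c1;
}
```

When out does not overlap *s the two functions compute the same values; the report is about getting the compiler to make that step itself for loops marked omp simd.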
[Bug rtl-optimization/71275] New: [7 regression] Performance drop after r235660 on x86-64 in 32-bit mode.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71275
Bug ID: 71275
Summary: [7 regression] Performance drop after r235660 on x86-64 in 32-bit mode.
Product: gcc
Version: 7.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: ysrumyan at gmail dot com
Target Milestone: ---

The regression can be seen with the attached test case. In the tail block of the innermost loop a redundant fill was added:

        before r235660                  r235660
.L3:
        addl    $1, %esi                addl    $1, %esi
        addl    %eax, %ebx              addl    %eax, %ebx
        movw    %bp, (%edi,%ecx)        movl    44(%esp), %edx
        movswl  %si, %ebp               movswl  %si, %eax
        cmpl    (%esp), %ebp            cmpl    %edi, %eax
        jl      .L6                     movw    %bp, (%edx,%ecx)
                                        jl      .L6

As a result we got up to a 14% slowdown on one important benchmark. It is clear that it is not profitable to keep the loop upper bound in a register instead of the address base.
[Bug tree-optimization/72739] New: [7 Regression] FAIL: gcc.dg/vect/vect-mask-store-move-1.c after r238301
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=72739
Bug ID: 72739
Summary: [7 Regression] FAIL: gcc.dg/vect/vect-mask-store-move-1.c after r238301
Product: gcc
Version: 7.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: ysrumyan at gmail dot com
Target Milestone: ---

We noticed that after this revision the test fails:

FAIL: gcc.dg/vect/vect-mask-store-move-1.c scan-tree-dump-times vect "Move stmt to created bb" 4
FAIL: gcc.dg/vect/vect-mask-store-move-1.c -flto -ffat-lto-objects scan-tree-dump-times vect "Move stmt to created bb" 4

The problem is caused by complete deletion of the vectorized loop, which requires a run-time alias check. Note that GCC 6 does not have this issue.
[Bug rtl-optimization/71956] [7 Regression] 176.gcc fails on 32 bits when compiled with -march=core-avx2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71956
Yuri Rumyantsev changed:
What |Removed |Added
CC| |ysrumyan at gmail dot com

--- Comment #2 from Yuri Rumyantsev ---
Jakub,

I removed both your revisions in cse.c (c1) but it did not help: 176.gcc still gets RF on avx2 but not on avx. I assume that masked stores are responsible for it since we have them in the binaries:

.L2437:
        vmovd           %ecx, %xmm1
        vpxor           %xmm5, %xmm5, %xmm5
        addl            -40(%ebp), %eax
        movl            -28(%ebp), %edx
        vpbroadcastd    -36(%ebp), %ymm4
        vpaddd          .LC1, %ymm4, %ymm2
        vpbroadcastd    %xmm1, %ymm1
        leal            (%edx,%eax,4), %eax
        vpsrlvd         %ymm2, %ymm1, %ymm2
        vpaddd          %ymm7, %ymm4, %ymm3
        vpand           %ymm6, %ymm2, %ymm2
        vpcmpeqd        %ymm5, %ymm2, %ymm2
        vpcmpeqd        %ymm5, %ymm2, %ymm2
        vptest          %ymm2, %ymm2
        je              .L2446
        vpmaskmovd      %ymm0, %ymm2, (%eax)
.L2446:
        vpsrlvd         %ymm3, %ymm1, %ymm2
        vpxor           %xmm3, %xmm3, %xmm3
        leal            32(%eax), %edx
        vpaddd          .LC3, %ymm4, %ymm4
        vpand           %ymm6, %ymm2, %ymm2
        vpcmpeqd        %ymm3, %ymm2, %ymm2
        vpcmpeqd        %ymm3, %ymm2, %ymm2
        vptest          %ymm2, %ymm2
        je              .L2447
        vpmaskmovd      %ymm0, %ymm2, (%edx)

I will try to determine the revision actually responsible for it.
[Bug rtl-optimization/70467] Useless "and [esp],-1" emitted on AND with uint64_t variable
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70467 Yuri Rumyantsev changed: What|Removed |Added CC||ysrumyan at gmail dot com --- Comment #13 from Yuri Rumyantsev --- The fix r235764 introduced the regression described in PR71956.
[Bug c/72794] New: [7 regression'] CF on spec2000/176.gcc after r238862.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=72794
Bug ID: 72794
Summary: [7 regression'] CF on spec2000/176.gcc after r238862.
Product: gcc
Version: 7.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: ysrumyan at gmail dot com
Target Milestone: ---

We noticed that after this commit the benchmark fails to build with the message:

/tmp/cchqWD0Q.ltrans0.ltrans.o: In function `yylex':
:(.text+0x566e): undefined reference to `is_reserved_word'
/tmp/cchqWD0Q.ltrans8.ltrans.o: In function `compile_file':
:(.text+0xb1fe): undefined reference to `is_reserved_word'
:(.text+0xb22b): undefined reference to `is_reserved_word'
:(.text+0xb248): undefined reference to `is_reserved_word'
:(.text+0xb265): undefined reference to `is_reserved_word'

i.e. the function is_reserved_word, declared with the "inline" attribute, was deleted but its calls were not inlined. To reproduce, the benchmark must be compiled with -Ofast -funroll-loops -flto -static -DSPEC_CPU2000_LP64. I did not prepare a reduced test case since I assume that the spec2000 suite is available.
[Bug c/72794] [7 regression'] CF on spec2000/176.gcc after r238862.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=72794 --- Comment #2 from Yuri Rumyantsev --- Yes, this option cures CF. Does it mean that we must compile spec2000 with this flag? 2016-08-03 19:08 GMT+03:00 pinskia at gcc dot gnu.org: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=72794 > > --- Comment #1 from Andrew Pinski --- > Can you try with -std=gnu90 and see if that fixes the issue. > > -- > You are receiving this mail because: > You reported the bug.
[Bug c/72794] [7 regression] CF on spec2000/176.gcc after r238862.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=72794 --- Comment #6 from Yuri Rumyantsev --- Thanks for the clarification. This bug can be closed as a user misunderstanding. 2016-08-04 14:08 GMT+03:00 rguenth at gcc dot gnu.org: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=72794 > > --- Comment #5 from Richard Biener --- > No, it's not a bug in the LTO phase - C99 inline simply does _not_ emit a > out-of-line copy. You have to add a extern declaration to force that. > > -- > You are receiving this mail because: > You reported the bug.
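The point about C99 inline can be shown with a short sketch. The function name is_reserved_demo is hypothetical and only stands in for the benchmark's is_reserved_word; the layout below is one way, under C99 rules, to make sure an out-of-line copy exists.

```c
/* Under C99 semantics this is only an "inline definition": on its own it
   does not provide an external definition, so a call that is not inlined
   can end up as an undefined reference at link time. */
inline int is_reserved_demo(int c)
{
    return c == 'i' || c == 'f';
}

/* A file-scope declaration without "inline" in the same translation unit
   turns the definition above into an external definition, which forces
   the out-of-line copy to be emitted here. */
extern int is_reserved_demo(int c);
```

Compiling old code with -std=gnu90, as suggested in the bug, avoids the problem in a different way, because GNU90 gives plain inline the opposite emission rule.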
[Bug c/72794] [7 regression] CF on spec2000/176.gcc after r238862.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=72794 --- Comment #4 from Yuri Rumyantsev --- I assume that there is still an issue in the LTO part of the compiler: even if we ignore the "inline" attribute, we (LTO) must not delete such functions from binaries. So this bug should be forwarded to the LTO phase. 2016-08-03 19:43 GMT+03:00 pinskia at gcc dot gnu.org: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=72794 > > Andrew Pinski changed: > >What|Removed |Added > > Status|UNCONFIRMED |RESOLVED > Resolution|--- |INVALID > > --- Comment #3 from Andrew Pinski --- > (In reply to Yuri Rumyantsev from comment #2) >> Yes, this option cures CF. Does it mean that we must compile spec2000 >> with this flag? > > Yes and it should be considered a portability flag. > > Basically GNU90 and ISO C99 inline behave slightly different which is why you > are seeing this. > > -- > You are receiving this mail because: > You reported the bug.
[Bug rtl-optimization/71956] [7 Regression] 176.gcc fails on 32 bits when compiled with -march=core-avx2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71956 --- Comment #4 from Yuri Rumyantsev --- That should read: "problem file is 176.gcc/src/sched.c, problem function sched_analyze_insn".
[Bug rtl-optimization/71956] [7 Regression] 176.gcc fails on 32 bits when compiled with -march=core-avx2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71956 --- Comment #3 from Yuri Rumyantsev --- It turned out that after r235653 (with a minor int->bool type change) 176.gcc started to RF. If we turn off the vrp phase, the benchmark passes. The problem file is sched.c. Note that avx2 is essential for reproducing. I am trying to understand what the issue is.
[Bug tree-optimization/71077] [7 Regression] gcc -lto raises ICE
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71077 Yuri Rumyantsev changed: What|Removed |Added CC||ysrumyan at gmail dot com --- Comment #5 from Yuri Rumyantsev --- We found out that after r235653, with its minor int->bool type change, 176.gcc still gets RF on a HSW machine in 32-bit mode at optimization level 3. If we turn off the VRP phase with the -fno-tree-vrp option, the benchmark passes. We need to understand why this simplification affects it.
[Bug testsuite/72850] [7 Regression] FAIL: gcc.dg/tree-ssa/pr69270-3.c scan-tree-dump-times uncprop1 ", 1" 4
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=72850 Yuri Rumyantsev changed: What|Removed |Added CC||ysrumyan at gmail dot com --- Comment #3 from Yuri Rumyantsev --- We also noticed a huge regression on the coremark-pro/core benchmark after this revision. I am attaching a test case to reproduce.
[Bug testsuite/72850] [7 Regression] FAIL: gcc.dg/tree-ssa/pr69270-3.c scan-tree-dump-times uncprop1 ", 1" 4
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=72850 --- Comment #4 from Yuri Rumyantsev --- Created attachment 39093 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=39093&action=edit test-case to reproduce It is sufficient to compile with the -Ofast option on an x86 machine.
[Bug middle-end/71734] [7 Regression] FAIL: libgomp.fortran/simd4.f90 -O3 -g execution test
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71734 --- Comment #7 from Yuri Rumyantsev ---
H.J.,

I've just checked this test with my locally fixed compiler and got:

Running /users/ysrumyan/workspaces/71261/gcc/testsuite/g++.dg/vect/vect.exp ...
PASS: g++.dg/vect/pr70729.cc -std=c++11 scan-tree-dump vect "LOOP VECTORIZED"
PASS: g++.dg/vect/pr70729.cc -std=c++11 (test for excess errors)
PASS: g++.dg/vect/pr70729.cc -std=c++14 scan-tree-dump vect "LOOP VECTORIZED"
PASS: g++.dg/vect/pr70729.cc -std=c++14 (test for excess errors)
PASS: g++.dg/vect/pr70729.cc -std=c++98 scan-tree-dump vect "LOOP VECTORIZED"
PASS: g++.dg/vect/pr70729.cc -std=c++98 (test for excess errors)

So it looks like this is not my fault.

2016-07-18 21:38 GMT+03:00 seurer at linux dot vnet.ibm.com:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71734
>
> Bill Seurer changed:
>
> What |Removed |Added
> CC| |seurer at linux dot vnet.ibm.com
>
> --- Comment #6 from Bill Seurer ---
> Looks like the simd3/4 tests now work with this patch but
> g++.dg/vect/pr70729.cc now fails:
>
> FAIL: g++.dg/vect/pr70729.cc -std=c++98 (test for excess errors)
> FAIL: g++.dg/vect/pr70729.cc -std=c++11 (test for excess errors)
> FAIL: g++.dg/vect/pr70729.cc -std=c++14 (test for excess errors)
>
> In the log I see
>
> /tmp/cc3mxFhd.s: Assembler messages:
> /tmp/cc3mxFhd.s:29: Error: unrecognized opcode: `xsxexpdp'
> compiler exited with status 1
>
> and also
>
> /home/seurer/gcc/gcc-test/gcc/testsuite/g++.dg/vect/pr70729.cc:7:10: fatal
> error: xmmintrin.h: No such file or directory
> compilation terminated.
> compiler exited with status 1
>
>
> Maybe some of the options you removed weren't really redundant?
>
> --
> You are receiving this mail because:
> You are on the CC list for the bug.
[Bug tree-optimization/56688] Fortran save statement prevents loop vectorization.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56688 --- Comment #7 from Yuri Rumyantsev --- I checked that the GCC 7 compiler still does not vectorize the loops in the thin6d function, which is the single hottest function in the 200.sixtrack benchmark.
[Bug rtl-optimization/65698] Non-optimal code for simple compare function for x86 32-bit target
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65698 --- Comment #3 from Yuri Rumyantsev --- I see that this bug has not been considered for a while. Here is my additional comment. First of all, this test was extracted from the mainGTU function of the bzip2 benchmark. The problem is that (1) the tree optimizer creates common subexpressions for i1 * 2 and i2 * 2; (2) the forward propagation pass does not substitute them back into the address computation, since use_killed_between is very simplified and handles only a simple basic block or a semi-hammock: /* Finally, if DEF_BB is the sole predecessor of TARGET_BB. */ if (single_pred_p (target_bb) && single_pred (target_bb) == def_bb) This function must be enhanced to handle an arbitrary CFG. Note that this deficiency increases register pressure by 2, so we get more spills/fills on the x86 32-bit target.
[Bug tree-optimization/70729] Loop marked with omp simd pragma is not vectorized
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70729 --- Comment #36 from Yuri Rumyantsev ---
#c33 testcase was not tested since I have some doubts about it. Note that the original problem was

#pragma omp simd
for (int i=0; i {
float w1 = C2[S_n + i] * w;
v1.v_i[i] += (int)w1;
C1[S_n + i] += w1;
}

and we must hoist S_n out of the loop to vectorize it.

2016-07-04 19:40 GMT+03:00 jakub at gcc dot gnu.org:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70729
>
> --- Comment #35 from Jakub Jelinek ---
> Doesn't it still miscompile the #c33 testcase?
> Say with __attribute__((noinline, noclone)) on baz and
> int v[2048];
>
> int
> main ()
> {
>   v[1023] = 5;
>   baz (v, v + 1023, v + 1024, v + 1023);
>   int i;
>   for (i = 0; i < 1024; i++)
>     if (v[i] != 5 * 6 || v[1024 + i] != (i == 1023 ? 5 * 6 : 5) * 9)
>       __builtin_abort ();
>   return 0;
> }
> (untested)?
[Bug tree-optimization/70729] Loop marked with omp simd pragma is not vectorized
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70729 --- Comment #34 from Yuri Rumyantsev ---
Thanks a lot, Jakub, for your detailed comments. I have a simple fix which cures the failures from 71734. It simply checks that the ref in question belongs to the simd loop:

diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
index ee04826..c710bbe 100644
--- a/gcc/tree-ssa-loop-im.c
+++ b/gcc/tree-ssa-loop-im.c
@@ -2128,7 +2128,7 @@ ref_indep_loop_p_1 (struct loop *loop, im_mem_ref *ref,
   if (bitmap_bit_p (refs_to_check, UNANALYZABLE_MEM_ID))
     return false;

-  if (loop->safelen > 0)
+  if (loop->safelen > 1 && bitmap_bit_p (refs_to_check, ref->id))
     {
       if (dump_file && (dump_flags & TDF_DETAILS))
         {

I checked that simd3.f90 and simd4.f90 from libgomp.fortran pass with it.

2016-07-04 18:30 GMT+03:00 jakub at gcc dot gnu.org:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70729
>
> --- Comment #33 from Jakub Jelinek ---
> In any case, loop->safelen > 0 test looks also wrong, if there are guarantees
> about single iteration only (safelen(1)), then there is nothing useful at all.
> So it must be loop->safelen >= 2.
>
> For foo in #c29, the q[0] load in foo can be hoisted before the loop.
> More complicated is e.g.:
> void baz (int *p, int *q, int *r, int *s)
> {
> #pragma omp simd
>   for (int i = 0; i < 1024; i++)
>     {
>       p[i] += q[0] * 6;
>       r[i] += s[0] * 9;
>     }
> }
> Here IMNSHO only q[0] * 6 can be hoisted before the loop, while it can alias
> p[1023] (or for x < 1023 p[x] if p[x] is initially 0), p[1023] could validly
> alias s[0] and thus s[0] * 9 must not be hoisted.
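To make the aliasing argument from comment #33 concrete, here is a minimal sketch. It assumes plain sequential execution (the omp simd pragma is deliberately omitted) and uses the pointer layout from the later testcase: p = v, q = s = v + 1023, r = v + 1024. The names baz_seq, baz_hoist_q, baz_hoist_both and same_as_seq are hypothetical helpers written for this illustration and are not part of GCC or of the testcase.

```c
#include <string.h>

#define N 1024

/* Reference: the body of baz executed sequentially (pragma omitted). */
void baz_seq(int *p, int *q, int *r, int *s)
{
    for (int i = 0; i < N; i++) {
        p[i] += q[0] * 6;
        r[i] += s[0] * 9;
    }
}

/* Hypothetical hoisting of q[0] * 6 only. */
void baz_hoist_q(int *p, int *q, int *r, int *s)
{
    int q6 = q[0] * 6;
    for (int i = 0; i < N; i++) {
        p[i] += q6;
        r[i] += s[0] * 9;
    }
}

/* Hypothetical hoisting of both invariant-looking loads. */
void baz_hoist_both(int *p, int *q, int *r, int *s)
{
    int q6 = q[0] * 6;
    int s9 = s[0] * 9;
    for (int i = 0; i < N; i++) {
        p[i] += q6;
        r[i] += s9;
    }
}

/* Fill v as in the testcase and run one variant on the aliasing layout. */
static void run(void (*fn)(int *, int *, int *, int *), int *v)
{
    memset(v, 0, 2 * N * sizeof *v);
    v[N - 1] = 5;
    fn(v, v + N - 1, v + N, v + N - 1);
}

/* Returns 1 if fn leaves v in the same state as the sequential loop. */
int same_as_seq(void (*fn)(int *, int *, int *, int *))
{
    static int ref[2 * N], got[2 * N];
    run(baz_seq, ref);
    run(fn, got);
    return memcmp(ref, got, sizeof ref) == 0;
}
```

With this layout, same_as_seq (baz_hoist_q) holds, while same_as_seq (baz_hoist_both) does not: the store to p[1023] in the last iteration changes s[0] before r[1023] is updated, so the hoisted s[0] * 9 is stale for that iteration.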
[Bug tree-optimization/70729] Loop marked with omp simd pragma is not vectorized
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70729 --- Comment #37 from Yuri Rumyantsev ---
Jakub, I assume that your #c33 test-case is not correct, i.e. it cannot be marked with pragma omp simd. For example, it aborts even if we turn off the lim phase:

my_g++ -O3 -m64 t33.cpp -o t33.exe.1 -fopenmp -fno-tree-loop-im
/users/ysrumyan/70729$ ./t33.exe.1
Aborted (core dumped)

2016-07-04 19:47 GMT+03:00 Yuri Rumyantsev:
> #c33 testcase was not tested since I have some doubts about it. Note
> that original problem was
> #pragma omp simd
> for (int i=0; i {
> float w1 = C2[S_n + i] * w;
> v1.v_i[i] += (int)w1;
> C1[S_n + i] += w1;
> }
>
> and we must hoist S_n out of loop to vectorize it.
>
> 2016-07-04 19:40 GMT+03:00 jakub at gcc dot gnu.org:
>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70729
>>
>> --- Comment #35 from Jakub Jelinek ---
>> Doesn't it still miscompile the #c33 testcase?
>> Say with __attribute__((noinline, noclone)) on baz and
>> int v[2048];
>>
>> int
>> main ()
>> {
>>   v[1023] = 5;
>>   baz (v, v + 1023, v + 1024, v + 1023);
>>   int i;
>>   for (i = 0; i < 1024; i++)
>>     if (v[i] != 5 * 6 || v[1024 + i] != (i == 1023 ? 5 * 6 : 5) * 9)
>>       __builtin_abort ();
>>   return 0;
>> }
>> (untested)?
[Bug tree-optimization/56688] static/saved variables prevent loop vectorization.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56688 --- Comment #8 from Yuri Rumyantsev --- I checked that if we comment out the 'save' statement in thin6d.f, all loops are vectorized: grep -c 'LOOP VECTORIZED' thin6d.f.149t.vect 32
[Bug rtl-optimization/71956] [7 Regression] 176.gcc fails on 32 bits when compiled with -march=core-avx2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71956 --- Comment #5 from Yuri Rumyantsev ---
This bug is fixed by

Author: ppalka
Date: Sat Aug 27 22:00:17 2016
New Revision: 239798

URL: https://gcc.gnu.org/viewcvs?rev=239798&root=gcc&view=rev
Log:
Fix folding of VECTOR_CST comparisons

gcc/ChangeLog:

PR tree-optimization/71077
PR tree-optimization/68542
* fold-const.c (fold_relational_const): Fix folding of
VECTOR_CST comparisons that have a scalar boolean result type.
(selftest::test_vector_folding): New static function.
(selftest::fold_const_c_tests): Call it.

gcc/testsuite/ChangeLog:

PR tree-optimization/71077
* gcc.target/i386/pr71077.c: New test.

Added:
    trunk/gcc/testsuite/gcc.target/i386/pr71077.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/fold-const.c
    trunk/gcc/testsuite/ChangeLog

So this bug should be closed.
[Bug tree-optimization/77498] [7 regression] Performance drop after r239414 on spec2000/172mgrid
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77498 --- Comment #1 from Yuri Rumyantsev --- Created attachment 39574 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=39574&action=edit test-case to reproduce Need to compile with -O2 -ffast-math to reproduce.
[Bug tree-optimization/77498] New: [7 regression] Performance drop after r239414 on spec2000/172mgrid
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77498
Bug ID: 77498
Summary: [7 regression] Performance drop after r239414 on spec2000/172mgrid
Product: gcc
Version: 7.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: ysrumyan at gmail dot com
Target Milestone: ---

We noticed a significant regression after https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=239414
I attached a simple routine to reproduce it. We can see that register pressure after PRE is 2x higher with this patch. The regression is worse in 32-bit mode.
[Bug tree-optimization/77445] [7 Regression] Performance drop after r239219 on coremark test
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77445 --- Comment #1 from Yuri Rumyantsev --- Created attachment 39535 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=39535&action=edit test-case to reproduce It is sufficient to compile it with the -Ofast option.
[Bug tree-optimization/77445] New: [7 Regression] Performance drop after r239219 on coremark test
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77445
Bug ID: 77445
Summary: [7 Regression] Performance drop after r239219 on coremark test
Product: gcc
Version: 7.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: ysrumyan at gmail dot com
Target Milestone: ---

We noticed a huge (32%) performance drop on coremark-pro/core (the former coremark benchmark) after http://gcc.gnu.org/viewcvs/gcc?view=revision&revision=239219
The problematic part is

if (optimize_edge_for_speed_p (taken_edge))

which does not look correct, since we now have a lot of missed opportunities for jump threading, like:

test.c.111t.thread2:FSM jump-thread path not considered: duplication of 4 insns is needed and optimizing for size.
test.c.111t.thread2:FSM jump-thread path not considered: duplication of 4 insns is needed and optimizing for size.
test.c.111t.thread2:FSM jump-thread path not considered: duplication of 4 insns is needed and optimizing for size.
test.c.111t.thread2:FSM jump-thread path not considered: duplication of 4 insns is needed and optimizing for size.
test.c.111t.thread2:FSM jump-thread path not considered: duplication of 4 insns is needed and optimizing for size.
test.c.111t.thread2:FSM jump-thread path not considered: duplication of 4 insns is needed and optimizing for size.
test.c.111t.thread2:FSM jump-thread path not considered: duplication of 5 insns is needed and optimizing for size.
test.c.111t.thread2:FSM jump-thread path not considered: duplication of 5 insns is needed and optimizing for size.
test.c.111t.thread2:FSM jump-thread path not considered: duplication of 5 insns is needed and optimizing for size.
test.c.167t.thread3:FSM jump-thread path not considered: duplication of 5 insns is needed and optimizing for size.
test.c.167t.thread3:FSM jump-thread path not considered: duplication of 5 insns is needed and optimizing for size.
test.c.167t.thread3:FSM jump-thread path not considered: duplication of 5 insns is needed and optimizing for size.
test.c.167t.thread3:FSM jump-thread path not considered: duplication of 5 insns is needed and optimizing for size.
test.c.167t.thread3:FSM jump-thread path not considered: duplication of 5 insns is needed and optimizing for size.
test.c.167t.thread3:FSM jump-thread path not considered: duplication of 5 insns is needed and optimizing for size.
test.c.167t.thread3:FSM jump-thread path not considered: duplication of 4 insns is needed and optimizing for size.
test.c.167t.thread3:FSM jump-thread path not considered: duplication of 4 insns is needed and optimizing for size.
test.c.167t.thread3:FSM jump-thread path not considered: duplication of 4 insns is needed and optimizing for size.
test.c.170t.thread4:FSM jump-thread path not considered: duplication of 5 insns is needed and optimizing for size.
test.c.170t.thread4:FSM jump-thread path not considered: duplication of 5 insns is needed and optimizing for size.
test.c.170t.thread4:FSM jump-thread path not considered: duplication of 5 insns is needed and optimizing for size.
test.c.170t.thread4:FSM jump-thread path not considered: duplication of 5 insns is needed and optimizing for size.
test.c.170t.thread4:FSM jump-thread path not considered: duplication of 5 insns is needed and optimizing for size.
test.c.170t.thread4:FSM jump-thread path not considered: duplication of 5 insns is needed and optimizing for size.
test.c.170t.thread4:FSM jump-thread path not considered: duplication of 4 insns is needed and optimizing for size.
test.c.170t.thread4:FSM jump-thread path not considered: duplication of 4 insns is needed and optimizing for size.
test.c.170t.thread4:FSM jump-thread path not considered: duplication of 4 insns is needed and optimizing for size.

If we change it to

if (!optimize_function_for_size_p (cfun))

performance is back. I attach the test-case to reproduce the issue.
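For readers unfamiliar with what an FSM jump-thread buys, here is a small hand-written sketch. It is not GCC output and not the attached test-case; the state machine (counting runs of digits) and both function names are hypothetical. The first function dispatches on the state variable in every iteration; the second is the shape after threading, where each transition jumps straight to the code of the next state because that state is already known at the transition point.

```c
#include <ctype.h>

/* Switch-driven state machine: re-dispatches on `st` for every character. */
int count_runs_switch(const char *s)
{
    enum { OUTSIDE, INSIDE } st = OUTSIDE;
    int runs = 0;

    for (; *s; s++) {
        switch (st) {
        case OUTSIDE:
            if (isdigit((unsigned char)*s)) {
                runs++;
                st = INSIDE;
            }
            break;
        case INSIDE:
            if (!isdigit((unsigned char)*s))
                st = OUTSIDE;
            break;
        }
    }
    return runs;
}

/* Threaded form: the state lives in the program counter, so no dispatch
   on a state variable is needed after a transition. */
int count_runs_threaded(const char *s)
{
    int runs = 0;

outside:
    if (!*s)
        return runs;
    if (isdigit((unsigned char)*s++)) {
        runs++;
        goto inside;
    }
    goto outside;

inside:
    if (!*s)
        return runs;
    if (!isdigit((unsigned char)*s++))
        goto outside;
    goto inside;
}
```

The threaded form duplicates the end-of-string check and the character load into each state, which is the kind of small code duplication ("duplication of 4 insns") that the dump messages above refuse when the path is treated as optimized for size.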
[Bug tree-optimization/71077] [7 Regression] gcc -lto raises ICE
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71077 --- Comment #7 from Yuri Rumyantsev ---
I checked that the proposed patch fixes the runtime failure (RF) for 176.gcc. Please go ahead and commit your patch to trunk.
Thanks.
Yuri.

2016-08-12 20:14 GMT+03:00 patrick at parcs dot ath.cx <gcc-bugzi...@gcc.gnu.org>:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71077
>
> --- Comment #6 from patrick at parcs dot ath.cx ---
> On Fri, 12 Aug 2016, ysrumyan at gmail dot com wrote:
>
>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71077
>>
>> Yuri Rumyantsev changed:
>>
>> What|Removed |Added
>> CC||ysrumyan at gmail dot com
>>
>> --- Comment #5 from Yuri Rumyantsev ---
>> We found out that after r235653 with minor change of int->bool type 176.gcc
>> still RF on HSW machine in 32-bit if opt level equal 3. If we turn off VRP
>> phase by -fno-tree-vrp option benchmark is passed. Need to understand why
>> this simplification affects it.
>
> My only guess is that the combining step still doesn't handle
> VECTOR_CSTs correctly. Could you please check if this patch fixes the
> runtime failure?
>
> diff --git a/gcc/tree-ssa-threadedge.c b/gcc/tree-ssa-threadedge.c
> index 170e456..0db7bda 100644
> --- a/gcc/tree-ssa-threadedge.c
> +++ b/gcc/tree-ssa-threadedge.c
> @@ -577,6 +577,7 @@ simplify_control_stmt_condition_1 (edge e,
>    if (handle_dominating_asserts
>        && (cond_code == EQ_EXPR || cond_code == NE_EXPR)
>        && TREE_CODE (op0) == SSA_NAME
> +      && INTEGRAL_TYPE_P (TREE_TYPE (op0))
>        && integer_zerop (op1))
>      {
>        gimple *def_stmt = SSA_NAME_DEF_STMT (op0);
[Bug target/77344] Internal Compiler Error with arch knl
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77344
Yuri Rumyantsev changed:

What|Removed |Added
CC||ysrumyan at gmail dot com

--- Comment #3 from Yuri Rumyantsev ---
I checked that this bug was fixed on the GCC 6 branch some time ago; a fresh build of it compiles this file successfully:

GNU Fortran2008 (Revision=239431/svn-rev:239431/) version 6.1.1 20160812 (x86_64-pc-linux-gnu)
compiled by GNU C version 6.1.1 20160812

It looks like you need to get the next release of the GCC 6 branch compiler. Note that I can reproduce the ICE with an earlier GCC 6 branch compiler: compiled by GNU C version 6.1.1 20160617.
[Bug rtl-optimization/78116] [7 regression] Performance drop after r241173 on avx512 target
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78116 --- Comment #1 from Yuri Rumyantsev --- Created attachment 39892 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=39892&action=edit test-case to reproduce Must be compiled with "-Ofast -funroll-loops -march=knl" options.
[Bug rtl-optimization/78116] New: [7 regression] Performance drop after r241173 on avx512 target
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78116
Bug ID: 78116
Summary: [7 regression] Performance drop after r241173 on avx512 target
Product: gcc
Version: 7.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: ysrumyan at gmail dot com
Target Milestone: ---

I attached a simple test-case to reproduce the issue. Before this revision the loop marked with label .L27 had 25 instructions; after it, additional fills were added and the loop has 8 more instructions. As a result we got a > 6% performance drop on an important benchmark.
[Bug rtl-optimization/78116] [7 regression] Performance drop after r241173 on avx512 target
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78116 --- Comment #3 from Yuri Rumyantsev --- Created attachment 39910 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=39910&action=edit another test-case Must be compiled with "-Ofast -fopenmp -funroll-loops -march=knl"
[Bug rtl-optimization/78116] [7 regression] Performance drop after r241173 on avx512 target
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78116 --- Comment #2 from Yuri Rumyantsev --- We also found a performance drop on another important benchmark with the same symptoms after r241170: the loop marked with .L18 has 12 more fills from the stack. The test-case will be attached.
[Bug ipa/78268] [7 Regression] internal compiler error: Segmentation fault
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78268 Yuri Rumyantsev changed: What|Removed |Added CC||ysrumyan at gmail dot com --- Comment #5 from Yuri Rumyantsev --- We just got an ICE for 471.omnetpp on x86; the guilty revision is r241990.
[Bug tree-optimization/78348] [7 REGRESSION] 15% performance drop for coremark-pro/nnet-test after r242038
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78348 --- Comment #1 from Yuri Rumyantsev --- Created attachment 40036 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40036&action=edit test-case to reproduce Must be compiled with the -O3 option to reproduce.
[Bug tree-optimization/77445] [7 Regression] Performance drop after r239219 on coremark test
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77445 --- Comment #4 from Yuri Rumyantsev --- Ping. Do you have any progress on this? Thanks.
[Bug tree-optimization/78348] New: [7 REGRESSION] 15% performance drop for coremark-pro/nnet-test after r242038
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78348
Bug ID: 78348
Summary: [7 REGRESSION] 15% performance drop for coremark-pro/nnet-test after r242038
Product: gcc
Version: 7.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: ysrumyan at gmail dot com
Target Milestone: ---

We noticed a huge (>15%) performance drop after the fix in the loop distribution phase. Before the fix, distribution was not performed since the loop contains an anti (write after read) dependence. Now distribution is performed and the memmove and memset built-ins are generated. We don't have a fast implementation of memmove on Haswell, which leads to the performance regression. But note that the dependence analysis is very poor and does not detect the simple copying of one struct field to another. I attached a simple test-case to reproduce this issue. Note also that the fix to pg_add_dependence_edges is correct and must not be removed.
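As a rough illustration of the transformation described above, here is a minimal sketch of a loop with an anti dependence and the form it takes after distribution into library calls. The struct layout, the array sizes and the loop body are assumptions made for this example; they are not taken from the attached test-case, and the second function is written by hand rather than produced by GCC.

```c
#include <string.h>

#define ROWS 4
#define COLS 8

/* Hypothetical structure with two equally shaped fields. */
struct params {
    int x1[ROWS][COLS];
    int x2[ROWS][COLS];
};

/* Original shape: x2 is read and then overwritten in the same iteration,
   which is a write-after-read (anti) dependence on x2. */
void copy_and_clear_loop(struct params *par)
{
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++) {
            par->x1[i][j] = par->x2[i][j];
            par->x2[i][j] = 0;
        }
}

/* After distribution: the inner loop is split into a copy and a clear,
   and each becomes a library call. memmove is the conservative choice
   when the compiler has not proved that source and destination differ. */
void copy_and_clear_distributed(struct params *par)
{
    for (int i = 0; i < ROWS; i++) {
        memmove(par->x1[i], par->x2[i], sizeof par->x1[i]);
        memset(par->x2[i], 0, sizeof par->x2[i]);
    }
}

/* Returns 1 if both forms leave the structure in the same state. */
int distribution_matches(void)
{
    struct params a, b;

    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++) {
            a.x1[i][j] = -1;
            a.x2[i][j] = i * COLS + j + 1;
        }
    b = a;
    copy_and_clear_loop(&a);
    copy_and_clear_distributed(&b);
    return memcmp(&a, &b, sizeof a) == 0;
}
```

Since x1 and x2 are distinct fields, the rows can never overlap here, so memcpy would be equally valid for the copy.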
[Bug tree-optimization/78496] New: Missed opportunities for jump threading
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78496
Bug ID: 78496
Summary: Missed opportunities for jump threading
Product: gcc
Version: 7.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: ysrumyan at gmail dot com
Target Milestone: ---

Created attachment 40131 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40131&action=edit
test-case to reproduce, compile with -O3 option.

We noticed a huge performance drop on one important benchmark, caused by hoisting and collecting the comparisons that participate in conditional branches. Here are the comments provided by Richard on it:

Note this is a general issue with PRE which tends to see partial redundancies when it can compute an expression to a constant on one edge. There is nothing wrong with that but the particular example shows the lack of a cost model with respect to register pressure (same applies to other GIMPLE optimization passes). In this case we have a lot of expression anticipated from the same blocks where on one incoming edge their value is constant. Profitability here really depends on the "distance" of the to be inserted PHI and its use I guess.

We're missing quite some jump-threading here as well:

  :
  # x1_197 = PHI <x1_261(15), x1_435(123), x1_435(105)>
  # _407 = PHI <_16(15), _16(123), 0(105)>
  # aa1_410 = PHI <aa1_185(15), aa1_185(123), aa1_216(105)>
  # d1_413 = PHI <d1_191(15), d1_191(123), d1_432(105)>
  # w1_416 = PHI <w1_260(15), w1_260(123), 0(105)>
  # v1_377 = PHI <v1_558(15), v1_558(123), 0(105)>
  # oo1_371 = PHI <oo1_567(15), oo1_567(123), oo1_194(105)>
  # ss1_376 = PHI <ss1_576(15), ss1_576(123), ss1_192(105)>
  # r1_609 = PHI <r1_585(15), r1_585(123), r1_190(105)>
  # _612 = PHI <_596(15), _596(123), _188(105)>
  # out_ind_lsm.82_322 = PHI <out_ind_lsm.82_321(15), out_ind_lsm.82_321(123), out_ind_lsm.82_532(105)>
  _549 = w1_416 <= 899;
  _548 = _407 > 839;
  _541 = _548 & _549;
  if (_541 != 0)
    goto ;
  else
    goto ;

here 105 -> 16 -> 124 (forwarder) -> 18 which would eventually make PRE behave somewhat saner (avoiding the far distances). The case appears with phicprop1 (or rather DOM, itself missing a followup transform with respect to folding a degenerate constant PHI plus the followup secondary threading opportunities).

The backwards threader doesn't exploit the above opportunity though. Our forward threaders (like DOM) do. Unfortunately it requires quite a few iterations to get all opportunities exploited... (inserting 9 DOM/phi-only-cprop pass pairs "helps")

I suggest to open a bugreport for this. Jeff may want to look at the threading issue (I believe the backward threader _does_ iterate).

I attach a test-case to reproduce the issue.
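The threading opportunity in the dump can be sketched in plain C. This is a simplified model under stated assumptions: only the two PHIs that feed the condition (_407 and w1_416) are modelled, the flag from_105 stands for "control arrived over the edge from block 105", the return values 1 and 2 stand for the two branch targets, and all names are hypothetical.

```c
/* Shape before threading: the values are merged first (the PHIs), then the
   condition is evaluated for every predecessor. */
int branch_after_merge(int from_105, int v16, int w260)
{
    int v407 = from_105 ? 0 : v16;   /* _407   = PHI <_16, _16, 0(105)>         */
    int w416 = from_105 ? 0 : w260;  /* w1_416 = PHI <w1_260, w1_260, 0(105)>   */
    int c549 = w416 <= 899;          /* _549 */
    int c548 = v407 > 839;           /* _548 */

    if ((c548 & c549) != 0)          /* _541 */
        return 1;
    return 2;
}

/* Shape after threading the 105 edge: with _407 == 0 the test 0 > 839 is
   false, so _541 is 0 and the else target is known without evaluating the
   condition or merging anything. */
int branch_threaded(int from_105, int v16, int w260)
{
    if (from_105)
        return 2;
    if (((v16 > 839) & (w260 <= 899)) != 0)
        return 1;
    return 2;
}
```

Once the 105 edge bypasses the merge block, the PHIs there no longer receive a constant argument from it, which is presumably what would stop PRE from seeing the condition as partially redundant in that block.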
[Bug tree-optimization/78348] [7 REGRESSION] 15% performance drop for coremark-pro/nnet-test after r242038
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78348 --- Comment #5 from Yuri Rumyantsev ---
Yes, I think so.

2016-11-15 14:49 GMT+03:00 rguenth at gcc dot gnu.org:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78348
>
> Richard Biener changed:
>
> What|Removed |Added
> Status|UNCONFIRMED |NEW
> Last reconfirmed||2016-11-15
> Ever confirmed|0 |1
>
> --- Comment #4 from Richard Biener ---
>> The issue is that memcpy must be produced instead of memmove which does
>> not have optimized version for avx2 x86 and simply uses byte copy.
>
> I'd expected a if (! overlap) memcpy () else byte-copy at least.
>
> Note the loop distribution code doesn't try to be clever in choosing memcpy
> over memmove (using dependence analysis). So improving loop distribution
> (adding a PKIND_MEMMOVE and conservatively using that from dependence
> analysis) is a possibility as well. But we have
>
> (compute_affine_dependence
>   stmt_a: _2 = par.0_1->x2[i_19][j_20];
>   stmt_b: par.0_1->x1[i_19][j_20] = _2;
> (analyze_overlapping_iterations
>   (chrec_a = {0, +, 1}_2)
>   (chrec_b = {0, +, 1}_2)
>   (overlap_iterations_a = [0])
>   (overlap_iterations_b = [0]))
> (analyze_overlapping_iterations
>   (chrec_a = i_19)
>   (chrec_b = i_19)
>   (overlap_iterations_a = [0])
>   (overlap_iterations_b = [0]))
> (analyze_overlapping_iterations
>   (chrec_a = 33280)
>   (chrec_b = 12800)
> (analyze_ziv_subscript
> )
>   (overlap_iterations_a = no dependence)
>   (overlap_iterations_b = no dependence))
> ) -> no dependence
>
> so I think we could use memcpy for all no dependence cases?
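The "if (! overlap) memcpy () else byte-copy" idea from comment #4 can be sketched as a small wrapper. This is only an illustration of the run-time check, not what any libc or GCC actually does; the names ranges_overlap and copy_checked are hypothetical, and the pointer comparison assumes a flat address space where converting pointers to uintptr_t preserves their order.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 if [a, a + n) and [b, b + n) share at least one byte. */
int ranges_overlap(const void *a, const void *b, size_t n)
{
    uintptr_t pa = (uintptr_t)a;
    uintptr_t pb = (uintptr_t)b;

    return pa < pb ? pb - pa < n : pa - pb < n;
}

/* Uses the (usually faster) memcpy when the ranges are disjoint and falls
   back to memmove, which is defined for overlapping ranges, otherwise. */
void copy_checked(void *dst, const void *src, size_t n)
{
    if (!ranges_overlap(dst, src, n))
        memcpy(dst, src, n);
    else
        memmove(dst, src, n);
}
```

When dependence analysis already reports "no dependence", as in the dump above, the check is unnecessary and the compiler could emit memcpy directly, which is what comment #4 suggests.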