[Bug tree-optimization/58497] SLP vectorizes identical operations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58497

Andrew Pinski changed:

           What    |Removed                       |Added
----------------------------------------------------------------------------
           Assignee|unassigned at gcc dot gnu.org |pinskia at gcc dot gnu.org
             Status|NEW                           |ASSIGNED

--- Comment #13 from Andrew Pinski ---
Mine for GCC 13, I have patches which turn:

  W_6 = BIT_INSERT_EXPR ;
  W_7 = BIT_INSERT_EXPR ;
  W_8 = BIT_INSERT_EXPR ;
  W_9 = BIT_INSERT_EXPR ;

Into:

  W_9 = {_2,_2,_2,_2};

This improvement deals with bitfields but vectors have a similar issue with
Bit_inserts so I deal with it there.
[Bug tree-optimization/58497] SLP vectorizes identical operations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58497

Richard Biener changed:

           What    |Removed                       |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                      |NEW
           Assignee|rguenth at gcc dot gnu.org    |unassigned at gcc dot gnu.org

--- Comment #12 from Richard Biener ---
We now generate

g:
.LFB0:
        .cfi_startproc
        pxor    %xmm1, %xmm1
        addl    $1, %edi
        movaps  %xmm1, %xmm0
        cvtsi2ss        %edi, %xmm0
        shufps  $36, %xmm0, %xmm1
        movaps  %xmm1, %xmm0
        cvtsi2ss        %edi, %xmm0
        shufps  $196, %xmm0, %xmm1
        movaps  %xmm1, %xmm0
        unpcklps        %xmm1, %xmm0
        cvtsi2ss        %edi, %xmm0
        shufps  $225, %xmm1, %xmm0
        cvtsi2ss        %edi, %xmm0
        ret

or with SSE4

g:
.LFB0:
        .cfi_startproc
        addl    $1, %edi
        pxor    %xmm1, %xmm1
        pxor    %xmm0, %xmm0
        cvtsi2ss        %edi, %xmm1
        insertps        $48, %xmm1, %xmm0
        insertps        $32, %xmm1, %xmm0
        insertps        $16, %xmm1, %xmm0
        movss   %xmm1, %xmm0
        ret

on GIMPLE we end up with

g (int x)
{
  float4 W;
  int _1;
  float _2;

  [local count: 1073741824]:
  _1 = x_3(D) + 1;
  _2 = (float) _1;
  W_6 = BIT_INSERT_EXPR ;
  W_7 = BIT_INSERT_EXPR ;
  W_8 = BIT_INSERT_EXPR ;
  W_9 = BIT_INSERT_EXPR ;
  return W_9;
}

so we fail to recognize the splat.  The GIMPLE looks like this very early
already (update-address-taken + forwprop).

SLP vectorization doesn't treat a BIT_INSERT_EXPR "reduction" as a sink, but
we could probably pattern-match a VEC_DUPLICATE_EXPR for the above.
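
The splat recognition suggested in comment #12 can be illustrated with a toy model. This is only a sketch under stated assumptions, not GCC code: `struct lane_insert`, `chain_is_splat` and `NLANES` are hypothetical names invented here, and an integer `value_id` stands in for an SSA name such as `_2`. It shows the check a pattern-matcher would need: every lane is written, and every write uses the same scalar.

```c
#include <assert.h>
#include <stddef.h>

#define NLANES 4  /* hypothetical: lanes in the float4 vector from the thread */

/* Toy stand-in for one insert of a scalar into one vector lane.
   value_id plays the role of an SSA name (e.g. 2 for _2).  */
struct lane_insert
{
  unsigned lane;
  int value_id;
};

/* Return 1 and store the common value in *splat_value if the chain writes
   every lane and all writes use the same value; return 0 otherwise.  */
int
chain_is_splat (const struct lane_insert *chain, size_t n, int *splat_value)
{
  int seen[NLANES] = { 0 };

  if (n == 0)
    return 0;

  for (size_t i = 0; i < n; i++)
    {
      assert (chain[i].lane < NLANES);
      /* Any differing scalar means the result is not a uniform vector.  */
      if (chain[i].value_id != chain[0].value_id)
        return 0;
      seen[chain[i].lane] = 1;
    }

  /* A lane that was never written keeps the old (unknown) contents.  */
  for (unsigned l = 0; l < NLANES; l++)
    if (!seen[l])
      return 0;

  *splat_value = chain[0].value_id;
  return 1;
}
```

In this model the four inserts from the dump above (lanes 0 to 3, all of `_2`) are accepted as a splat, while a chain that leaves a lane unwritten or mixes two scalars is rejected.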
[Bug tree-optimization/58497] SLP vectorizes identical operations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58497

--- Comment #11 from Richard Biener ---
Author: rguenth
Date: Thu Nov 12 09:00:37 2015
New Revision: 230216

URL: https://gcc.gnu.org/viewcvs?rev=230216&root=gcc&view=rev
Log:
2015-11-12  Richard Biener

        PR tree-optimization/58497
        * tree-vect-generic.c: Include gimplify.h.
        (tree_vec_extract): Lookup constant/constructor DEFs.
        (do_cond): Unshare cond.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/tree-vect-generic.c
[Bug tree-optimization/58497] SLP vectorizes identical operations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58497

Rainer Orth changed:

           What    |Removed                       |Added
----------------------------------------------------------------------------
                 CC|                              |ro at gcc dot gnu.org

--- Comment #5 from Rainer Orth ---
Created attachment 36685
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36685&action=edit
-fdump-tree-optimized dump
[Bug tree-optimization/58497] SLP vectorizes identical operations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58497

--- Comment #6 from Rainer Orth ---
The new gcc.dg/tree-ssa/vector-5.c testcase FAILs on 64-bit Solaris/SPARC:

FAIL: gcc.dg/tree-ssa/vector-5.c scan-tree-dump-times optimized " * 3;" 1

        Rainer
[Bug tree-optimization/58497] SLP vectorizes identical operations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58497

--- Comment #8 from Rainer Orth ---
Created attachment 36687
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36687&action=edit
-fdump-tree-dom2-details dump
[Bug tree-optimization/58497] SLP vectorizes identical operations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58497

--- Comment #9 from rguenther at suse dot de ---
On Wed, 11 Nov 2015, ro at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58497
>
> --- Comment #8 from Rainer Orth ---
> Created attachment 36687
>   --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36687&action=edit
> -fdump-tree-dom2-details dump

Ok, it's not supposed to look like this after lowering.  Does SPARC
not have an integer multiply instruction (SImode)?  Then the FAIL
is expected (though folding halfway does the transform anyway...).
[Bug tree-optimization/58497] SLP vectorizes identical operations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58497

--- Comment #10 from Richard Biener ---
Index: gcc/tree-vect-generic.c
===================================================================
--- gcc/tree-vect-generic.c     (revision 230146)
+++ gcc/tree-vect-generic.c     (working copy)
@@ -105,6 +106,15 @@ static inline tree
 tree_vec_extract (gimple_stmt_iterator *gsi, tree type,
                   tree t, tree bitsize, tree bitpos)
 {
+  if (TREE_CODE (t) == SSA_NAME)
+    {
+      gimple *def_stmt = SSA_NAME_DEF_STMT (t);
+      if (is_gimple_assign (def_stmt)
+          && (gimple_assign_rhs_code (def_stmt) == VECTOR_CST
+              || (bitpos
+                  && gimple_assign_rhs_code (def_stmt) == CONSTRUCTOR)))
+        t = gimple_assign_rhs1 (def_stmt);
+    }
   if (bitpos)
     {
       if (TREE_CODE (type) == BOOLEAN_TYPE)

should fix it (in testing).
[Bug tree-optimization/58497] SLP vectorizes identical operations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58497

--- Comment #7 from Richard Biener ---
(In reply to Rainer Orth from comment #6)
> The new gcc.dg/tree-ssa/vector-5.c testcase FAILs on 64-bit Solaris/SPARC:
>
> FAIL: gcc.dg/tree-ssa/vector-5.c scan-tree-dump-times optimized " * 3;" 1
>
>       Rainer

  :
  v1_2 = {i_1(D), i_1(D), i_1(D), i_1(D)};
  _6 = i_1(D);
  _7 = i_1(D) * 3;
  _8 = i_1(D);
  _9 = i_1(D) * 3;
  _10 = i_1(D);
  _11 = i_1(D) * 3;
  _12 = i_1(D);
  _13 = i_1(D) * 3;
  _3 = {_7, _9, _11, _13};

err, why would DOM, which runs after lower_vector_ssa, _not_ CSE those
multiplications?  Please attach dom2-details dumps.
[Bug tree-optimization/58497] SLP vectorizes identical operations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58497

--- Comment #3 from Richard Biener ---
Author: rguenth
Date: Thu Oct 22 13:36:46 2015
New Revision: 229173

URL: https://gcc.gnu.org/viewcvs?rev=229173&root=gcc&view=rev
Log:
2015-10-22  Richard Biener

        PR tree-optimization/58497
        * tree-vect-generic.c (ssa_uniform_vector_p): New helper.
        (expand_vector_operations_1): Use it.  Lower operations on
        all uniform vectors to scalar operations if the HW supports it.

        * gcc.dg/tree-ssa/vector-5.c: New testcase.

Added:
    trunk/gcc/testsuite/gcc.dg/tree-ssa/vector-5.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-vect-generic.c

--- Comment #4 from Richard Biener ---
Now we fix this up in veclower, still the bug should be addressed in SLP
directly (also because it affects cost decisions).
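
To make the effect of the veclower change in r229173 concrete, here is a minimal source-level sketch. The contents of vector-5.c are not shown in this thread, so the shape below is an assumption derived from the dump quoted in comment #7 (a uniform vector of `i` multiplied by 3); the names `mul3_vector`, `mul3_lowered` and `check_uniform` are made up for illustration. The first function is the vector form, the second is what "lower to a scalar operation" amounts to: one scalar multiply followed by rebuilding the uniform vector.

```c
#include <assert.h>

typedef int v4si __attribute__((vector_size(16)));

/* Vector form: both operands are uniform vectors.  */
v4si
mul3_vector (int i)
{
  v4si v1 = { i, i, i, i };
  v4si v3 = { 3, 3, 3, 3 };
  return v1 * v3;
}

/* Hand-written equivalent of the lowered form: a single scalar
   multiply whose result is broadcast to all lanes.  */
v4si
mul3_lowered (int i)
{
  int s = i * 3;
  v4si r = { s, s, s, s };
  return r;
}

/* Abort unless every lane of V equals EXPECTED.  */
void
check_uniform (v4si v, int expected)
{
  for (int k = 0; k < 4; k++)
    assert (v[k] == expected);
}
```

Both functions compute the same value for any `i` whose product does not overflow; the lowered form simply needs one multiplication instead of four lane-wise ones.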
[Bug tree-optimization/58497] SLP vectorizes identical operations
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58497

Richard Biener rguenth at gcc dot gnu.org changed:

           What    |Removed                       |Added
----------------------------------------------------------------------------
           Keywords|                              |missed-optimization
             Target|                              |x86_64-*-*
             Status|UNCONFIRMED                   |ASSIGNED
   Last reconfirmed|                              |2013-09-23
         Depends on|                              |53947
           Assignee|unassigned at gcc dot gnu.org |rguenth at gcc dot gnu.org
     Ever confirmed|0                             |1

--- Comment #1 from Richard Biener rguenth at gcc dot gnu.org ---
Heh ;)  I suppose this started with BIT_FIELD_REF support in SLP, 4.8 didn't
vectorize this at all.

Note that with for example

typedef float float4 __attribute__((vector_size(16)));

float4 g(int x)
{
  float4 W;
  W[0]=W[1]=x+1;
  W[2]=x+2;
  W[3]=x+3;
  return W;
}

vectorizing two identical operations may be profitable.  But yes, if all
scalars are the same there is no point in doing it.  And the cost model
should have disabled it as well (though likely the four stores made it
profitable in the end).

I will have a look at some point.  OTOH generated code is

g:
.LFB0:
        .cfi_startproc
        movl    %edi, -12(%rsp)
        movd    -12(%rsp), %xmm1
        pshufd  $0, %xmm1, %xmm0
        paddd   .LC0(%rip), %xmm0
        cvtdq2ps        %xmm0, %xmm0
        ret

vs. -fno-tree-vectorize:

g:
.LFB0:
        .cfi_startproc
        xorps   %xmm1, %xmm1
        addl    $1, %edi
        xorps   %xmm0, %xmm0
        cvtsi2ss        %edi, %xmm1
        movaps  %xmm0, %xmm2
        movss   %xmm1, %xmm2
        shufps  $36, %xmm2, %xmm0
        movaps  %xmm0, %xmm2
        movss   %xmm1, %xmm2
        shufps  $196, %xmm2, %xmm0
        movaps  %xmm0, %xmm2
        unpcklps        %xmm0, %xmm0
        movss   %xmm1, %xmm0
        shufps  $225, %xmm2, %xmm0
        movss   %xmm1, %xmm0
        ret

so clearly a win, but improvable to something like

        addl    $1, %edi
        cvtsi2ss        %edi, %xmm1
        pshufd  $0, %xmm1, %xmm0

the above also shows that vector init by BIT_FIELD_REF is not expanded
very well (something for a generalized vector shuffle recognition in the
bswap pass).
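
For contrast with the mixed example in comment #1, the all-identical case can be written out as follows. The reporter's original testcase is not part of this thread, so `g_elementwise` is an assumed shape inferred from the GIMPLE in comment #12 (one `x + 1`, one int-to-float conversion, four lane writes); `g_splat` and `check_same_lanes` are hypothetical helpers added only to show the hand-written equivalent that the "improvable to" sequence corresponds to.

```c
#include <assert.h>

typedef float float4 __attribute__((vector_size(16)));

/* Element-wise initialization in which every lane receives the same
   scalar (assumed shape of the problematic case).  */
float4
g_elementwise (int x)
{
  float4 W;
  W[0] = W[1] = W[2] = W[3] = x + 1;
  return W;
}

/* Hand-written equivalent: add and convert once, then broadcast.  */
float4
g_splat (int x)
{
  float s = (float) (x + 1);
  float4 W = { s, s, s, s };
  return W;
}

/* Abort unless A and B agree in every lane.  */
void
check_same_lanes (float4 a, float4 b)
{
  for (int i = 0; i < 4; i++)
    assert (a[i] == b[i]);
}
```

Both functions return the same vector; the point of the bug is that the compiler should reach the second form on its own instead of vectorizing the addition and conversion or inserting lane by lane.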
[Bug tree-optimization/58497] SLP vectorizes identical operations
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58497

--- Comment #2 from Richard Biener rguenth at gcc dot gnu.org ---
Created attachment 30884
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=30884&action=edit
prototype patch

A quick check shows generated code will be

g:
.LFB0:
        .cfi_startproc
        xorps   %xmm0, %xmm0
        addl    $1, %edi
        cvtsi2ss        %edi, %xmm0
        shufps  $0, %xmm0, %xmm0
        ret

and the patch shows possible issues with finding an insert location for
the init stmt (otherwise external is just outside of the current
basic-block).
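
The four-instruction sequence in comment #2 can be mirrored by hand with SSE intrinsics from `<xmmintrin.h>`. This is only an illustrative sketch for x86 targets, not what the prototype patch does internally; `g_intrin` and `check_all_lanes` are names invented here, and the per-line instruction comments describe the intended correspondence rather than guaranteed compiler output.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE intrinsics, x86 only */

/* Convert x + 1 to float once and broadcast it to all four lanes.  */
__m128
g_intrin (int x)
{
  __m128 v = _mm_setzero_ps ();       /* xorps    %xmm0, %xmm0       */
  v = _mm_cvtsi32_ss (v, x + 1);      /* addl $1 + cvtsi2ss          */
  return _mm_shuffle_ps (v, v, 0);    /* shufps   $0, %xmm0, %xmm0   */
}

/* Abort unless every lane of V equals EXPECTED.  */
void
check_all_lanes (__m128 v, float expected)
{
  float out[4];
  _mm_storeu_ps (out, v);
  for (int i = 0; i < 4; i++)
    assert (out[i] == expected);
}
```

The shuffle immediate 0 selects lane 0 for every result lane, which is why a single conversion is enough.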