[Bug middle-end/29756] SSE intrinsics hard to use without redundant temporaries appearing
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29756

--- Comment #15 from Jakub Jelinek ---
Author: jakub
Date: Fri May 20 11:55:58 2016
New Revision: 236505

URL: https://gcc.gnu.org/viewcvs?rev=236505&root=gcc&view=rev
Log:
	PR tree-optimization/29756
	gcc.dg/tree-ssa/vector-6.c: Add -Wno-psabi -w to dg-options.
	Add -msse2 for x86 and -maltivec for powerpc.  Use
	scan-tree-dump-times only on selected targets where V4SImode
	vectors are known to be supported.

Modified:
	trunk/gcc/testsuite/ChangeLog
	trunk/gcc/testsuite/gcc.dg/tree-ssa/vector-6.c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29756

--- Comment #14 from Richard Biener ---
Author: rguenth
Date: Fri May 20 09:17:16 2016
New Revision: 236501

URL: https://gcc.gnu.org/viewcvs?rev=236501&root=gcc&view=rev
Log:
2016-05-20  Richard Guenther

	PR tree-optimization/29756
	* tree.def (BIT_INSERT_EXPR): New tcc_expression tree code.
	* expr.c (expand_expr_real_2): Handle BIT_INSERT_EXPR.
	* fold-const.c (operand_equal_p): Likewise.
	(fold_ternary_loc): Add constant folding of BIT_INSERT_EXPR.
	* gimplify.c (gimplify_expr): Handle BIT_INSERT_EXPR.
	* tree-inline.c (estimate_operator_cost): Likewise.
	* tree-pretty-print.c (dump_generic_node): Likewise.
	* tree-ssa-operands.c (get_expr_operands): Likewise.
	* cfgexpand.c (expand_debug_expr): Likewise.
	* gimple-pretty-print.c (dump_ternary_rhs): Likewise.
	* gimple.c (get_gimple_rhs_num_ops): Handle BIT_INSERT_EXPR.
	* tree-cfg.c (verify_gimple_assign_ternary): Verify BIT_INSERT_EXPR.
	* tree-ssa.c (non_rewritable_lvalue_p): We can rewrite
	vector inserts using BIT_FIELD_REF or MEM_REF on the lhs.
	(execute_update_addresses_taken): Do it.

	* gcc.dg/tree-ssa/vector-6.c: New testcase.

Added:
	trunk/gcc/testsuite/gcc.dg/tree-ssa/vector-6.c
Modified:
	trunk/gcc/ChangeLog
	trunk/gcc/cfgexpand.c
	trunk/gcc/expr.c
	trunk/gcc/fold-const.c
	trunk/gcc/gimple-pretty-print.c
	trunk/gcc/gimple.c
	trunk/gcc/gimplify.c
	trunk/gcc/testsuite/ChangeLog
	trunk/gcc/tree-cfg.c
	trunk/gcc/tree-inline.c
	trunk/gcc/tree-pretty-print.c
	trunk/gcc/tree-ssa-operands.c
	trunk/gcc/tree-ssa.c
	trunk/gcc/tree.def
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29756

--- Comment #13 from rguenther at suse dot de ---
On Thu, 19 May 2016, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29756
>
> --- Comment #12 from Jakub Jelinek ---
> (In reply to Richard Biener from comment #11)
> > Index: gcc/config/i386/i386.c
> > ===
> > --- gcc/config/i386/i386.c (revision 236441)
> > +++ gcc/config/i386/i386.c (working copy)
> ...
> > given the plethora of shuffling intrinsics this might be quite tedious
> > work...
>
> The builtins aren't guaranteed to be usable directly, only the intrinsics
> are, so if we want to do the above, we should just kill those builtins
> instead and use __builtin_shuffle directly in the headers (plus of course
> each time verify that we get the corresponding or better insn sequence).

Yes, but that will result in sth like

extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__,
__artificial__))
_mm_shuffle_ps (__m128 __A, __m128 __B, int const __mask)
{
  return (__m128) __builtin_shuffle2 ((__v4sf)__A, (__v4sf)__B,
				      (__v4si) { __mask & 3,
						 (__mask >> 2) & 3,
						 ((__mask >> 4) & 3) + 4,
						 ((__mask >> 6) & 3) + 4 });
}

(not sure if we still need the !__OPTIMIZE__ path or what we should do for
that in general in the above context - once !__OPTIMIZE__ would no longer
constant-fold or so)

But if this would be the preferred way of addressing this that's clearly
better than "folding" the stuff back.
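A header along the lines sketched above can be tried today with GCC's generic vector extension and the two-operand form of `__builtin_shuffle`. The following is a minimal sketch under those assumptions; the names `v4sf`, `v4si` and `shuffle_ps` are stand-ins of mine, not the real intrinsic header, and `__builtin_shuffle` is a GCC-specific extension:

```c
#include <assert.h>

typedef float v4sf __attribute__((vector_size(16)));
typedef int   v4si __attribute__((vector_size(16)));

/* Lanes 0-1 of the index vector select from a (indices 0-3),
   lanes 2-3 select from b (indices 4-7), two mask bits per lane,
   mirroring the SHUFPS immediate encoding. */
static v4sf shuffle_ps(v4sf a, v4sf b, int mask)
{
  v4si idx = { mask & 3, (mask >> 2) & 3,
               ((mask >> 4) & 3) + 4, ((mask >> 6) & 3) + 4 };
  return __builtin_shuffle(a, b, idx);
}
```

With mask 0 every two-bit field selects element 0, so the result is { a[0], a[0], b[0], b[0] }.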
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29756

--- Comment #12 from Jakub Jelinek ---
(In reply to Richard Biener from comment #11)
> Index: gcc/config/i386/i386.c
> ===
> --- gcc/config/i386/i386.c (revision 236441)
> +++ gcc/config/i386/i386.c (working copy)
...
> given the plethora of shuffling intrinsics this might be quite tedious
> work...

The builtins aren't guaranteed to be usable directly, only the intrinsics are,
so if we want to do the above, we should just kill those builtins instead and
use __builtin_shuffle directly in the headers (plus of course each time verify
that we get the corresponding or better insn sequence).
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29756

--- Comment #11 from Richard Biener ---
Like

Index: gcc/config/i386/i386.c
===
--- gcc/config/i386/i386.c      (revision 236441)
+++ gcc/config/i386/i386.c      (working copy)
@@ -37745,6 +37745,23 @@ ix86_fold_builtin (tree fndecl, int n_ar
       gcc_assert (n_args == 1);
       return fold_builtin_cpu (fndecl, args);
     }
+  if (fn_code == IX86_BUILTIN_SHUFPS
+      && n_args == 3
+      && TREE_CODE (args[2]) == INTEGER_CST)
+    {
+      tree mask[4];
+      tree mtype = build_vector_type (integer_type_node, 4);
+      mask[0] = build_int_cst (integer_type_node,
+                               TREE_INT_CST_LOW (args[2]) & 3);
+      mask[1] = build_int_cst (integer_type_node,
+                               (TREE_INT_CST_LOW (args[2]) >> 2) & 3);
+      mask[2] = build_int_cst (integer_type_node,
+                               ((TREE_INT_CST_LOW (args[2]) >> 4) & 3) + 4);
+      mask[3] = build_int_cst (integer_type_node,
+                               ((TREE_INT_CST_LOW (args[2]) >> 6) & 3) + 4);
+      return fold_build3 (VEC_PERM_EXPR, TREE_TYPE (TREE_TYPE (fndecl)),
+                          args[0], args[1], build_vector (mtype, mask));
+    }
 }

 #ifdef SUBTARGET_FOLD_BUILTIN

given the plethora of shuffling intrinsics this might be quite tedious
work...
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29756

Uroš Bizjak changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jakub at gcc dot gnu.org

--- Comment #10 from Uroš Bizjak ---
(In reply to Richard Biener from comment #9)
> Uros, see comment#8 - would that be acceptable? The other alternative is to
> try using __builtin_shuffle[2] in the intrinsic headers but that might be
> somewhat difficult.

I have added Jakub to CC; he is the expert in various permutation approaches
for the x86 target.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29756

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Target|                            |x86_64-*-*, i?86-*-*
                 CC|                            |uros at gcc dot gnu.org

--- Comment #9 from Richard Biener ---
Uros, see comment#8 - would that be acceptable? The other alternative is to
try using __builtin_shuffle[2] in the intrinsic headers but that might be
somewhat difficult.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29756

--- Comment #8 from Richard Biener ---
So the remaining piece may be that of the init-regs issue.  We have

  vf_24 = BIT_INSERT_EXPR;

which leaves the upper elements undefined, but init-regs forces them to zero.

Another issue is that in

  _26 = BIT_FIELD_REF ;
  vf_24 = BIT_INSERT_EXPR ;
  _25 = __builtin_ia32_shufps (vf_24, vf_24, 0);

the shufps is not exposed to gimple optimizations and thus we can't simplify
it in any way.  Only the backend knows that it could be simplified to

  _25 = __builtin_ia32_shufps (vf_13(D), vf_13(D), 85);

so the backend might want to "expand" __builtin_ia32_shufps to a VEC_PERM_EXPR
in its target specific builtin folding hook (making sure the reverse works
well enough obviously).
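The simplification described here can be checked with a scalar model of the SHUFPS selection (shufps_model is a hypothetical helper written only to illustrate the mask arithmetic): inserting element 1 into lane 0 of a scratch vector and broadcasting lane 0 with immediate 0 yields the same result as applying immediate 85 (0b01010101, lane 1 selected four times) to the original vector directly.

```c
#include <assert.h>

/* Scalar model of SHUFPS: result lanes 0-1 come from the first operand,
   lanes 2-3 from the second, two immediate bits per lane. */
static void shufps_model(const float *a, const float *b, int imm, float *r)
{
  r[0] = a[imm & 3];
  r[1] = a[(imm >> 2) & 3];
  r[2] = b[(imm >> 4) & 3];
  r[3] = b[(imm >> 6) & 3];
}
```

With v = {v0, v1, v2, v3}, shufps_model(t, t, 0, ...) on t = {v1, ?, ?, ?} and shufps_model(v, v, 85, ...) both produce {v1, v1, v1, v1}, which is the equivalence only the backend currently sees.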
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29756

--- Comment #7 from Richard Biener ---
So I have it down to a x86 combine issue:

;; v_28 = BIT_FIELD_INSERT;

(insn 7 6 8 (set (reg:SF 116)
        (vec_select:SF (reg/v:V4SF 115 [ v ])
            (parallel [
                    (const_int 0 [0])
                ]))) t.c:5 -1
     (nil))
(insn 8 7 9 (set (reg:V4SF 117)
        (reg/v:V4SF 109 [ v ])) t.c:11 -1
     (nil))
(insn 9 8 10 (set (reg:V4SF 117)
        (vec_merge:V4SF (vec_duplicate:V4SF (reg:SF 116))
            (reg:V4SF 117)
            (const_int 1 [0x1]))) t.c:11 -1
     (nil))
(insn 10 9 0 (set (reg/v:V4SF 110 [ v ])
        (reg:V4SF 117)) t.c:11 -1
     (nil))

that's from what vec_set_optab produces

;; _29 = __builtin_ia32_shufps (v_28, v_28, 0);

(insn 11 10 12 (set (reg:V4SF 119)
        (reg/v:V4SF 110 [ v ])) t.c:12 -1
     (nil))
(insn 12 11 13 (set (reg:V4SF 120)
        (reg/v:V4SF 110 [ v ])) t.c:12 -1
     (nil))
(insn 13 12 14 (set (reg:V4SF 118)
        (vec_select:V4SF (vec_concat:V8SF (reg:V4SF 119)
                (reg:V4SF 120))
            (parallel [
                    (const_int 0 [0])
                    (const_int 0 [0])
                    (const_int 4 [0x4])
                    (const_int 4 [0x4])
                ]))) t.c:12 -1
     (nil))
(insn 14 13 0 (set (reg:V4SF 111 [ _29 ])
        (reg:V4SF 118)) t.c:12 -1
     (nil))

and that's the shuffle.
And after combine we have

(insn 7 4 53 2 (set (reg:SF 116)
        (vec_select:SF (reg/v:V4SF 115 [ v ])
            (parallel [
                    (const_int 0 [0])
                ]))) t.c:5 2423 {*vec_extractv4sf_0}
     (nil))
(insn 9 53 13 2 (set (reg:V4SF 117 [ v ])
        (vec_merge:V4SF (vec_duplicate:V4SF (reg:SF 116))
            (const_vector:V4SF [
                    (const_double:SF 0.0 [0x0.0p+0])
                    (const_double:SF 0.0 [0x0.0p+0])
                    (const_double:SF 0.0 [0x0.0p+0])
                    (const_double:SF 0.0 [0x0.0p+0])
                ])
            (const_int 1 [0x1]))) t.c:11 2420 {vec_setv4sf_0}
     (expr_list:REG_DEAD (reg:SF 116)
        (nil)))
(insn 13 9 15 2 (set (reg:V4SF 118)
        (vec_select:V4SF (vec_concat:V8SF (reg:V4SF 117 [ v ])
                (reg:V4SF 117 [ v ]))
            (parallel [
                    (const_int 0 [0])
                    (const_int 0 [0])
                    (const_int 4 [0x4])
                    (const_int 4 [0x4])
                ]))) t.c:12 2405 {sse_shufps_v4sf}
     (expr_list:REG_DEAD (reg:V4SF 117 [ v ])
        (nil)))

which combine doesn't manage to get down to

(insn 9 4 13 2 (set (reg:V4SF 104)
        (vec_select:V4SF (vec_concat:V8SF (reg/v:V4SF 103 [ v ])
                (reg/v:V4SF 103 [ v ]))
            (parallel [
                    (const_int 0 [0])
                    (const_int 0 [0])
                    (const_int 4 [0x4])
                    (const_int 4 [0x4])
                ]))) t.c:18 2405 {sse_shufps_v4sf}
     (nil))

The testcase was the following.

#include <xmmintrin.h>

template <int N> inline float component(__v4sf v)
{
  return (reinterpret_cast<const float *>(&v))[N];
}

inline __v4sf fill(float f)
{
  __v4sf v;
  *(reinterpret_cast<float *>(&v)) = f;
  return ((__m128) __builtin_ia32_shufps ((__v4sf)(v), (__v4sf)(v), 0));
}

template <int N> inline __v4sf component_fill(__v4sf v)
{
  return ((__m128) __builtin_ia32_shufps ((__v4sf)(v), (__v4sf)(v),
					  ((N) << 6) | ((N) << 4)
					  | ((N) << 2) | (N)));
}

__v4sf transform_bad(__v4sf m[4], __v4sf v)
{
  return m[0]*fill(component<0>(v))
        +m[1]*fill(component<1>(v))
        +m[2]*fill(component<2>(v))
        +m[3]*fill(component<3>(v));
}

__v4sf transform_good(__v4sf m[4], __v4sf v)
{
  return m[0]*component_fill<0>(v)
        +m[1]*component_fill<1>(v)
        +m[2]*component_fill<2>(v)
        +m[3]*component_fill<3>(v);
}
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29756

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |ASSIGNED
   Last reconfirmed|                            |2016-05-10
           Assignee|unassigned at gcc dot gnu.org|rguenth at gcc dot gnu.org
     Ever confirmed|0                           |1

--- Comment #6 from Richard Biener ---
So what is missing here is avoiding 'v' for

  _26 = BIT_FIELD_REF;
  BIT_FIELD_REF = _26;
  v.1_24 = v;
  _25 = __builtin_ia32_shufps (v.1_24, v.1_24, 0);
  v ={v} {CLOBBER};

which can be done with a new BIT_FIELD_EXPR like so:

  v_24 = BIT_FIELD_EXPR ;
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29756

Bug 29756 depends on bug 28367, which changed state.

Bug 28367 Summary: accessing via union on a vector does not cause vec_extract
to be used
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=28367

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED
--- Comment #5 from pinskia at gcc dot gnu dot org  2006-11-14 01:15 ---
This is mostly PR 28367.  There are most likely other issues like some of the
SSE intrinsics not being declared as pure/const.

pinskia at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  BugsThisDependsOn|                            |28367

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29756
--- Comment #3 from timday at bottlenose dot demon dot co dot uk  2006-11-08 10:01 ---
I've just tried an alternative version (will upload later) replacing the union
with a single __v4sf _rep, and implementing the [] operators using e.g.

  (reinterpret_cast<const float*>(&_rep))[i];

However the code generated by the two transform implementations remains the
same (20 and 32 instructions anyway; haven't checked the details yet).  Maybe
not surprising as it's just moving the problem around.

The big difference between the two methods is perhaps primarily that the bad
one involves a __v4sf -> float -> __v4sf conversion, while the good one uses
__v4sf throughout by using the mul_compN methods.  I'll try and prepare a more
concise test case based on the premise that bad handling of __v4sf -> float is
the real issue.

timday at bottlenose dot demon dot co dot uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |timday at bottlenose dot
                   |                            |demon dot co dot uk

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29756
--- Comment #4 from timday at bottlenose dot demon dot co dot uk  2006-11-08 22:18 ---
Created an attachment (id=12573)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=12573&action=view)
More concise demonstration of the v4sf -> float -> v4sf issue.

The attached code (no classes or unions, just a few inline functions),
obtained from

  gcc -v -save-temps -S -O3 -march=pentium3 -mfpmath=sse -msse
    -fomit-frame-pointer v4sf.cpp

compiles transform_good to 18 instructions and transform_bad to 33.

However it's not really surprising a round-trip through stack temporaries is
required when pointer arithmetic is being used to extract a float from a
__v4sf.  I've no idea whether it's realistic to hope this could ever be
optimised away.  Alternatively, it would be very nice if the builtin vector
types simply provided a [] operator, or if there were some intrinsics for
extracting floats from a __v4sf.  (In the meantime, in the original vector4f
class, remaining in the __v4sf domain by having the const operator[] return a
suitably type-wrapped __v4sf filled with the specified component seems to be a
promising direction.)

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29756
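For reference, the [] operator wished for here did later arrive: GCC 4.6 and newer allow direct subscripting of generic vector values, so a component can be read without a union or a pointer cast. A minimal sketch under that assumption (v4sf and component are stand-in names, not the __v4sf typedef from the intrinsic headers):

```c
#include <assert.h>

typedef float v4sf __attribute__((vector_size(16)));

/* Direct element read; no round-trip through a stack temporary via a
   union or reinterpret_cast is written in the source. */
static float component(v4sf v, int n)
{
  return v[n];
}
```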
--- Comment #2 from pinskia at gcc dot gnu dot org  2006-11-07 22:31 ---
Looks like this is mostly caused by:

  union
  {
    __v4sf vecf;
    __m128 rawf;
    float val[4];
  } _rep;

I will have a look more at this issue later tonight when I get home from work.

pinskia at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|minor                       |enhancement
          Component|target                      |middle-end
           Keywords|                            |missed-optimization

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29756