[Bug middle-end/29756] SSE intrinsics hard to use without redundant temporaries appearing

2016-05-20 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29756

--- Comment #15 from Jakub Jelinek  ---
Author: jakub
Date: Fri May 20 11:55:58 2016
New Revision: 236505

URL: https://gcc.gnu.org/viewcvs?rev=236505&root=gcc&view=rev
Log:
PR tree-optimization/29756
* gcc.dg/tree-ssa/vector-6.c: Add -Wno-psabi -w to dg-options.
Add -msse2 for x86 and -maltivec for powerpc.  Use scan-tree-dump-times
only on selected targets where V4SImode vectors are known to be
supported.

Modified:
trunk/gcc/testsuite/ChangeLog
trunk/gcc/testsuite/gcc.dg/tree-ssa/vector-6.c

[Bug middle-end/29756] SSE intrinsics hard to use without redundant temporaries appearing

2016-05-20 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29756

--- Comment #14 from Richard Biener  ---
Author: rguenth
Date: Fri May 20 09:17:16 2016
New Revision: 236501

URL: https://gcc.gnu.org/viewcvs?rev=236501&root=gcc&view=rev
Log:
2016-05-20  Richard Guenther  

PR tree-optimization/29756
* tree.def (BIT_INSERT_EXPR): New tcc_expression tree code.
* expr.c (expand_expr_real_2): Handle BIT_INSERT_EXPR.
* fold-const.c (operand_equal_p): Likewise.
(fold_ternary_loc): Add constant folding of BIT_INSERT_EXPR.
* gimplify.c (gimplify_expr): Handle BIT_INSERT_EXPR.
* tree-inline.c (estimate_operator_cost): Likewise.
* tree-pretty-print.c (dump_generic_node): Likewise.
* tree-ssa-operands.c (get_expr_operands): Likewise.
* cfgexpand.c (expand_debug_expr): Likewise.
* gimple-pretty-print.c (dump_ternary_rhs): Likewise.
* gimple.c (get_gimple_rhs_num_ops): Handle BIT_INSERT_EXPR.
* tree-cfg.c (verify_gimple_assign_ternary): Verify BIT_INSERT_EXPR.

* tree-ssa.c (non_rewritable_lvalue_p): We can rewrite
vector inserts using BIT_FIELD_REF or MEM_REF on the lhs.
(execute_update_addresses_taken): Do it.

* gcc.dg/tree-ssa/vector-6.c: New testcase.

Added:
trunk/gcc/testsuite/gcc.dg/tree-ssa/vector-6.c
Modified:
trunk/gcc/ChangeLog
trunk/gcc/cfgexpand.c
trunk/gcc/expr.c
trunk/gcc/fold-const.c
trunk/gcc/gimple-pretty-print.c
trunk/gcc/gimple.c
trunk/gcc/gimplify.c
trunk/gcc/testsuite/ChangeLog
trunk/gcc/tree-cfg.c
trunk/gcc/tree-inline.c
trunk/gcc/tree-pretty-print.c
trunk/gcc/tree-ssa-operands.c
trunk/gcc/tree-ssa.c
trunk/gcc/tree.def

[Bug middle-end/29756] SSE intrinsics hard to use without redundant temporaries appearing

2016-05-19 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29756

--- Comment #13 from rguenther at suse dot de  ---
On Thu, 19 May 2016, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29756
> 
> --- Comment #12 from Jakub Jelinek  ---
> (In reply to Richard Biener from comment #11)
> 
> > Index: gcc/config/i386/i386.c
> > ===
> > --- gcc/config/i386/i386.c  (revision 236441)
> > +++ gcc/config/i386/i386.c  (working copy)
> ...
> > given the plethora of shuffling intrinsics this might be quite tedious
> > work...
> 
> The builtins aren't guaranteed to be usable directly, only the intrinsics are,
> so if we want to do the above, we should just kill those builtins instead and
> use __builtin_shuffle directly in the headers (plus of course each time verify
> that we get the corresponding or better insn sequence).

Yes, but that will result in something like

extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__,
__artificial__))
_mm_shuffle_ps (__m128 __A, __m128 __B, int const __mask)
{
  return (__m128) __builtin_shuffle ((__v4sf)__A, (__v4sf)__B,
    (__v4si) { __mask & 3, (__mask >> 2) & 3,
               ((__mask >> 4) & 3) + 4, ((__mask >> 6) & 3) + 4 });
}

(not sure if we still need the !__OPTIMIZE__ path or what we should do for
that in general in the above context - once  !__OPTIMIZE__ would no
longer constant-fold or so)

But if this would be the preferred way of addressing this, that's clearly
better than "folding" the stuff back.

[Bug middle-end/29756] SSE intrinsics hard to use without redundant temporaries appearing

2016-05-19 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29756

--- Comment #12 from Jakub Jelinek  ---
(In reply to Richard Biener from comment #11)

> Index: gcc/config/i386/i386.c
> ===
> --- gcc/config/i386/i386.c  (revision 236441)
> +++ gcc/config/i386/i386.c  (working copy)
...
> given the plethora of shuffling intrinsics this might be quite tedious
> work...

The builtins aren't guaranteed to be usable directly, only the intrinsics are,
so if we want to do the above, we should just kill those builtins instead and
use __builtin_shuffle directly in the headers (plus of course each time verify
that we get the corresponding or better insn sequence).

[Bug middle-end/29756] SSE intrinsics hard to use without redundant temporaries appearing

2016-05-19 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29756

--- Comment #11 from Richard Biener  ---
Like

Index: gcc/config/i386/i386.c
===
--- gcc/config/i386/i386.c  (revision 236441)
+++ gcc/config/i386/i386.c  (working copy)
@@ -37745,6 +37745,23 @@ ix86_fold_builtin (tree fndecl, int n_ar
  gcc_assert (n_args == 1);
   return fold_builtin_cpu (fndecl, args);
}
+  if (fn_code == IX86_BUILTIN_SHUFPS
+      && n_args == 3
+      && TREE_CODE (args[2]) == INTEGER_CST)
+    {
+      tree mask[4];
+      tree mtype = build_vector_type (integer_type_node, 4);
+      mask[0] = build_int_cst (integer_type_node,
+                               TREE_INT_CST_LOW (args[2]) & 3);
+      mask[1] = build_int_cst (integer_type_node,
+                               (TREE_INT_CST_LOW (args[2]) >> 2) & 3);
+      mask[2] = build_int_cst (integer_type_node,
+                               ((TREE_INT_CST_LOW (args[2]) >> 4) & 3) + 4);
+      mask[3] = build_int_cst (integer_type_node,
+                               ((TREE_INT_CST_LOW (args[2]) >> 6) & 3) + 4);
+      return fold_build3 (VEC_PERM_EXPR, TREE_TYPE (TREE_TYPE (fndecl)),
+                          args[0], args[1], build_vector (mtype, mask));
+    }
 }

 #ifdef SUBTARGET_FOLD_BUILTIN


given the plethora of shuffling intrinsics this might be quite tedious work...
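
The mask arithmetic in the patch above can be checked outside the compiler. The following is only a sketch in plain C++, not GCC code: `V4`, `Mask4`, `shufps_to_perm_mask` and `vec_perm` are hypothetical names, and `vec_perm` merely models a two-input permute in which indices 0..3 pick lanes of the first operand and 4..7 pick lanes of the second.

```cpp
#include <array>

using V4 = std::array<float, 4>;   // hypothetical stand-in for a 4 x float vector
using Mask4 = std::array<int, 4>;  // hypothetical stand-in for the permutation mask

// Decode an 8-bit shufps immediate into a two-input permutation mask,
// using the same arithmetic as the patch: the two low result lanes come
// from the first operand (indices 0..3), the two high result lanes from
// the second operand (indices 4..7).
inline Mask4 shufps_to_perm_mask(unsigned imm)
{
  return Mask4{ static_cast<int>(imm & 3),
                static_cast<int>((imm >> 2) & 3),
                static_cast<int>(((imm >> 4) & 3) + 4),
                static_cast<int>(((imm >> 6) & 3) + 4) };
}

// Model of a two-input permute with a constant mask.
inline V4 vec_perm(const V4& a, const V4& b, const Mask4& m)
{
  V4 r{};
  for (int i = 0; i < 4; ++i)
    r[i] = m[i] < 4 ? a[m[i]] : b[m[i] - 4];
  return r;
}
```

For instance, the immediate 0x1B decodes to the mask {3, 2, 5, 4}, so the result takes lanes 3 and 2 of the first operand followed by lanes 1 and 0 of the second.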

[Bug middle-end/29756] SSE intrinsics hard to use without redundant temporaries appearing

2016-05-19 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29756

Uroš Bizjak  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jakub at gcc dot gnu.org

--- Comment #10 from Uroš Bizjak  ---
(In reply to Richard Biener from comment #9)
> Uros, see comment#8 - would that be acceptable?  The other alternative is to
> try using __builtin_shuffle[2] in the intrinsic headers but that might be
> somewhat difficult.

I have added Jakub to CC, he is the expert in various permutation approaches
for x86 target.

[Bug middle-end/29756] SSE intrinsics hard to use without redundant temporaries appearing

2016-05-19 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29756

Richard Biener  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Target|                            |x86_64-*-*, i?86-*-*
                 CC|                            |uros at gcc dot gnu.org

--- Comment #9 from Richard Biener  ---
Uros, see comment#8 - would that be acceptable?  The other alternative is to
try using __builtin_shuffle[2] in the intrinsic headers but that might be
somewhat difficult.

[Bug middle-end/29756] SSE intrinsics hard to use without redundant temporaries appearing

2016-05-19 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29756

--- Comment #8 from Richard Biener  ---
So the remaining piece may be that of the init-regs issue.  We have

  vf_24 = BIT_INSERT_EXPR ;

which leaves the upper elements undefined, but init-regs forces them to zero.
Another issue is that in

  _26 = BIT_FIELD_REF ;
  vf_24 = BIT_INSERT_EXPR ;
  _25 = __builtin_ia32_shufps (vf_24, vf_24, 0);

the shufps is not exposed to gimple optimizations and thus we can't simplify
it in any way.  Only the backend knows that it could be simplified to

  _25 = __builtin_ia32_shufps (vf_13(D), vf_13(D), 85);

so the backend might want to "expand" __builtin_ia32_shufps to a VEC_PERM_EXPR
in its target specific builtin folding hook (making sure the reverse works
well enough obviously).
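
Why the two shufps forms are equivalent can be illustrated with a small model. This is a sketch under assumptions rather than anything from the GCC sources: the operands of the GIMPLE statements are not shown above, so it assumes the extracted element is lane 1, which is what the immediate 85 (0b01010101) implies. `V4`, `shufps_model`, `insert_lane`, `broadcast_via_insert` and `broadcast_direct` are hypothetical names.

```cpp
#include <array>

using V4 = std::array<float, 4>;  // hypothetical stand-in for a 4 x float vector

// Model of shufps: result lanes 0 and 1 come from a, lanes 2 and 3 from b,
// each selected by a 2-bit field of the immediate.
inline V4 shufps_model(const V4& a, const V4& b, unsigned imm)
{
  return V4{ a[imm & 3], a[(imm >> 2) & 3],
             b[(imm >> 4) & 3], b[(imm >> 6) & 3] };
}

// Model of inserting a scalar into one lane of a vector; the other lanes
// keep whatever the input vector held.
inline V4 insert_lane(V4 v, float x, int lane)
{
  v[lane] = x;
  return v;
}

// Extract lane 1 (assumed), insert it into lane 0 of an otherwise
// unspecified vector, then broadcast lane 0.
inline V4 broadcast_via_insert(const V4& v, const V4& unspecified)
{
  V4 t = insert_lane(unspecified, v[1], 0);
  return shufps_model(t, t, 0);
}

// Broadcast lane 1 directly from the original vector.
inline V4 broadcast_direct(const V4& v)
{
  return shufps_model(v, v, 85);
}
```

Both functions return the same vector whatever the unspecified lanes contain, because immediate 0 reads only lane 0 of the temporary. That is also why zeroing the upper lanes contributes nothing to the result.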

[Bug middle-end/29756] SSE intrinsics hard to use without redundant temporaries appearing

2016-05-12 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29756

--- Comment #7 from Richard Biener  ---
So I have it down to an x86 combine issue:

;; v_28 = BIT_FIELD_INSERT ;

(insn 7 6 8 (set (reg:SF 116)
(vec_select:SF (reg/v:V4SF 115 [ v ])
(parallel [
(const_int 0 [0])
]))) t.c:5 -1
 (nil))

(insn 8 7 9 (set (reg:V4SF 117)
(reg/v:V4SF 109 [ v ])) t.c:11 -1
 (nil))

(insn 9 8 10 (set (reg:V4SF 117)
(vec_merge:V4SF (vec_duplicate:V4SF (reg:SF 116))
(reg:V4SF 117)
(const_int 1 [0x1]))) t.c:11 -1
 (nil))

(insn 10 9 0 (set (reg/v:V4SF 110 [ v ])
(reg:V4SF 117)) t.c:11 -1
 (nil))

that's from what vec_set_optab produces

;; _29 = __builtin_ia32_shufps (v_28, v_28, 0);

(insn 11 10 12 (set (reg:V4SF 119)
(reg/v:V4SF 110 [ v ])) t.c:12 -1
 (nil))

(insn 12 11 13 (set (reg:V4SF 120)
(reg/v:V4SF 110 [ v ])) t.c:12 -1
 (nil))

(insn 13 12 14 (set (reg:V4SF 118)
(vec_select:V4SF (vec_concat:V8SF (reg:V4SF 119)
(reg:V4SF 120))
(parallel [
(const_int 0 [0])
(const_int 0 [0])
(const_int 4 [0x4])
(const_int 4 [0x4])
]))) t.c:12 -1
 (nil))

(insn 14 13 0 (set (reg:V4SF 111 [ _29 ])
(reg:V4SF 118)) t.c:12 -1
 (nil))

and that's the shuffle.  And after combine we have

(insn 7 4 53 2 (set (reg:SF 116)
(vec_select:SF (reg/v:V4SF 115 [ v ])
(parallel [
(const_int 0 [0])
]))) t.c:5 2423 {*vec_extractv4sf_0}
 (nil))
(insn 9 53 13 2 (set (reg:V4SF 117 [ v ])
(vec_merge:V4SF (vec_duplicate:V4SF (reg:SF 116))
(const_vector:V4SF [
(const_double:SF 0.0 [0x0.0p+0])
(const_double:SF 0.0 [0x0.0p+0])
(const_double:SF 0.0 [0x0.0p+0])
(const_double:SF 0.0 [0x0.0p+0])
])
(const_int 1 [0x1]))) t.c:11 2420 {vec_setv4sf_0}
 (expr_list:REG_DEAD (reg:SF 116)
(nil)))
(insn 13 9 15 2 (set (reg:V4SF 118)
(vec_select:V4SF (vec_concat:V8SF (reg:V4SF 117 [ v ])
(reg:V4SF 117 [ v ]))
(parallel [
(const_int 0 [0])
(const_int 0 [0])
(const_int 4 [0x4])
(const_int 4 [0x4])
]))) t.c:12 2405 {sse_shufps_v4sf}
 (expr_list:REG_DEAD (reg:V4SF 117 [ v ])
(nil)))

which combine doesn't manage to get down to

(insn 9 4 13 2 (set (reg:V4SF 104)
(vec_select:V4SF (vec_concat:V8SF (reg/v:V4SF 103 [ v ])
(reg/v:V4SF 103 [ v ]))
(parallel [
(const_int 0 [0])
(const_int 0 [0])
(const_int 4 [0x4])
(const_int 4 [0x4])
]))) t.c:18 2405 {sse_shufps_v4sf}
 (nil))



The testcase was the following.

#include <xmmintrin.h>

template <int N> inline float component(__v4sf v)
{
  return (reinterpret_cast<const float*>(&v))[N];
}

inline __v4sf fill(float f)
{
  __v4sf v;
  *(reinterpret_cast<float*>(&v))=f;
  return ((__m128) __builtin_ia32_shufps ((__v4sf)(v), (__v4sf)(v), 0));
}

template <int N> inline __v4sf component_fill(__v4sf v)
{
  return ((__m128) __builtin_ia32_shufps ((__v4sf)(v), (__v4sf)(v),
  (((N) << 6) | ((N) << 4) | ((N) << 2) | (N))));
}

__v4sf transform_bad(__v4sf m[4],__v4sf v)
{
  return m[0]*fill(component<0>(v))
  +m[1]*fill(component<1>(v))
  +m[2]*fill(component<2>(v))
  +m[3]*fill(component<3>(v));
}

__v4sf transform_good(__v4sf m[4],__v4sf v)
{
  return m[0]*component_fill<0>(v)
  +m[1]*component_fill<1>(v)
  +m[2]*component_fill<2>(v)
  +m[3]*component_fill<3>(v);
}
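
The immediate that component_fill passes to shufps repeats the lane index N in all four 2-bit fields. A minimal sketch of that arithmetic follows; `broadcast_imm` is a hypothetical helper name and does not appear in the testcase.

```cpp
// Build the shufps immediate that selects lane N for every result lane:
// the 2-bit lane index is placed in bits 0-1, 2-3, 4-5 and 6-7.
template <int N>
constexpr unsigned broadcast_imm()
{
  static_assert(N >= 0 && N < 4, "lane index must be in 0..3");
  return (N << 6) | (N << 4) | (N << 2) | N;
}
```

Lanes 0 to 3 give the immediates 0, 85, 170 and 255 respectively.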

[Bug middle-end/29756] SSE intrinsics hard to use without redundant temporaries appearing

2016-05-10 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29756

Richard Biener  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |ASSIGNED
   Last reconfirmed|                            |2016-05-10
           Assignee|unassigned at gcc dot gnu.org  |rguenth at gcc dot gnu.org
     Ever confirmed|0                           |1

--- Comment #6 from Richard Biener  ---
So what is missing here is avoiding 'v' for

  _26 = BIT_FIELD_REF ;
  BIT_FIELD_REF  = _26;
  v.1_24 = v;
  _25 = __builtin_ia32_shufps (v.1_24, v.1_24, 0);
  v ={v} {CLOBBER};

which can be done with a new BIT_FIELD_EXPR like so:

  v_24 = BIT_FIELD_EXPR ;

[Bug middle-end/29756] SSE intrinsics hard to use without redundant temporaries appearing

2016-05-10 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29756
Bug 29756 depends on bug 28367, which changed state.

Bug 28367 Summary: accessing via union on a vector does not cause vec_extract to be used
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=28367

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED

[Bug middle-end/29756] SSE intrinsics hard to use without redundant temporaries appearing

2006-11-13 Thread pinskia at gcc dot gnu dot org


--- Comment #5 from pinskia at gcc dot gnu dot org  2006-11-14 01:15 ---
This is mostly PR 28367.  There are most likely other issues like some of the
SSE intrinsics not being declared as pure/const.


-- 

pinskia at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  BugsThisDependsOn|                            |28367


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29756



[Bug middle-end/29756] SSE intrinsics hard to use without redundant temporaries appearing

2006-11-08 Thread timday at bottlenose dot demon dot co dot uk


--- Comment #3 from timday at bottlenose dot demon dot co dot uk  2006-11-08 10:01 ---
I've just tried an alternative version (will upload later) replacing the union
with a single
  __v4sf _rep,
and implementing the [] operators using e.g
  (reinterpret_cast<const float*>(&_rep))[i];
However the code generated by the two transform implementations remains the
same (20 and 32 instructions anyway; haven't checked the details yet).
Maybe not surprising as it's just moving the problem around.

The big difference between the two methods is perhaps primarily that the bad
one involves a __v4sf->float->__v4sf conversion, while the good one uses __v4sf
throughout by using the mul_compN methods.  I'll try and prepare a more concise
test case based on the premise that bad handling of __v4sf <-> float is the
real issue.


-- 

timday at bottlenose dot demon dot co dot uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |timday at bottlenose dot
                   |                            |demon dot co dot uk


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29756



[Bug middle-end/29756] SSE intrinsics hard to use without redundant temporaries appearing

2006-11-08 Thread timday at bottlenose dot demon dot co dot uk


--- Comment #4 from timday at bottlenose dot demon dot co dot uk  2006-11-08 22:18 ---
Created an attachment (id=12573)
 -- (http://gcc.gnu.org/bugzilla/attachment.cgi?id=12573&action=view)
More concise demonstration of the v4sf->float->v4sf issue.

The attached code, (no classes or unions, just a few inline functions) obtained
from
  gcc -v -save-temps -S -O3 -march=pentium3 -mfpmath=sse -msse -fomit-frame-pointer v4sf.cpp
compiles transform_good to 18 instructions and transform_bad to 33.  However
it's not really surprising a round-trip through stack temporaries is required
when pointer arithmetic is being used to extract a float from a __v4sf.  I've
no idea whether it's realistic to hope this could ever be optimised away. 
Alternatively, it would be very nice if the builtin vector types simply
provided a [] operator, or if there were some intrinsics for extracting floats
from a __v4sf.
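
For reference, later GCC releases than the one discussed in this comment do accept subscripting on vector types declared with the vector_size attribute, which is the [] operator wished for here. The sketch below assumes such a compiler (GCC, or a compatible one such as Clang); `v4sf_t`, `fill` and `transform` are hypothetical names chosen for illustration, and nothing here is claimed about the code generated for it.

```cpp
// Assumption: the compiler supports the GNU vector_size extension and
// subscripting on such types.
typedef float v4sf_t __attribute__((vector_size(16)));  // analogous to __v4sf

// Broadcast one scalar to all four lanes.
inline v4sf_t fill(float f)
{
  v4sf_t r = { f, f, f, f };
  return r;
}

// Same shape as transform_bad, but reading components with v[i]
// instead of pointer casts.
inline v4sf_t transform(const v4sf_t m[4], v4sf_t v)
{
  return m[0] * fill(v[0])
       + m[1] * fill(v[1])
       + m[2] * fill(v[2])
       + m[3] * fill(v[3]);
}
```

With the identity matrix as m, transform returns v unchanged, which makes a convenient sanity check.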

(In the meantime, in the original vector4f class, remaining in the __v4sf
domain by having the const operator[] return a suitably type-wrapped __v4sf
filled with the specified component seems to be a promising direction).


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29756



[Bug middle-end/29756] SSE intrinsics hard to use without redundant temporaries appearing

2006-11-07 Thread pinskia at gcc dot gnu dot org


--- Comment #2 from pinskia at gcc dot gnu dot org  2006-11-07 22:31 ---
Looks like this is mostly caused by:
  union
  {
    __v4sf vecf;
    __m128 rawf;
    float val[4];
  } _rep;

I will have a look more at this issue later tonight when I get home from work.


-- 

pinskia at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|minor                       |enhancement
          Component|target                      |middle-end
           Keywords|                            |missed-optimization


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29756