[Bug middle-end/81502] In some cases the data is moved to memory unnecessarily [partial regression]

2017-09-13 Thread aldyh at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81502

--- Comment #8 from Aldy Hernandez  ---
Author: aldyh
Date: Wed Sep 13 15:55:56 2017
New Revision: 252135

URL: https://gcc.gnu.org/viewcvs?rev=252135&root=gcc&view=rev
Log:
2017-07-28  Richard Biener  

PR tree-optimization/81502
* match.pd: Add pattern combining BIT_INSERT_EXPR with
BIT_FIELD_REF.
* tree-cfg.c (verify_expr): Verify types of BIT_FIELD_REF
size/pos operands.
(verify_gimple_assign_ternary): Likewise for BIT_INSERT_EXPR pos.
* gimple-fold.c (maybe_canonicalize_mem_ref_addr): Use bitsizetype
for BIT_FIELD_REF args.
* fold-const.c (make_bit_field_ref): Likewise.
* tree-vect-stmts.c (vectorizable_simd_clone_call): Likewise.

* gcc.target/i386/pr81502.c: New testcase.

Added:
branches/range-gen2/gcc/testsuite/gcc.target/i386/pr81502.c
Modified:
branches/range-gen2/gcc/ChangeLog
branches/range-gen2/gcc/fold-const.c
branches/range-gen2/gcc/gimple-fold.c
branches/range-gen2/gcc/match.pd
branches/range-gen2/gcc/testsuite/ChangeLog
branches/range-gen2/gcc/tree-cfg.c
branches/range-gen2/gcc/tree-vect-stmts.c

[Bug middle-end/81502] In some cases the data is moved to memory unnecessarily [partial regression]

2017-09-13 Thread aldyh at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81502

--- Comment #7 from Aldy Hernandez  ---
Author: aldyh
Date: Wed Sep 13 15:51:46 2017
New Revision: 252114

URL: https://gcc.gnu.org/viewcvs?rev=252114&root=gcc&view=rev
Log:
2017-07-27  Richard Biener  

PR tree-optimization/81502
* tree-ssa.c (non_rewritable_lvalue_p): Handle BIT_INSERT_EXPR
with incompatible but same sized type.
(execute_update_addresses_taken): Likewise.

* gcc.target/i386/vect-insert-1.c: New testcase.

Added:
branches/range-gen2/gcc/testsuite/gcc.target/i386/vect-insert-1.c
Modified:
branches/range-gen2/gcc/ChangeLog
branches/range-gen2/gcc/testsuite/ChangeLog
branches/range-gen2/gcc/tree-ssa.c

[Bug middle-end/81502] In some cases the data is moved to memory unnecessarily [partial regression]

2017-07-28 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81502

Richard Biener  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
  Known to work||8.0
 Resolution|--- |FIXED

--- Comment #6 from Richard Biener  ---
Fixed.  Thanks for reporting!

[Bug middle-end/81502] In some cases the data is moved to memory unnecessarily [partial regression]

2017-07-28 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81502

--- Comment #5 from Richard Biener  ---
Author: rguenth
Date: Fri Jul 28 11:27:45 2017
New Revision: 250659

URL: https://gcc.gnu.org/viewcvs?rev=250659&root=gcc&view=rev
Log:
2017-07-28  Richard Biener  

PR tree-optimization/81502
* match.pd: Add pattern combining BIT_INSERT_EXPR with
BIT_FIELD_REF.
* tree-cfg.c (verify_expr): Verify types of BIT_FIELD_REF
size/pos operands.
(verify_gimple_assign_ternary): Likewise for BIT_INSERT_EXPR pos.
* gimple-fold.c (maybe_canonicalize_mem_ref_addr): Use bitsizetype
for BIT_FIELD_REF args.
* fold-const.c (make_bit_field_ref): Likewise.
* tree-vect-stmts.c (vectorizable_simd_clone_call): Likewise.

* gcc.target/i386/pr81502.c: New testcase.

Added:
trunk/gcc/testsuite/gcc.target/i386/pr81502.c
Modified:
trunk/gcc/ChangeLog
trunk/gcc/fold-const.c
trunk/gcc/gimple-fold.c
trunk/gcc/match.pd
trunk/gcc/testsuite/ChangeLog
trunk/gcc/tree-cfg.c
trunk/gcc/tree-vect-stmts.c

[Bug middle-end/81502] In some cases the data is moved to memory unnecessarily [partial regression]

2017-07-27 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81502

Richard Biener  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |rguenth at gcc dot gnu.org

--- Comment #4 from Richard Biener  ---
So the rewrite into SSA should work now.  I'm not sure the pattern combining
BIT_INSERT_EXPR and VECTOR_CST into a CONSTRUCTOR is desirable; a single insert
into a loaded constant vector is likely cheaper, so some cost modeling would be
in order.

So from

int bar(void*) (void * ptr)
{
  int res;
  __m128i word;
  long long int _2;
  unsigned int _4;

   <bb 2> [100.00%] [count: INV]:
  _2 = (long long int) ptr_6(D);
  word_3 = BIT_INSERT_EXPR <{ 0, 0 }, _2, 0 (64 bits)>;
  _4 = BIT_FIELD_REF <word_3, 32, 0>;
  res_5 = (int) _4;
  return res_5;

the desired pattern is combining BIT_FIELD_REF and BIT_INSERT_EXPR.  We'd
combine that into

  _4 = BIT_FIELD_REF <_2, 32, 0>;

I will try to come up with a match.pd rule once time permits.

[Bug middle-end/81502] In some cases the data is moved to memory unnecessarily [partial regression]

2017-07-27 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81502

--- Comment #3 from Richard Biener  ---
Author: rguenth
Date: Thu Jul 27 12:01:21 2017
New Revision: 250620

URL: https://gcc.gnu.org/viewcvs?rev=250620&root=gcc&view=rev
Log:
2017-07-27  Richard Biener  

PR tree-optimization/81502
* tree-ssa.c (non_rewritable_lvalue_p): Handle BIT_INSERT_EXPR
with incompatible but same sized type.
(execute_update_addresses_taken): Likewise.

* gcc.target/i386/vect-insert-1.c: New testcase.

Added:
trunk/gcc/testsuite/gcc.target/i386/vect-insert-1.c
Modified:
trunk/gcc/ChangeLog
trunk/gcc/testsuite/ChangeLog
trunk/gcc/tree-ssa.c

[Bug middle-end/81502] In some cases the data is moved to memory unnecessarily [partial regression]

2017-07-21 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81502

--- Comment #2 from Richard Biener  ---
Note that with -mtune=intel we already get

_Z3barPv:
.LFB526:
.cfi_startproc
        movq    %rdi, %xmm0
        movd    %xmm0, %eax
ret

but yes, the intermediate temporary is unnecessary.  We do not optimize
this to a BIT_INSERT_EXPR because the vector element type doesn't match the
insertion quantity.  We can relax that a bit with the following:

Index: gcc/tree-ssa.c
===================================================================
--- gcc/tree-ssa.c  (revision 250386)
+++ gcc/tree-ssa.c  (working copy)
@@ -1513,8 +1513,8 @@ non_rewritable_lvalue_p (tree lhs)
       if (DECL_P (decl)
          && VECTOR_TYPE_P (TREE_TYPE (decl))
          && TYPE_MODE (TREE_TYPE (decl)) != BLKmode
-         && types_compatible_p (TREE_TYPE (lhs),
-                                TREE_TYPE (TREE_TYPE (decl)))
+         && operand_equal_p (TYPE_SIZE_UNIT (TREE_TYPE (lhs)),
+                             TYPE_SIZE_UNIT (TREE_TYPE (TREE_TYPE (decl))), 0)
          && tree_fits_uhwi_p (TREE_OPERAND (lhs, 1))
          && tree_int_cst_lt (TREE_OPERAND (lhs, 1),
                              TYPE_SIZE_UNIT (TREE_TYPE (decl)))
@@ -1839,8 +1839,9 @@ execute_update_addresses_taken (void)
                && bitmap_bit_p (suitable_for_renaming, DECL_UID (sym))
                && VECTOR_TYPE_P (TREE_TYPE (sym))
                && TYPE_MODE (TREE_TYPE (sym)) != BLKmode
-               && types_compatible_p (TREE_TYPE (lhs),
-                                      TREE_TYPE (TREE_TYPE (sym)))
+               && operand_equal_p (TYPE_SIZE_UNIT (TREE_TYPE (lhs)),
+                                   TYPE_SIZE_UNIT (TREE_TYPE (TREE_TYPE (sym))), 0)
                && tree_fits_uhwi_p (TREE_OPERAND (lhs, 1))
                && tree_int_cst_lt (TREE_OPERAND (lhs, 1),
                                    TYPE_SIZE_UNIT (TREE_TYPE (sym)))
@@ -1848,6 +1849,18 @@ execute_update_addresses_taken (void)
                      % tree_to_uhwi (TYPE_SIZE_UNIT (TREE_TYPE (lhs))) == 0)
              {
                tree val = gimple_assign_rhs1 (stmt);
+               if (! types_compatible_p (TREE_TYPE (lhs),
+                                         TREE_TYPE (TREE_TYPE (sym))))
+                 {
+                   tree tem = make_ssa_name (TREE_TYPE (TREE_TYPE (sym)));
+                   gimple *pun
+                     = gimple_build_assign (tem,
+                                            build1 (VIEW_CONVERT_EXPR,
+                                                    TREE_TYPE (TREE_TYPE (sym)),
+                                                    val));
+                   gsi_insert_before (&gsi, pun, GSI_SAME_STMT);
+                   val = tem;
+                 }
                tree bitpos
                  = wide_int_to_tree (bitsizetype,
                                      mem_ref_offset (lhs) * BITS_PER_UNIT);

this gets us to

int bar(void*) (void * ptr)
{
  int res;
  __m128i word;
  long long int _2;
  unsigned int _4;

   <bb 2> [100.00%] [count: INV]:
  _2 = (long long int) ptr_6(D);
  word_3 = BIT_INSERT_EXPR <{ 0, 0 }, _2, 0 (64 bits)>;
  _4 = BIT_FIELD_REF <word_3, 32, 0>;
  res_5 = (int) _4;
  return res_5;

in .optimized, which shows (already known) missed foldings for bit-field-ref
of bit-insert.  That's a complicated one, btw: extracting a component from
a vector insert.

Oh, and it misses bit-insert -> CONSTRUCTOR, thus

word_3 = { _2, 0 };

(simplify
 (bit_insert VECTOR_CST@0 @1 @2)
 {
   vec<constructor_elt, va_gc> *v;
   vec_alloc (v, TYPE_VECTOR_SUBPARTS (type));
   for (unsigned i = 0; i < VECTOR_CST_NELTS (@0); ++i)
     {
       constructor_elt elt = { NULL_TREE, VECTOR_CST_ELT (@0, i) };
       v->quick_push (elt);
     }
   (*v)[TREE_INT_CST_LOW (@2)
        / TREE_INT_CST_LOW (TYPE_SIZE (TREE_TYPE (type)))].value = @1;
   build_constructor (type, v);
 })

that gets us to

   <bb 2> [100.00%] [count: INV]:
  _2 = (long long int) ptr_6(D);
  word_3 = {_2, 0};
  _4 = BIT_FIELD_REF <word_3, 32, 0>;
  res_5 = (int) _4;
  return res_5;

where we still need that BIT_FIELD_REF simplification.  The IL is already
in this form when we run into FRE1, so handling it there should be
possible in principle.  Or we can fold

  word_3 = {_2, 0};
  _4 = BIT_FIELD_REF <word_3, 32, 0>;

to

  _4 = BIT_FIELD_REF <_2, 32, 0 [+adjustment]>;

thus a BIT_FIELD_REF on a CONSTRUCTOR to one on the element.

[Bug middle-end/81502] In some cases the data is moved to memory unnecessarily [partial regression]

2017-07-20 Thread glisse at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81502

Marc Glisse  changed:

   What|Removed |Added

   Keywords||missed-optimization
 Status|UNCONFIRMED |NEW
   Last reconfirmed||2017-07-21
 Ever confirmed|0   |1

--- Comment #1 from Marc Glisse  ---
.optimized dump:

int bar(void*) (void * ptr)
{
  int res;
  __m128i word;
  long unsigned int _2;
  vector(2) long long int word.3_3;
  unsigned int _4;

   <bb 2> [100.00%] [count: INV]:
  _2 = (long unsigned int) ptr_9(D);
  word = { 0, 0 };
  MEM[(char * {ref-all})&word] = _2;
  word.3_3 = word;
  word ={v} {CLOBBER};
  _4 = BIT_FIELD_REF <word.3_3, 32, 0>;
  res_5 = (int) _4;
  return res_5;

}

We missed turning the memory write into a BIT_INSERT_EXPR, and passes like PRE
missed following the BIT_FIELD_REF all the way to _2.

.combine dump:
[...]
(insn 8 3 10 2 (set (reg/v:V2DI 90 [ word ])
(vec_concat:V2DI (reg/v/f:DI 92 [ ptr ])
(const_int 0 [0]))) "b.c":16 3712 {vec_concatv2di}
 (expr_list:REG_DEAD (reg/v/f:DI 92 [ ptr ])
(nil)))
(insn 10 8 15 2 (set (reg:SI 94 [ res ])
(vec_select:SI (subreg:V4SI (reg/v:V2DI 90 [ word ]) 0)
(parallel [
(const_int 0 [0])
]))) "b.c":20 3697 {*vec_extractv4si_0}
 (expr_list:REG_DEAD (reg/v:V2DI 90 [ word ])
(nil)))
[...]

combine tries
(set (reg:SI 94 [ res ])
(vec_select:SI (subreg:V4SI (vec_concat:V2DI (reg/v/f:DI 92 [ ptr ])
(const_int 0 [0])) 0)
(parallel [
(const_int 0 [0])
])))
which we fail to simplify.  The xmm1->xmm0 mov is not considered a mov by the
compiler but a concatenation with 0, so this is not an RA problem.

The change of mode (64-bit pointer to 32-bit int) seems to play a big role in
confusing things here.