[PATCH] Fix incorrect computation in fill_always_executed_in_1

2021-08-16 Thread Xiong Hu Luo via Gcc-patches
It seems to me that ALWAYS_EXECUTED_IN is not computed correctly for
nested loops: inn_loop is updated to the inner loop, so it needs to be
restored when exiting from the innermost loop.  With this patch, the
store instruction in the outer loop can also be moved out of the outer
loop by store motion.
Any comments?  Thanks.

gcc/ChangeLog:

* tree-ssa-loop-im.c (fill_always_executed_in_1): Restore
inn_loop when exiting from innermost loop.

gcc/testsuite/ChangeLog:

* gcc.dg/tree-ssa/ssa-lim-19.c: New test.
---
 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 24 ++
 gcc/tree-ssa-loop-im.c |  6 +-
 2 files changed, 29 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
new file mode 100644
index 000..097a5ee4a4b
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
@@ -0,0 +1,24 @@
+/* PR/101293 */
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-lim2-details" } */
+
+struct X { int i; int j; int k;};
+
+void foo(struct X *x, int n, int l)
+{
+  for (int j = 0; j < l; j++)
+{
+  for (int i = 0; i < n; ++i)
+   {
+ int *p = &x->j;
+ int tem = *p;
+ x->j += tem * i;
+   }
+  int *r = &x->k;
+  int tem2 = *r;
+  x->k += tem2 * j;
+}
+}
+
+/* { dg-final { scan-tree-dump-times "Executing store motion" 2 "lim2" } } */
+
diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
index b24bc64f2a7..5ca4738b20e 100644
--- a/gcc/tree-ssa-loop-im.c
+++ b/gcc/tree-ssa-loop-im.c
@@ -3211,6 +3211,10 @@ fill_always_executed_in_1 (class loop *loop, sbitmap contains_call)
  if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
last = bb;
 
+ if (inn_loop != loop
+ && flow_loop_nested_p (bb->loop_father, inn_loop))
+   inn_loop = bb->loop_father;
+
  if (bitmap_bit_p (contains_call, bb->index))
break;
 
@@ -3238,7 +3242,7 @@ fill_always_executed_in_1 (class loop *loop, sbitmap contains_call)
 
  if (bb->loop_father->header == bb)
{
- if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
+ if (!dominated_by_p (CDI_DOMINATORS, bb->loop_father->latch, bb))
break;
 
  /* In a loop that is always entered we may proceed anyway.
-- 
2.27.0.90.geebb51ba8c



[RFC] Don't move cold code out of loop by checking bb count

2021-08-01 Thread Xiong Hu Luo via Gcc-patches
There was a patch trying to avoid moving cold blocks out of loops:

https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html

Richard suggested to "never hoist anything from a bb with lower execution
frequency to a bb with higher one in LIM invariantness_dom_walker
before_dom_children".

This patch adds this profile count check to both gimple LIM
move_computations_worker and RTL loop-invariant.c find_invariants_bb:
if the loop bb is colder than the loop preheader, don't hoist it out of
the loop.
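
As a hypothetical illustration (not from the patch or its testsuite),
this is the kind of situation the check is meant to catch: the division
below is loop-invariant, but it sits in a block executed far less often
than the preheader, so hoisting it would add work on the hot path.

/* Hypothetical example; the names and the __builtin_expect hint are
   assumptions, not part of the patch.  */
double
cold_invariant (double a, double b, int n, int rare)
{
  double x = 0.0, sum = 0.0;
  for (int i = 0; i < n; i++)
    {
      if (__builtin_expect (rare, 0))
        x = a / b;      /* invariant, but the block is cold */
      sum += x;
    }
  return sum;
}

With the profile count check, LIM and RTL invariant motion leave the
division where it is instead of hoisting it into the hotter preheader.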

Also, the profile count in the loop split pass should be corrected to
avoid mismatched behavior between lim2 and lim4: currently, the new loop
preheader generated by loop_version is set to "[count: 0]:", so lim4
after the lsplt pass will move a statement out of the loop unexpectedly
when lim2 didn't move it.  This change fixes the regression on 544.nab_r
from -1.55% to +0.46%.

SPEC2017 performance evaluation shows a 1% performance improvement for
the intrate GEOMEAN and no obvious regression for the others.  Notably,
500.perlbench_r improves by +7.52% (perf shows that function S_regtry
of perlbench is largely improved), 548.exchange2_r by +1.98%, and
526.blender_r by +1.00% on P8LE.

Bootstrap and regression tested on P8LE; any comments?  Thanks.

gcc/ChangeLog:

* loop-invariant.c (find_invariants_bb): Check profile count
before motion.
(find_invariants_body): Add argument.
* tree-ssa-loop-im.c (move_computations_worker): Check profile
count before motion.
(execute_sm): Likewise.
(execute_sm_exit): Check pointer validness.
* tree-ssa-loop-split.c (split_loop): Correct probability.
(do_split_loop_on_cond): Likewise.

gcc/testsuite/ChangeLog:

* gcc.dg/tree-ssa/recip-3.c: Adjust.
---
 gcc/loop-invariant.c|  10 +-
 gcc/testsuite/gcc.dg/tree-ssa/recip-3.c |   2 +-
 gcc/tree-ssa-loop-im.c  | 164 +++-
 gcc/tree-ssa-loop-split.c   |  14 +-
 4 files changed, 177 insertions(+), 13 deletions(-)

diff --git a/gcc/loop-invariant.c b/gcc/loop-invariant.c
index bdc7b59dd5f..7b5d64d11f9 100644
--- a/gcc/loop-invariant.c
+++ b/gcc/loop-invariant.c
@@ -1183,9 +1183,14 @@ find_invariants_insn (rtx_insn *insn, bool always_reached, bool always_executed)
call.  */
 
 static void
-find_invariants_bb (basic_block bb, bool always_reached, bool always_executed)
+find_invariants_bb (class loop *loop, basic_block bb, bool always_reached,
+   bool always_executed)
 {
   rtx_insn *insn;
+  basic_block preheader = loop_preheader_edge (loop)->src;
+
+  if (preheader->count > bb->count)
+return;
 
   FOR_BB_INSNS (bb, insn)
 {
@@ -1214,8 +1219,7 @@ find_invariants_body (class loop *loop, basic_block *body,
   unsigned i;
 
   for (i = 0; i < loop->num_nodes; i++)
-find_invariants_bb (body[i],
-   bitmap_bit_p (always_reached, i),
+find_invariants_bb (loop, body[i], bitmap_bit_p (always_reached, i),
bitmap_bit_p (always_executed, i));
 }
 
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
index 638bf38db8c..641c91e719e 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
@@ -23,4 +23,4 @@ float h ()
F[0] += E / d;
 }
 
-/* { dg-final { scan-tree-dump-times " / " 1 "recip" } } */
+/* { dg-final { scan-tree-dump-times " / " 5 "recip" } } */
diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
index 7de47edbcb3..2bfb5e8ec15 100644
--- a/gcc/tree-ssa-loop-im.c
+++ b/gcc/tree-ssa-loop-im.c
@@ -1147,6 +1147,61 @@ move_computations_worker (basic_block bb)
  continue;
}
 
+  edge e = loop_preheader_edge (level);
+  if (e->src->count > bb->count)
+   {
+ if (dump_file && (dump_flags & TDF_DETAILS))
+   {
+ fprintf (dump_file, "PHI node NOT moved to %d from %d:\n",
+  e->src->index, bb->index);
+ print_gimple_stmt (dump_file, stmt, 0);
+ fprintf (dump_file, "(cost %u) out of loop %d.\n\n", cost,
+  level->num);
+   }
+ gsi_next (&bsi);
+ continue;
+   }
+  else
+   {
+ unsigned i;
+ bool skip_phi_move = false;
+ for (i = 0; i < gimple_phi_num_args (stmt); i++)
+   {
+ tree def = PHI_ARG_DEF (stmt, i);
+
+ if (TREE_CODE (def) != SSA_NAME)
+   continue;
+
+ gimple *def_stmt = SSA_NAME_DEF_STMT (def);
+
+ if (!gimple_bb (def_stmt))
+   continue;
+
+ if (!dominated_by_p (CDI_DOMINATORS, e->src,
+  gimple_bb (def_stmt)))
+   {
+ if (dump_file && (dump_flags & TDF_DETAILS))
+   {
+ fprintf (dump_file,
+  "PHI node NOT moved to %d [local count:%d] from "
+  "%d [local 

[PATCH] rs6000: Expand fmod and remainder when built with fast-math [PR97142]

2021-04-16 Thread Xiong Hu Luo via Gcc-patches
fmod/fmodf and remainder/remainderf can be expanded inline instead of
calling the library when built with fast-math, which is much faster.

fmodf:
 fdivs   f0,f1,f2
 frizf0,f0
 fnmsubs f1,f2,f0,f1

remainderf:
 fdivs   f0,f1,f2
 frinf0,f0
 fnmsubs f1,f2,f0,f1
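
For reference (not part of the patch), the expansion implements the
fast-math identities below; the builtins mirror the btrunc/round
expanders used by the patch, and the function names are only
illustrative:

/* Hypothetical sketch of the computation emitted under -ffast-math.  */
double
my_fmod (double x, double y)
{
  return x - __builtin_trunc (x / y) * y;      /* fdiv + friz + fnmsub */
}

double
my_remainder (double x, double y)
{
  return x - __builtin_round (x / y) * y;      /* fdiv + frin + fnmsub */
}

Note that remainder's round-to-nearest-even semantics are only
approximated by frin, which is why the expansion is guarded by
flag_unsafe_math_optimizations.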

gcc/ChangeLog:

2021-04-16  Xionghu Luo  

PR target/97142
* config/rs6000/rs6000.md (fmod3): New define_expand.
(remainder3): Likewise.

gcc/testsuite/ChangeLog:

2021-04-16  Xionghu Luo  

PR target/97142
* gcc.target/powerpc/pr97142.c: New test.
---
 gcc/config/rs6000/rs6000.md| 36 ++
 gcc/testsuite/gcc.target/powerpc/pr97142.c | 30 ++
 2 files changed, 66 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/powerpc/pr97142.c

diff --git a/gcc/config/rs6000/rs6000.md b/gcc/config/rs6000/rs6000.md
index a1315523fec..7e0e94e6ba4 100644
--- a/gcc/config/rs6000/rs6000.md
+++ b/gcc/config/rs6000/rs6000.md
@@ -4902,6 +4902,42 @@ (define_insn "fre"
   [(set_attr "type" "fp")
(set_attr "isa" "*,")])
 
+(define_expand "fmod3"
+  [(use (match_operand:SFDF 0 "gpc_reg_operand"))
+   (use (match_operand:SFDF 1 "gpc_reg_operand"))
+   (use (match_operand:SFDF 2 "gpc_reg_operand"))]
+  "TARGET_HARD_FLOAT
+  && TARGET_FPRND
+  && flag_unsafe_math_optimizations"
+{
+  rtx div = gen_reg_rtx (<MODE>mode);
+  emit_insn (gen_div<mode>3 (div, operands[1], operands[2]));
+
+  rtx friz = gen_reg_rtx (<MODE>mode);
+  emit_insn (gen_btrunc<mode>2 (friz, div));
+
+  emit_insn (gen_nfms<mode>4 (operands[0], operands[2], friz, operands[1]));
+  DONE;
+ })
+
+(define_expand "remainder3"
+  [(use (match_operand:SFDF 0 "gpc_reg_operand"))
+   (use (match_operand:SFDF 1 "gpc_reg_operand"))
+   (use (match_operand:SFDF 2 "gpc_reg_operand"))]
+  "TARGET_HARD_FLOAT
+  && TARGET_FPRND
+  && flag_unsafe_math_optimizations"
+{
+  rtx div = gen_reg_rtx (<MODE>mode);
+  emit_insn (gen_div<mode>3 (div, operands[1], operands[2]));
+
+  rtx frin = gen_reg_rtx (<MODE>mode);
+  emit_insn (gen_round<mode>2 (frin, div));
+
+  emit_insn (gen_nfms<mode>4 (operands[0], operands[2], frin, operands[1]));
+  DONE;
+ })
+
 (define_insn "*rsqrt2"
   [(set (match_operand:SFDF 0 "gpc_reg_operand" "=,wa")
(unspec:SFDF [(match_operand:SFDF 1 "gpc_reg_operand" ",wa")]
diff --git a/gcc/testsuite/gcc.target/powerpc/pr97142.c b/gcc/testsuite/gcc.target/powerpc/pr97142.c
new file mode 100644
index 000..48f25ca5b5b
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/pr97142.c
@@ -0,0 +1,30 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast" } */
+
+#include <math.h>
+
+float test1 (float x, float y)
+{
+  return fmodf (x, y);
+}
+
+double test2 (double x, double y)
+{
+  return fmod (x, y);
+}
+
+float test3 (float x, float y)
+{
+  return remainderf (x, y);
+}
+
+double test4 (double x, double y)
+{
+  return remainder (x, y);
+}
+
+/* { dg-final { scan-assembler-not {\mbl fmod\M} } } */
+/* { dg-final { scan-assembler-not {\mbl fmodf\M} } } */
+/* { dg-final { scan-assembler-not {\mbl remainder\M} } } */
+/* { dg-final { scan-assembler-not {\mbl remainderf\M} } } */
+
-- 
2.27.0.90.geebb51ba8c



[RFC] Run pass_sink_code once more after ivopts/fre

2020-12-21 Thread Xiong Hu Luo via Gcc-patches
Here comes another case that requires running a pass once more.  As this
is not the commonly suggested direction for solving problems, I am not
quite sure whether it is a reasonable fix here.  The source code is
something like:

ref = ip + *hslot;
while (ip < in_end - 2) {
  unsigned int len = 2;
  len++;
for ()   {
  do len++;
  while (len < maxlen && ref[len] == ip[len]); //sink code here.
  break;
}
  len -= 2;
  ip++;
  ip += len + 1;
  if (ip >= in_end - 2)
break;
}

Before ivopts, the gimple for inner while loop is xxx.c.172t.slp1:

   [local count: 75120046]:
  # len_160 = PHI 
  len_189 = len_160 + 1;
  _423 = (sizetype) len_189;
  _424 = ip_229 + _423;
  if (maxlen_186 > len_189)
goto ; [94.50%]
  else
goto ; [5.50%]

   [local count: 70988443]:
  _84 = *_424;
  _86 = ref_182 + _423;
  _87 = *_86;
  if (_84 == _87)
goto ; [94.50%]
  else
goto ; [5.50%]

   [local count: 67084079]:
  goto ; [100.00%]

   [local count: 14847855]:
  # len_263 = PHI 
  # _262 = PHI <_423(32), _423(31)>
  # _264 = PHI <_424(32), _424(31)>
  len_190 = len_263 + 4294967295;
  if (len_190 <= 6)
goto ; [0.00%]
  else
goto ; [100.00%]

Then in ivopts, instructions are updated to xxx.c.174t.ivopts:

   [local count: 75120046]:
  # ivtmp.30_29 = PHI 
  _34 = (unsigned int) ivtmp.30_29;
  len_160 = _34 + 4294967295;
  _423 = ivtmp.30_29;
  _35 = (unsigned long) ip_229;
  _420 = ivtmp.30_29 + _35;
  _419 = (uint8_t *) _420;
  _424 = _419;
  len_418 = (unsigned int) ivtmp.30_29;
  if (maxlen_186 > len_418)
goto ; [94.50%]
  else
goto ; [5.50%]

   [local count: 70988443]:
  _84 = MEM[(uint8_t *)ip_229 + ivtmp.30_29 * 1];
  ivtmp.30_31 = ivtmp.30_29 + 1;
  _417 = ref_182 + 18446744073709551615;
  _87 = MEM[(uint8_t *)_417 + ivtmp.30_31 * 1];
  if (_84 == _87)
goto ; [94.50%]
  else
goto ; [5.50%]

   [local count: 67084079]:
  goto ; [100.00%]

   [local count: 14847855]:
  # len_263 = PHI 
  # _262 = PHI <_423(32), _423(31)>
  # _264 = PHI <_424(32), _424(31)>
  len_190 = len_263 + 4294967295;
  if (len_190 <= 6)
goto ; [0.00%]
  else
goto ; [100.00%]

Some instructions in BB 31 are not used in the loop and could be sunk
out of the loop to reduce the computation, but they are not sunk by any
later pass.  Running the sink_code pass once more, at least after fre5,
improves this typical case's performance by 23% because fewer
instructions are executed in the loop.
xxx.c.209t.sink2:

Sinking _419 = (uint8_t *) _420;
 from bb 31 to bb 89
Sinking _420 = ivtmp.30_29 + _35;
 from bb 31 to bb 89
Sinking _35 = (unsigned long) ip_229;
 from bb 31 to bb 89
Sinking len_160 = _34 + 4294967295;
 from bb 31 to bb 33
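
Conceptually (a hypothetical sketch, not the dump above), the effect of
the extra sinking is:

/* Before sinking, both assignments execute on every iteration even
   though their results are only needed once the loop exits; after
   sinking they are computed on the exit paths instead.  */
long
sink_example (long a, long b, int n)
{
  long t = 0, u = 0;
  for (int i = 0; i < n; i++)
    {
      t = a + i;       /* only used after the loop */
      u = t * b;       /* likewise */
      if (a == i)
        break;
    }
  return t + u;
}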

I also tested the SPEC2017 performance on P8LE: 544.nab_r is improved
by 2.43%, with no big changes to the other cases; the GEOMEAN improves
only slightly, by 0.25%.

The reason it should run after fre5 is that fre does some phi
optimization which exposes the opportunity.  The patch puts it after
pass_modref based on my guess that some gimple optimizations like
thread_jumps, dse, dce etc. could provide more opportunities for
sinking code, but I am not sure it is the correct place to put it.  I
also verified that this issue exists on both X86 and ARM64.
Any comments?  Thanks.
---
 gcc/passes.def  | 1 +
 gcc/tree-ssa-sink.c | 1 +
 2 files changed, 2 insertions(+)

diff --git a/gcc/passes.def b/gcc/passes.def
index 21b2e2af0f7..69106615729 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -355,6 +355,7 @@ along with GCC; see the file COPYING3.  If not see
   NEXT_PASS (pass_uncprop);
   NEXT_PASS (pass_local_pure_const);
   NEXT_PASS (pass_modref);
+  NEXT_PASS (pass_sink_code);
   POP_INSERT_PASSES ()
   NEXT_PASS (pass_all_optimizations_g);
   PUSH_INSERT_PASSES_WITHIN (pass_all_optimizations_g)
diff --git a/gcc/tree-ssa-sink.c b/gcc/tree-ssa-sink.c
index b0abf4147d6..824659f3919 100644
--- a/gcc/tree-ssa-sink.c
+++ b/gcc/tree-ssa-sink.c
@@ -819,6 +819,7 @@ public:
   /* opt_pass methods: */
   virtual bool gate (function *) { return flag_tree_sink != 0; }
   virtual unsigned int execute (function *);
+  opt_pass *clone (void) { return new pass_sink_code (m_ctxt); }
 
 }; // class pass_sink_code
 
-- 
2.27.0.90.geebb51ba8c



[PATCH] Add debug_bb_details and debug_bb_n_details

2020-10-23 Thread Xiong Hu Luo via Gcc-patches
Sometimes debug_bb_slim/debug_bb_n_slim is not enough; how about adding
debug_bb_details/debug_bb_n_details?  Or does a similar call already
exist?
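
With the patch applied, one could then call these from gdb while
debugging cc1, for example:

(gdb) call debug_bb_n_details (3)
(gdb) call debug_bb_details (bb)

which dump the block with TDF_DETAILS | TDF_BLOCKS to stderr, analogous
to what debug_bb_n_slim/debug_bb_slim do for the slim form.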

gcc/ChangeLog:

2020-10-23  Xionghu Luo  

* print-rtl.c (debug_bb_details): New function.
(debug_bb_n_details): New function.
---
 gcc/print-rtl.c | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/gcc/print-rtl.c b/gcc/print-rtl.c
index 25265efc71b..f45873b8863 100644
--- a/gcc/print-rtl.c
+++ b/gcc/print-rtl.c
@@ -2150,6 +2150,21 @@ debug_bb_n_slim (int n)
   debug_bb_slim (bb);
 }
 
+extern void debug_bb_details (basic_block);
+DEBUG_FUNCTION void
+debug_bb_details (basic_block bb)
+{
+  dump_bb (stderr, bb, 0, TDF_DETAILS | TDF_BLOCKS);
+}
+
+extern void debug_bb_n_details (int);
+DEBUG_FUNCTION void
+debug_bb_n_details (int n)
+{
+  basic_block bb = BASIC_BLOCK_FOR_FN (cfun, n);
+  debug_bb_details (bb);
+}
+
 #endif
 
 #if __GNUC__ >= 10
-- 
2.27.0.90.geebb51ba8c



[PATCH v2 1/2] IFN: Implement IFN_VEC_SET for ARRAY_REF with VIEW_CONVERT_EXPR

2020-09-18 Thread Xiong Hu Luo via Gcc-patches
This patch enables the transformation from ARRAY_REF(VIEW_CONVERT_EXPR)
to the VEC_SET internal function in the gimple-isel pass if the target
supports vec_set with a variable index, checked via can_vec_set_var_idx_p.
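
As a minimal illustration (assumed example, not from the patch), this is
the kind of source pattern that gimplifies to ARRAY_REF(VIEW_CONVERT_EXPR)
and can now become a .VEC_SET call:

typedef int v4si __attribute__ ((vector_size (16)));

v4si
set_elem (v4si v, int val, unsigned long n)
{
  v[n % 4] = val;   /* variable-index store into a vector */
  return v;
}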

gcc/ChangeLog:

2020-09-18  Xionghu Luo  

* gimple-isel.cc (gimple_expand_vec_set_expr): New function.
(gimple_expand_vec_cond_exprs): Call gimple_expand_vec_set_expr.
* internal-fn.c (vec_set_direct): New define.
(expand_vec_set_optab_fn): New function.
(direct_vec_set_optab_supported_p): New define.
* internal-fn.def (VEC_SET): New DEF_INTERNAL_OPTAB_FN.
* optabs.c (can_vec_set_var_idx_p): New function.
* optabs.h (can_vec_set_var_idx_p): New declare.
---
 gcc/gimple-isel.cc  | 116 +++-
 gcc/internal-fn.c   |  36 ++
 gcc/internal-fn.def |   2 +
 gcc/optabs.c|  17 +++
 gcc/optabs.h|   3 ++
 5 files changed, 172 insertions(+), 2 deletions(-)

diff --git a/gcc/gimple-isel.cc b/gcc/gimple-isel.cc
index b330cf4c20e..bc61e2895be 100644
--- a/gcc/gimple-isel.cc
+++ b/gcc/gimple-isel.cc
@@ -35,6 +35,80 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-cfg.h"
 #include "bitmap.h"
 #include "tree-ssa-dce.h"
+#include "fold-const.h"
+#include "gimple-fold.h"
+#include "memmodel.h"
+#include "optabs.h"
+
+/* Expand all ARRAY_REF(VIEW_CONVERT_EXPR) gimple assignments into calls to
+   internal function based on vector type of selected expansion.
+   i.e.:
+ VIEW_CONVERT_EXPR<int[4]>(u)[_1] = i_4(D);
+   =>
+ _7 = u;
+ _8 = .VEC_SET (_7, i_4(D), _1);
+ u = _8;  */
+
+static gimple *
+gimple_expand_vec_set_expr (gimple_stmt_iterator *gsi)
+{
+  enum tree_code code;
+  gcall *new_stmt = NULL;
+  gassign *ass_stmt = NULL;
+
+  /* Only consider code == GIMPLE_ASSIGN.  */
+  gassign *stmt = dyn_cast <gassign *> (gsi_stmt (*gsi));
+  if (!stmt)
+return NULL;
+
+  code = TREE_CODE (gimple_assign_lhs (stmt));
+  if (code != ARRAY_REF)
+return NULL;
+
+  tree lhs = gimple_assign_lhs (stmt);
+  tree val = gimple_assign_rhs1 (stmt);
+
+  tree type = TREE_TYPE (lhs);
+  tree op0 = TREE_OPERAND (lhs, 0);
+  if (TREE_CODE (op0) == VIEW_CONVERT_EXPR
+  && tree_fits_uhwi_p (TYPE_SIZE (type)))
+{
+  tree pos = TREE_OPERAND (lhs, 1);
+  tree view_op0 = TREE_OPERAND (op0, 0);
+  machine_mode outermode = TYPE_MODE (TREE_TYPE (view_op0));
+  scalar_mode innermode = GET_MODE_INNER (outermode);
+  tree_code code = TREE_CODE (TREE_TYPE (view_op0));
+  if (!is_global_var (view_op0) && code == VECTOR_TYPE
+ && tree_fits_uhwi_p (TYPE_SIZE (TREE_TYPE (view_op0)))
+ && can_vec_set_var_idx_p (code, outermode, innermode,
+   TYPE_MODE (TREE_TYPE (pos))))
+   {
+ location_t loc = gimple_location (stmt);
+ tree var_src = make_ssa_name (TREE_TYPE (view_op0));
+ tree var_dst = make_ssa_name (TREE_TYPE (view_op0));
+
+ ass_stmt = gimple_build_assign (var_src, view_op0);
+ gimple_set_vuse (ass_stmt, gimple_vuse (stmt));
+ gimple_set_location (ass_stmt, loc);
+ gsi_insert_before (gsi, ass_stmt, GSI_SAME_STMT);
+
+ new_stmt
+   = gimple_build_call_internal (IFN_VEC_SET, 3, var_src, val, pos);
+ gimple_call_set_lhs (new_stmt, var_dst);
+ gimple_set_location (new_stmt, loc);
+ gsi_insert_before (gsi, new_stmt, GSI_SAME_STMT);
+
+ ass_stmt = gimple_build_assign (view_op0, var_dst);
+ gimple_set_location (ass_stmt, loc);
+ gsi_insert_before (gsi, ass_stmt, GSI_SAME_STMT);
+
+ gimple_move_vops (ass_stmt, stmt);
+ gsi_remove (gsi, true);
+   }
+}
+
+  return ass_stmt;
+}
 
 /* Expand all VEC_COND_EXPR gimple assignments into calls to internal
function based on type of selected expansion.  */
@@ -187,8 +261,25 @@ gimple_expand_vec_cond_exprs (void)
 {
   for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
{
- gimple *g = gimple_expand_vec_cond_expr (&gsi,
-  &vec_cond_ssa_name_uses);
+ gassign *stmt = dyn_cast <gassign *> (gsi_stmt (gsi));
+ if (!stmt)
+   continue;
+
+ enum tree_code code;
+ gimple *g = NULL;
+ code = gimple_assign_rhs_code (stmt);
+ switch (code)
+   {
+   case VEC_COND_EXPR:
+ g = gimple_expand_vec_cond_expr (&gsi, &vec_cond_ssa_name_uses);
+ break;
+   case ARRAY_REF:
+ /*  TODO: generate IFN for vec_extract with variable index.  */
+ break;
+   default:
+ break;
+   }
+
  if (g != NULL)
{
  tree lhs = gimple_assign_lhs (gsi_stmt (gsi));
@@ -204,6 +295,27 @@ gimple_expand_vec_cond_exprs (void)
 
   simple_dce_from_worklist (dce_ssa_names);
 
+  FOR_EACH_BB_FN (bb, cfun)
+{
+  for (gsi = gsi_start_bb (bb); 

[PATCH v2 2/2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]

2020-09-18 Thread Xiong Hu Luo via Gcc-patches
vec_insert accepts 3 arguments: arg0 is the input vector, arg1 is the value
to be inserted, and arg2 is the place to insert arg1 into arg0.  The current
expander generates stxv+stwx+lxv if arg2 is a variable instead of a constant,
which causes a serious store-hit-load performance issue on Power.  This patch
tries to:
 1) Build a VIEW_CONVERT_EXPR for vec_insert (i, v, n) like v[n&3] = i to
unify the gimple code, so the expander can use vec_set_optab to expand it.
 2) Expand the IFN VEC_SET to fast instructions: lvsl+xxperm+xxsel.
In this way, "vec_insert (i, v, n)" and "v[n&3] = i" won't be expanded too
early in the gimple stage when arg2 is variable, avoiding the store-hit-load
instructions.

For Power9 V4SI:
addi 9,1,-16
rldic 6,6,2,60
stxv 34,-16(1)
stwx 5,9,6
lxv 34,-16(1)
=>
addis 9,2,.LC0@toc@ha
addi 9,9,.LC0@toc@l
mtvsrwz 33,5
lxv 32,0(9)
sradi 9,6,2
addze 9,9
sldi 9,9,2
subf 9,9,6
subfic 9,9,3
sldi 9,9,2
subfic 9,9,20
lvsl 13,0,9
xxperm 33,33,45
xxperm 32,32,45
xxsel 34,34,33,32

Though the instruction count increases from 5 to 15, performance is
improved by 60% in typical cases.
Tested with V2DI, V2DF, V4SI, V4SF, V8HI and V16QI on Power9-LE and
Power8-BE; bootstrap tests pass.
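
As a usage illustration (assumed example, not part of the patch), the
variable-index case this targets looks like:

#include <altivec.h>

vector int
insert_var (vector int v, int val, int n)
{
  return vec_insert (val, v, n);   /* n is not a compile-time constant */
}

With this patch the insert is expanded via .VEC_SET and the
lvsl/xxperm/xxsel sequence instead of a store-reload through the stack.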

gcc/ChangeLog:

2020-09-18  Xionghu Luo  

* config/rs6000/altivec.md (altivec_lvsl_reg): Rename to
(altivec_lvsl_reg_<mode>2) and extend to SDI mode.
* config/rs6000/rs6000-c.c (altivec_resolve_overloaded_builtin):
Adjust variable index vec_insert to VIEW_CONVERT_EXPR.
* config/rs6000/rs6000-protos.h (rs6000_expand_vector_set_var):
New declare.
* config/rs6000/rs6000.c (rs6000_expand_vector_set_var):
New function.
* config/rs6000/rs6000.md (FQHS): New mode iterator.
(FD): New mode iterator.
p8_mtvsrwz_v16qi2: New define_insn.
p8_mtvsrd_v16qi2: New define_insn.
* config/rs6000/vector.md: Add register operand2 match for
vec_set index.
* config/rs6000/vsx.md: Call gen_altivec_lvsl_reg_di2.

gcc/testsuite/ChangeLog:

2020-09-18  Xionghu Luo  

* gcc.target/powerpc/pr79251.c: New test.
* gcc.target/powerpc/pr79251-run.c: New test.
* gcc.target/powerpc/pr79251.h: New header.
---
 gcc/config/rs6000/altivec.md  |   4 +-
 gcc/config/rs6000/rs6000-c.c  |  22 ++-
 gcc/config/rs6000/rs6000-protos.h |   1 +
 gcc/config/rs6000/rs6000.c| 146 ++
 gcc/config/rs6000/rs6000.md   |  19 +++
 gcc/config/rs6000/vector.md   |  19 ++-
 gcc/config/rs6000/vsx.md  |   2 +-
 .../gcc.target/powerpc/pr79251-run.c  |  29 
 gcc/testsuite/gcc.target/powerpc/pr79251.c|  15 ++
 gcc/testsuite/gcc.target/powerpc/pr79251.h|  19 +++
 10 files changed, 257 insertions(+), 19 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/powerpc/pr79251-run.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/pr79251.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/pr79251.h

diff --git a/gcc/config/rs6000/altivec.md b/gcc/config/rs6000/altivec.md
index 0a2e634d6b0..66b636059a6 100644
--- a/gcc/config/rs6000/altivec.md
+++ b/gcc/config/rs6000/altivec.md
@@ -2772,10 +2772,10 @@ (define_expand "altivec_lvsl"
   DONE;
 })
 
-(define_insn "altivec_lvsl_reg"
+(define_insn "altivec_lvsl_reg_2"
   [(set (match_operand:V16QI 0 "altivec_register_operand" "=v")
(unspec:V16QI
-   [(match_operand:DI 1 "gpc_reg_operand" "b")]
+   [(match_operand:SDI 1 "gpc_reg_operand" "b")]
UNSPEC_LVSL_REG))]
   "TARGET_ALTIVEC"
   "lvsl %0,0,%1"
diff --git a/gcc/config/rs6000/rs6000-c.c b/gcc/config/rs6000/rs6000-c.c
index 2fad3d94706..78abe49c833 100644
--- a/gcc/config/rs6000/rs6000-c.c
+++ b/gcc/config/rs6000/rs6000-c.c
@@ -1509,9 +1509,7 @@ altivec_resolve_overloaded_builtin (location_t loc, tree fndecl,
   tree arg1;
   tree arg2;
   tree arg1_type;
-  tree arg1_inner_type;
   tree decl, stmt;
-  tree innerptrtype;
   machine_mode mode;
 
   /* No second or third arguments. */
@@ -1563,8 +1561,13 @@ altivec_resolve_overloaded_builtin (location_t loc, tree fndecl,
  return build_call_expr (call, 3, arg1, arg0, arg2);
}
 
-  /* Build *(((arg1_inner_type*)&(vector type){arg1})+arg2) = arg0. */
-  arg1_inner_type = TREE_TYPE (arg1_type);
+  /* Build *(((arg1_inner_type*)&(vector type){arg1})+arg2) = arg0 with
+VIEW_CONVERT_EXPR.  i.e.:
+D.3192 = v1;
+_1 = n & 3;
+VIEW_CONVERT_EXPR<int[4]>(D.3192)[_1] = i;
+v1 = D.3192;
+D.3194 = v1;  */
   if (TYPE_VECTOR_SUBPARTS (arg1_type) == 1)
arg2 = build_int_cst (TREE_TYPE (arg2), 0);
   else
@@ -1593,15 +1596,8 @@ altivec_resolve_overloaded_builtin (location_t loc, tree fndecl,
  SET_EXPR_LOCATION 

[PATCH] rs6000: Expand vec_insert in expander instead of gimple [PR79251]

2020-08-31 Thread Xiong Hu Luo via Gcc-patches
vec_insert accepts 3 arguments: arg0 is the input vector, arg1 is the value
to be inserted, and arg2 is the place to insert arg1 into arg0.  This patch
adds __builtin_vec_insert_v4si[v4sf,v2di,v2df,v8hi,v16qi] for vec_insert so
it is not expanded too early in the gimple stage when arg2 is variable, to
avoid generating store-hit-load instructions.

For Power9 V4SI:
addi 9,1,-16
rldic 6,6,2,60
stxv 34,-16(1)
stwx 5,9,6
lxv 34,-16(1)
=>
addis 9,2,.LC0@toc@ha
addi 9,9,.LC0@toc@l
mtvsrwz 33,5
lxv 32,0(9)
sradi 9,6,2
addze 9,9
sldi 9,9,2
subf 9,9,6
subfic 9,9,3
sldi 9,9,2
subfic 9,9,20
lvsl 13,0,9
xxperm 33,33,45
xxperm 32,32,45
xxsel 34,34,33,32

Though the instruction count increases from 5 to 15, performance is
improved by 60% in typical cases.

gcc/ChangeLog:

* config/rs6000/altivec.md (altivec_lvsl_reg_2): Extend to
SDI mode.
* config/rs6000/rs6000-builtin.def (BU_VSX_X): Add support
macros for vec_insert built-in functions.
* config/rs6000/rs6000-c.c (altivec_resolve_overloaded_builtin):
Generate built-in calls for vec_insert.
* config/rs6000/rs6000-call.c (altivec_expand_vec_insert_builtin):
New function.
(altivec_expand_builtin): Add case entry for
VSX_BUILTIN_VEC_INSERT_V16QI, VSX_BUILTIN_VEC_INSERT_V8HI,
VSX_BUILTIN_VEC_INSERT_V4SF,  VSX_BUILTIN_VEC_INSERT_V4SI,
VSX_BUILTIN_VEC_INSERT_V2DF,  VSX_BUILTIN_VEC_INSERT_V2DI.
(altivec_init_builtins):
* config/rs6000/rs6000-protos.h (rs6000_expand_vector_insert):
New declaration.
* config/rs6000/rs6000.c (rs6000_expand_vector_insert):
New function.
* config/rs6000/rs6000.md (FQHS): New mode iterator.
(FD): New mode iterator.
p8_mtvsrwz_v16qi2: New define_insn.
p8_mtvsrd_v16qi2: New define_insn.
* config/rs6000/vsx.md: Call gen_altivec_lvsl_reg_di2.

gcc/testsuite/ChangeLog:

* gcc.target/powerpc/pr79251.c: New test.
---
 gcc/config/rs6000/altivec.md   |   4 +-
 gcc/config/rs6000/rs6000-builtin.def   |   6 +
 gcc/config/rs6000/rs6000-c.c   |  61 +
 gcc/config/rs6000/rs6000-call.c|  74 +++
 gcc/config/rs6000/rs6000-protos.h  |   1 +
 gcc/config/rs6000/rs6000.c | 146 +
 gcc/config/rs6000/rs6000.md|  19 +++
 gcc/config/rs6000/vsx.md   |   2 +-
 gcc/testsuite/gcc.target/powerpc/pr79251.c |  23 
 9 files changed, 333 insertions(+), 3 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/powerpc/pr79251.c

diff --git a/gcc/config/rs6000/altivec.md b/gcc/config/rs6000/altivec.md
index 0a2e634d6b0..66b636059a6 100644
--- a/gcc/config/rs6000/altivec.md
+++ b/gcc/config/rs6000/altivec.md
@@ -2772,10 +2772,10 @@
   DONE;
 })
 
-(define_insn "altivec_lvsl_reg"
+(define_insn "altivec_lvsl_reg_2"
   [(set (match_operand:V16QI 0 "altivec_register_operand" "=v")
(unspec:V16QI
-   [(match_operand:DI 1 "gpc_reg_operand" "b")]
+   [(match_operand:SDI 1 "gpc_reg_operand" "b")]
UNSPEC_LVSL_REG))]
   "TARGET_ALTIVEC"
   "lvsl %0,0,%1"
diff --git a/gcc/config/rs6000/rs6000-builtin.def b/gcc/config/rs6000/rs6000-builtin.def
index f9f0fece549..d095b365c14 100644
--- a/gcc/config/rs6000/rs6000-builtin.def
+++ b/gcc/config/rs6000/rs6000-builtin.def
@@ -2047,6 +2047,12 @@ BU_VSX_X (VEC_INIT_V2DI,  "vec_init_v2di",   CONST)
 BU_VSX_X (VEC_SET_V1TI,  "vec_set_v1ti",   CONST)
 BU_VSX_X (VEC_SET_V2DF,  "vec_set_v2df",   CONST)
 BU_VSX_X (VEC_SET_V2DI,  "vec_set_v2di",   CONST)
+BU_VSX_X (VEC_INSERT_V16QI,  "vec_insert_v16qi",   CONST)
+BU_VSX_X (VEC_INSERT_V8HI,   "vec_insert_v8hi",CONST)
+BU_VSX_X (VEC_INSERT_V4SI,   "vec_insert_v4si",CONST)
+BU_VSX_X (VEC_INSERT_V4SF,   "vec_insert_v4sf",CONST)
+BU_VSX_X (VEC_INSERT_V2DI,   "vec_insert_v2di",CONST)
+BU_VSX_X (VEC_INSERT_V2DF,   "vec_insert_v2df",CONST)
 BU_VSX_X (VEC_EXT_V1TI,  "vec_ext_v1ti",   CONST)
 BU_VSX_X (VEC_EXT_V2DF,  "vec_ext_v2df",   CONST)
 BU_VSX_X (VEC_EXT_V2DI,  "vec_ext_v2di",   CONST)
diff --git a/gcc/config/rs6000/rs6000-c.c b/gcc/config/rs6000/rs6000-c.c
index 2fad3d94706..03b00738a5e 100644
--- a/gcc/config/rs6000/rs6000-c.c
+++ b/gcc/config/rs6000/rs6000-c.c
@@ -1563,6 +1563,67 @@ altivec_resolve_overloaded_builtin (location_t loc, tree fndecl,
  return build_call_expr (call, 3, arg1, arg0, arg2);
}
 
+  else if (VECTOR_MEM_VSX_P (mode))
+   {
+ tree call = NULL_TREE;
+
+ arg2 = fold_for_warn (arg2);
+
+ /* If the second argument is variable, we can optimize it if we are
+generating 64-bit code on a machine with direct move.  */
+   

[PATCH] dse: Remove partial load after full store for high part access[PR71309]

2020-07-21 Thread Xiong Hu Luo via Gcc-patches
This patch could optimize (works for char/short/int/void*):

6: r119:TI=[r118:DI+0x10]
7: [r118:DI]=r119:TI
8: r121:DI=[r118:DI+0x8]

=>

6: r119:TI=[r118:DI+0x10]
16: r122:DI=r119:TI#8

The final asm will be as below, without a partial load after the full store (stxv+ld):
  ld 10,16(3)
  mr 9,3
  ld 3,24(3)
  std 10,0(9)
  std 3,8(9)
  blr

It could achieve ~25% performance improvement for typical cases on
Power9.  Bootstrap and regression tested on Power9-LE.

BTW, for AArch64, one ldr is replaced by a mov with this patch, though
no performance change was observed...

ldp x2, x3, [x0, 16]
stp x2, x3, [x0]
ldr x0, [x0, 8]

=>

mov x1, x0
ldp x2, x0, [x0, 16]
stp x2, x0, [x1]

gcc/ChangeLog:

2020-07-21  Xionghu Luo  

PR rtl-optimization/71309
* dse.c (get_stored_val): Use subreg before extract if shifting
from high part.

gcc/testsuite/ChangeLog:

2020-07-21  Xionghu Luo  

PR rtl-optimization/71309
* gcc.target/powerpc/pr71309.c: New test.
* gcc.target/powerpc/fold-vec-extract-short.p7.c: Add -mbig.
---
 gcc/dse.c | 26 ---
 .../powerpc/fold-vec-extract-short.p7.c   |  2 +-
 gcc/testsuite/gcc.target/powerpc/pr71309.c| 33 +++
 3 files changed, 56 insertions(+), 5 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/powerpc/pr71309.c

diff --git a/gcc/dse.c b/gcc/dse.c
index bbe792e48e8..13f952ee5ff 100644
--- a/gcc/dse.c
+++ b/gcc/dse.c
@@ -1855,7 +1855,7 @@ get_stored_val (store_info *store_info, machine_mode read_mode,
 {
   machine_mode store_mode = GET_MODE (store_info->mem);
   poly_int64 gap;
-  rtx read_reg;
+  rtx read_reg = NULL;
 
   /* To get here the read is within the boundaries of the write so
  shift will never be negative.  Start out with the shift being in
@@ -1872,9 +1872,27 @@ get_stored_val (store_info *store_info, machine_mode read_mode,
 {
   poly_int64 shift = gap * BITS_PER_UNIT;
   poly_int64 access_size = GET_MODE_SIZE (read_mode) + gap;
-  read_reg = find_shift_sequence (access_size, store_info, read_mode,
- shift, optimize_bb_for_speed_p (bb),
- require_cst);
+  rtx rhs_subreg = NULL;
+
+  if (known_eq (GET_MODE_BITSIZE (store_mode), shift * 2))
+   {
+ scalar_int_mode inner_mode = smallest_int_mode_for_size (shift);
+ poly_uint64 sub_off
+   = ((!BYTES_BIG_ENDIAN)
+? GET_MODE_SIZE (store_mode) - GET_MODE_SIZE (inner_mode)
+: 0);
+
+ rhs_subreg = simplify_gen_subreg (inner_mode, store_info->rhs,
+   store_mode, sub_off);
+ if (rhs_subreg)
+   read_reg
+ = extract_low_bits (read_mode, inner_mode, copy_rtx (rhs_subreg));
+   }
+
+  if (read_reg == NULL)
+   read_reg
+ = find_shift_sequence (access_size, store_info, read_mode, shift,
+optimize_bb_for_speed_p (bb), require_cst);
 }
   else if (store_mode == BLKmode)
 {
diff --git a/gcc/testsuite/gcc.target/powerpc/fold-vec-extract-short.p7.c b/gcc/testsuite/gcc.target/powerpc/fold-vec-extract-short.p7.c
index 8616e7b11ad..b5cefe7dc12 100644
--- a/gcc/testsuite/gcc.target/powerpc/fold-vec-extract-short.p7.c
+++ b/gcc/testsuite/gcc.target/powerpc/fold-vec-extract-short.p7.c
@@ -3,7 +3,7 @@
 
 /* { dg-do compile { target { powerpc*-*-linux* } } } */
 /* { dg-require-effective-target powerpc_vsx_ok } */
-/* { dg-options "-mdejagnu-cpu=power7 -O2" } */
+/* { dg-options "-mdejagnu-cpu=power7 -O2 -mbig" } */
 
 // six tests total. Targeting P7 BE.
// p7 (be) vars: li, addi,  stxvw4x, rldic, addi, lhax/lhzx
diff --git a/gcc/testsuite/gcc.target/powerpc/pr71309.c b/gcc/testsuite/gcc.target/powerpc/pr71309.c
new file mode 100644
index 000..94d727a8ed9
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/pr71309.c
@@ -0,0 +1,33 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target powerpc_p9vector_ok } */
+/* { dg-options "-O2 -mdejagnu-cpu=power9" } */
+
+#define TYPE void*
+#define TYPE2 void*
+
+struct path {
+TYPE2 mnt;
+TYPE dentry;
+};
+
+struct nameidata {
+struct path path;
+struct path root;
+};
+
+__attribute__ ((noinline))
+TYPE foo(struct nameidata *nd)
+{
+  TYPE d;
+  TYPE2 d2;
+
+  nd->path = nd->root;
+  d = nd->path.dentry;
+  d2 = nd->path.mnt;
+  return d;
+}
+
+/* { dg-final { scan-assembler-not {\mlxv\M} } } */
+/* { dg-final { scan-assembler-not {\mstxv\M} } } */
+/* { dg-final { scan-assembler-times {\mld\M} 2 } } */
+/* { dg-final { scan-assembler-times {\mstd\M} 2 } } */
-- 
2.27.0.90.geebb51ba8c



[PATCH 2/2] rs6000: Define define_insn_and_split to split unspec sldi+or to rldimi

2020-07-09 Thread Xiong Hu Luo via Gcc-patches
The combine pass can recognize the defined pattern and split it in
split1; this patch optimizes:

21: r130:DI=r133:DI<<0x20
11: {r129:DI=zero_extend(unspec[[r145:DI]] 87);clobber scratch;}
22: r134:DI=r130:DI|r129:DI

to

21: {r149:DI=zero_extend(unspec[[r145:DI]] 87);clobber scratch;}
22: r134:DI=r149:DI&0xffffffff|r133:DI<<0x20

rldimi is generated instead of sldi+or.

gcc/ChangeLog:

2020-07-10  Xionghu Luo  

* config/rs6000/rs6000.md (rotl_unspec): New
define_insn_and_split.

gcc/testsuite/ChangeLog:

2020-07-10  Xionghu Luo  

* gcc.target/powerpc/vector_float.c: New test.
---
 gcc/config/rs6000/rs6000.md   | 26 +++
 .../gcc.target/powerpc/vector_float.c | 14 ++
 2 files changed, 40 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/powerpc/vector_float.c

diff --git a/gcc/config/rs6000/rs6000.md b/gcc/config/rs6000/rs6000.md
index 0aa5265d199..64b655df363 100644
--- a/gcc/config/rs6000/rs6000.md
+++ b/gcc/config/rs6000/rs6000.md
@@ -4239,6 +4239,32 @@
   operands[5] = GEN_INT ((HOST_WIDE_INT_1U << ) - 1);
 })
 
+; rldimi with UNSPEC_SI_FROM_SF.
+(define_insn_and_split "*rotl_unspec"
+  [(set (match_operand:DI 0 "gpc_reg_operand")
+   (ior:DI
+(ashift:DI (match_operand:DI 1 "gpc_reg_operand")
+ (match_operand:SI 2 "const_int_operand"))
+(zero_extend:DI
+ (unspec:QHSI
+  [(match_operand:SF 3 "memory_operand")]
+  UNSPEC_SI_FROM_SF
+  (clobber (match_scratch:V4SF 4))]
+  "INTVAL (operands[2]) == "
+  "#"
+  ""
+  [(parallel [(set (match_dup 5)
+  (zero_extend:DI (unspec:QHSI [(match_dup 3)] UNSPEC_SI_FROM_SF)))
+(clobber (match_dup 4))])
+  (set (match_dup 0)
+   (ior:DI
+(and:DI (match_dup 5) (match_dup 6))
+(ashift:DI (match_dup 1) (match_dup 2]
+{
+  operands[5] = gen_reg_rtx (DImode);
+  operands[6] = GEN_INT ((HOST_WIDE_INT_1U << <bits>) - 1);
+})
+
 ; rlwimi, too.
 (define_split
   [(set (match_operand:SI 0 "gpc_reg_operand")
diff --git a/gcc/testsuite/gcc.target/powerpc/vector_float.c b/gcc/testsuite/gcc.target/powerpc/vector_float.c
new file mode 100644
index 000..414824ad264
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/vector_float.c
@@ -0,0 +1,14 @@
+/* { dg-do compile  } */
+/* { dg-options "-O2 -mdejagnu-cpu=power9" } */
+
+vector float
+test (float *a, float *b, float *c, float *d)
+{
+  return (vector float){*a, *b, *c, *d};
+}
+
+/* { dg-final { scan-assembler-not {\mlxsspx\M} } } */
+/* { dg-final { scan-assembler-not {\mlfs\M} } } */
+/* { dg-final { scan-assembler-times {\mlwz\M} 4 } } */
+/* { dg-final { scan-assembler-times {\mrldimi\M} 2 } } */
+/* { dg-final { scan-assembler-times {\mmtvsrdd\M} 1 } } */
-- 
2.27.0.90.geebb51ba8c



[PATCH 1/2] rs6000: Init V4SF vector without converting SP to DP

2020-07-09 Thread Xiong Hu Luo via Gcc-patches
Move the V4SF elements to V4SI, initialize the vector like V4SI, and
move it back to V4SF.  A better instruction sequence can be generated
on Power9:

lfs + xxpermdi + xvcvdpsp + vmrgew
=>
lwz + (sldi + or) + mtvsrdd

With the following patch, it can be further optimized to:

lwz + rldimi + mtvsrdd

The point is to use lwz to avoid converting single precision to double
precision upon load, packing the four 32-bit values into one 128-bit
register directly.
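
For reference, the initializer pattern this affects is the one exercised
by the test added in patch 2/2 (compiled with AltiVec/VSX enabled):

vector float
test (float *a, float *b, float *c, float *d)
{
  return (vector float){*a, *b, *c, *d};
}

With this change each element is loaded with lwz, the pairs are merged
with sldi+or (rldimi after patch 2/2), and the two doublewords are
combined with mtvsrdd, instead of going through
lfs/xxpermdi/xvcvdpsp/vmrgew.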

gcc/ChangeLog:

2020-07-10  Xionghu Luo  

* config/rs6000/rs6000.c (rs6000_expand_vector_init):
Move V4SF to V4SI, init vector like V4SI and move to V4SF back.
---
 gcc/config/rs6000/rs6000.c | 49 +++---
 1 file changed, 24 insertions(+), 25 deletions(-)

diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index 58f5d780603..d94e88c23a5 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -6423,35 +6423,34 @@ rs6000_expand_vector_init (rtx target, rtx vals)
}
   else
{
- rtx dbl_even = gen_reg_rtx (V2DFmode);
- rtx dbl_odd  = gen_reg_rtx (V2DFmode);
- rtx flt_even = gen_reg_rtx (V4SFmode);
- rtx flt_odd  = gen_reg_rtx (V4SFmode);
- rtx op0 = force_reg (SFmode, XVECEXP (vals, 0, 0));
- rtx op1 = force_reg (SFmode, XVECEXP (vals, 0, 1));
- rtx op2 = force_reg (SFmode, XVECEXP (vals, 0, 2));
- rtx op3 = force_reg (SFmode, XVECEXP (vals, 0, 3));
-
- /* Use VMRGEW if we can instead of doing a permute.  */
- if (TARGET_P8_VECTOR)
+ rtx tmpSF[4];
+ rtx tmpSI[4];
+ rtx tmpDI[4];
+ rtx mrgDI[4];
+ for (i = 0; i < 4; i++)
{
- emit_insn (gen_vsx_concat_v2sf (dbl_even, op0, op2));
- emit_insn (gen_vsx_concat_v2sf (dbl_odd, op1, op3));
- emit_insn (gen_vsx_xvcvdpsp (flt_even, dbl_even));
- emit_insn (gen_vsx_xvcvdpsp (flt_odd, dbl_odd));
- if (BYTES_BIG_ENDIAN)
-   emit_insn (gen_p8_vmrgew_v4sf_direct (target, flt_even, flt_odd));
- else
-   emit_insn (gen_p8_vmrgew_v4sf_direct (target, flt_odd, flt_even));
+ tmpSI[i] = gen_reg_rtx (SImode);
+ tmpDI[i] = gen_reg_rtx (DImode);
+ mrgDI[i] = gen_reg_rtx (DImode);
+ tmpSF[i] = force_reg (SFmode, XVECEXP (vals, 0, i));
+ emit_insn (gen_movsi_from_sf (tmpSI[i], tmpSF[i]));
+ emit_insn (gen_zero_extendsidi2 (tmpDI[i], tmpSI[i]));
}
- else
+
+ if (!BYTES_BIG_ENDIAN)
{
- emit_insn (gen_vsx_concat_v2sf (dbl_even, op0, op1));
- emit_insn (gen_vsx_concat_v2sf (dbl_odd, op2, op3));
- emit_insn (gen_vsx_xvcvdpsp (flt_even, dbl_even));
- emit_insn (gen_vsx_xvcvdpsp (flt_odd, dbl_odd));
- rs6000_expand_extract_even (target, flt_even, flt_odd);
+ std::swap (tmpDI[0], tmpDI[1]);
+ std::swap (tmpDI[2], tmpDI[3]);
}
+
+ emit_insn (gen_ashldi3 (mrgDI[0], tmpDI[0], GEN_INT (32)));
+ emit_insn (gen_iordi3 (mrgDI[1], mrgDI[0], tmpDI[1]));
+ emit_insn (gen_ashldi3 (mrgDI[2], tmpDI[2], GEN_INT (32)));
+ emit_insn (gen_iordi3 (mrgDI[3], mrgDI[2], tmpDI[3]));
+
+ rtx tmpV2DI = gen_reg_rtx (V2DImode);
+ emit_insn (gen_vsx_concat_v2di (tmpV2DI, mrgDI[1], mrgDI[3]));
+ emit_move_insn (target, gen_lowpart (V4SFmode, tmpV2DI));
}
   return;
 }
-- 
2.27.0.90.geebb51ba8c