On 2020/9/14 17:47, Richard Biener wrote:
On Mon, Sep 14, 2020 at 10:05 AM luoxhu <luo...@linux.ibm.com> wrote:

Not sure whether this reflects the issues you discussed above.
I constructed below test cases and tested with and without this patch,
only if "a+c"(which means store only), the performance is getting bad with
this patch due to extra load/store(about 50% slower).
For "v = vec_insert (i, v, n);" usage, v is always loaded after store, so this
patch will always get big benefit.
But for "v[n % 4] = i;" usage, it depends on whether "v" is used immediately
inside the function or out of the function soon.  Does this mean unify the two
usage to same gimple code not a good idea sometimes?  Or is it possible to
detect the generated IFN ".VEC_INSERT (&v, i_4(D), _1);" destination "v" is
used not far away inside or out side of the function?


#define TYPE int

vector TYPE v = {1, 2, 3, 4};   // a. global vector.
vector TYPE s = {4, 3, 2, 1};

__attribute__ ((noinline))
vector TYPE
test (vector TYPE u, TYPE i, size_t n)
{
   v[n % 4] = i;

^^^

this should be

    u[n % 4] = i;

I guess.  Is the % 4 mandated by the vec_insert semantics btw?

Yes.  Segher pasted the builtin description in his reply.  "v = vec_insert (i, u, 
n);"
is a bit different with "u[n % 4] = i;" since it returns a copy of u instead of 
modify
u. The adjust in lower __builtin_vec_insert_xxx gimple will make a copy of u 
first to
meet the requirements.


If you tested with the above error you probably need to re-measure.

No I did test for u as local instead of global before.  If u is not used very
soon, the performance is almost the same for generating single store or IFN_SET
of inserting with variable index.

source:
__attribute__ ((noinline)) vector TYPE
test (vector TYPE u, TYPE i, size_t n)
{
 u[n % 4] = i;
 vector TYPE r = {0};
 for (long k = 0; k < 100; k++)
   {
     r += v;
   }
 return u+r;
}

=> store hit load is relieved due to long distance.

ASM:
0:      addis 2,12,.TOC.-.LCF0@ha
       addi 2,2,.TOC.-.LCF0@l
       .localentry     test,.-test
       addis 10,2,.LANCHOR0@toc@ha
       li 8,50
       xxspltib 32,0
       addi 9,1,-16
       rldic 6,6,2,60
       stxv 34,-16(1)
       addi 10,10,.LANCHOR0@toc@l
       mtctr 8
       xxlor 33,32,32
       stwx 5,9,6      // short store
       lxv 45,0(10)
       .p2align 4,,15
.L2:
       vadduwm 0,0,13
       vadduwm 1,1,13
       bdnz .L2
       vadduwm 0,0,1
       lxv 33,-16(1)   // wide load
       vadduwm 2,0,1
       blr


Then I intend to use "v" there for global memory insert test to separate
the store and load into different function ("v" is stored in function,
but loaded out of function will get different performance for single store
or lxv+xxperm+xxsel+stxv, the inner function doesn't know the distance
between load and store across functions).

Do you mean that we don't need generate IFN_SET if "v" is global memory?
I only see VAR_DECL and PARM_DECL, is there any function to check the tree
variable is global? I added DECL_REGISTER, but the RTL still expands to stack:

gcc/internal-fn.c: rtx to_rtx = expand_expr (view_op0, NULL_RTX, VOIDmode, 
EXPAND_WRITE);

(gdb) p view_op0
$584 = <var_decl 0x7ffff7f705a0>
(gdb) p DECL_REGISTER(view_op0)
$585 = 1
(gdb) p to_rtx
$586 = (rtx_def *) (mem/c:V4SI (reg/f:DI 112 virtual-stack-vars) [1 D.3190+0 
S16 A128])



On gimple the above function (after fixing it) looks like

   VIEW_CONVERT_EXPR<int[4]>(u)[_1] = i_4(D);

and the IFN idea I had would - for non-global memory 'u' only - transform
this to

   vector_register_2 = u;
   vector_register_3 = .IFN_VEC_SET (vector_register_2, _1, i_4(D));
   u = vector_register_3;

if vec_set can handle variable indexes.  This then becomes a
vec_set on a register and if that was the only variable indexing of 'u'
will also cause 'u' to be expanded as register rather than stack memory.

Note we can't use the direct-optab method here since the vec_set optab
modifies operand 0 which isn't possible in SSA form.  That might hint
at that we eventually want to extend BIT_INSERT_EXPR to handle
a non-constant bit position but for experiments using an alternate
internal function is certainly easier.


My current implementation does:

1)  v = vec_insert (i, u, n);

=>gimple:
{
 register __vector signed int D.3190;
 D.3190 = u;            // *new decl and copy u first.*
 _1 = n & 3;
 VIEW_CONVERT_EXPR<int[4]>(D.3190)[_1] = i;   // *update op0 of 
VIEW_CONVERT_EXPR*
 _2 = D.3190;
 ...
}

=>isel:
{
 register __vector signed int D.3190;
 D.3190 = u_4(D);
 _1 = n_6(D) & 3;
 .VEC_SET (&D.3190, i_7(D), _1);
 _2 = D.3190;
 ...
}


2) u[n % 4] = i;

=>gimple:
{
 __vector signed int D.3191;
 _1 = n & 3;
 VIEW_CONVERT_EXPR<int[4]>(u)[_1] = i;   // *update op0 of VIEW_CONVERT_EXPR*
D.3191 = u; ...
}

=>isel:
{
 D.3190 = u_4(D);
 _1 = n_6(D) & 3;
 .VEC_SET (&D.3191, i_7(D), _1);
 _2 = D.3190;
 v = _2;
}

The IFN ".VEC_SET" behavior quite like the other IFN STORE functions and doesn't
require a dest operand to be set?  Both 1) and 2) are modifying operand 0 of
VIEW_CONVERT_EXPR just like vec_set optab.
Attached the IFN vec_set patch part, the expand part is moved from 
expr.c:expand_assignment
to internal-fn.c:expand_vec_set_optab_fn now.


Thanks,
Xionghu
From ae64e4903ebf945995501cd57569cfe4939bc574 Mon Sep 17 00:00:00 2001
From: Xiong Hu Luo <luo...@linux.ibm.com>
Date: Mon, 14 Sep 2020 21:08:11 -0500
Subject: [PATCH] IFN: Implement IFN_VEC_SET for vec_insert

gcc/ChangeLog:

        * gimple-isel.cc (gimple_expand_vec_set_expr):
        (gimple_expand_vec_cond_exprs):
        * internal-fn.c (vec_set_direct):
        (expand_vec_set_optab_fn):
        (direct_vec_set_optab_supported_p):
        * internal-fn.def (VEC_SET):
---
 gcc/gimple-isel.cc  | 117 +++++++++++++++++++++++++++++++++++++++++++-
 gcc/internal-fn.c   |  38 ++++++++++++++
 gcc/internal-fn.def |   2 +
 3 files changed, 155 insertions(+), 2 deletions(-)

diff --git a/gcc/gimple-isel.cc b/gcc/gimple-isel.cc
index b330cf4c20e..92afe0306fa 100644
--- a/gcc/gimple-isel.cc
+++ b/gcc/gimple-isel.cc
@@ -35,6 +35,76 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-cfg.h"
 #include "bitmap.h"
 #include "tree-ssa-dce.h"
+#include "fold-const.h"
+
+static gimple *
+gimple_expand_vec_extract_expr (
+  gimple_stmt_iterator *gsi,
+  hash_map<tree, unsigned int> *vec_cond_ssa_name_uses)
+{
+  enum tree_code code;
+
+  /* Only consider code == GIMPLE_ASSIGN.  */
+  gassign *stmt = dyn_cast<gassign *> (gsi_stmt (*gsi));
+  if (!stmt)
+    return NULL;
+
+  code = gimple_assign_rhs_code (stmt);
+  if (code != ARRAY_REF)
+    return NULL;
+
+  return NULL;
+}
+
+static gimple *
+gimple_expand_vec_set_expr (gimple_stmt_iterator *gsi)
+{
+  enum tree_code code;
+  enum insn_code icode;
+  gcall *new_stmt = NULL;
+
+  /* Only consider code == GIMPLE_ASSIGN.  */
+  gassign *stmt = dyn_cast<gassign *> (gsi_stmt (*gsi));
+  if (!stmt)
+    return NULL;
+
+  code = TREE_CODE (gimple_assign_lhs (stmt));
+  if (code != ARRAY_REF)
+    return NULL;
+
+  tree lhs = gimple_assign_lhs (stmt);
+  tree val = gimple_assign_rhs1 (stmt);
+
+  tree type = TREE_TYPE (lhs);
+  tree op0 = TREE_OPERAND (lhs, 0);
+  if (TREE_CODE (op0) == VIEW_CONVERT_EXPR
+      && tree_fits_uhwi_p (TYPE_SIZE (type)))
+    {
+      tree pos = TREE_OPERAND (lhs, 1);
+      tree view_op0 = TREE_OPERAND (op0, 0);
+      mark_addressable (view_op0);
+      machine_mode outermode = TYPE_MODE (TREE_TYPE (view_op0));
+      scalar_mode innermode = GET_MODE_INNER (outermode);
+      wide_int minv, maxv;
+      if (TREE_CODE (TREE_TYPE (view_op0)) == VECTOR_TYPE
+         && tree_fits_uhwi_p (TYPE_SIZE (TREE_TYPE (view_op0)))
+         && tree_to_uhwi (TYPE_SIZE (TREE_TYPE (view_op0))) == 128
+         && determine_value_range (pos, &minv, &maxv) == VR_RANGE
+         && wi::geu_p (minv, 0)
+         && wi::leu_p (maxv, (128 / GET_MODE_BITSIZE (innermode))))
+       {
+         tree addr
+           = force_gimple_operand_gsi (gsi, build_fold_addr_expr (view_op0),
+                                       true, NULL_TREE, true, GSI_SAME_STMT);
+
+         new_stmt
+           = gimple_build_call_internal (IFN_VEC_SET, 3, addr, val, pos);
+         gimple_move_vops (new_stmt, stmt);
+       }
+    }
+
+  return new_stmt;
+}
 
 /* Expand all VEC_COND_EXPR gimple assignments into calls to internal
    function based on type of selected expansion.  */
@@ -187,8 +257,26 @@ gimple_expand_vec_cond_exprs (void)
     {
       for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
        {
-         gimple *g = gimple_expand_vec_cond_expr (&gsi,
-                                                  &vec_cond_ssa_name_uses);
+         gassign *stmt = dyn_cast<gassign *> (gsi_stmt (gsi));
+         if (!stmt)
+           continue;
+
+         enum tree_code code;
+         gimple *g = NULL;
+         code = gimple_assign_rhs_code (stmt);
+         switch (code)
+           {
+           case VEC_COND_EXPR:
+             g = gimple_expand_vec_cond_expr (&gsi, &vec_cond_ssa_name_uses);
+             break;
+           case ARRAY_REF:
+             g = gimple_expand_vec_extract_expr (&gsi,
+                                                 &vec_cond_ssa_name_uses);
+             break;
+           default:
+             break;
+           }
+
          if (g != NULL)
            {
              tree lhs = gimple_assign_lhs (gsi_stmt (gsi));
@@ -204,6 +292,31 @@ gimple_expand_vec_cond_exprs (void)
 
   simple_dce_from_worklist (dce_ssa_names);
 
+  FOR_EACH_BB_FN (bb, cfun)
+    {
+      for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+       {
+         gassign *stmt = dyn_cast<gassign *> (gsi_stmt (gsi));
+         if (!stmt)
+           continue;
+
+         enum tree_code code;
+         gimple *g = NULL;
+         code = TREE_CODE (gimple_assign_lhs (stmt));
+         switch (code)
+           {
+           case ARRAY_REF:
+             g = gimple_expand_vec_set_expr (&gsi);
+             break;
+           default:
+             break;
+           }
+
+         if (g != NULL)
+           gsi_replace (&gsi, g, false);
+       }
+    }
+
   return 0;
 }
 
diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c
index 8efc77d986b..490d2caa582 100644
--- a/gcc/internal-fn.c
+++ b/gcc/internal-fn.c
@@ -115,6 +115,7 @@ init_internal_fns ()
 #define vec_condeq_direct { 0, 0, false }
 #define scatter_store_direct { 3, 1, false }
 #define len_store_direct { 3, 3, false }
+#define vec_set_direct { 3, 3, false }
 #define unary_direct { 0, 0, true }
 #define binary_direct { 0, 0, true }
 #define ternary_direct { 0, 0, true }
@@ -2658,6 +2659,42 @@ expand_vect_cond_mask_optab_fn (internal_fn, gcall 
*stmt, convert_optab optab)
 
 #define expand_vec_cond_mask_optab_fn expand_vect_cond_mask_optab_fn
 
+static void
+expand_vec_set_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
+{
+  tree op0 = gimple_call_arg (stmt, 0);
+  tree op1 = gimple_call_arg (stmt, 1);
+  tree op2 = gimple_call_arg (stmt, 2);
+  tree view_op0 = TREE_OPERAND (op0, 0);
+  rtx to_rtx = expand_expr (view_op0, NULL_RTX, VOIDmode, EXPAND_WRITE);
+
+  machine_mode outermode = TYPE_MODE (TREE_TYPE (view_op0));
+  scalar_mode innermode = GET_MODE_INNER (outermode);
+
+  rtx value = expand_expr (op1, NULL_RTX, VOIDmode, EXPAND_NORMAL);
+  rtx pos = expand_expr (op2, NULL_RTX, VOIDmode, EXPAND_NORMAL);
+
+  rtx temp_target = gen_reg_rtx (outermode);
+  emit_move_insn (temp_target, to_rtx);
+
+  class expand_operand ops[3];
+  enum insn_code icode = optab_handler (optab, outermode);
+
+  if (icode != CODE_FOR_nothing)
+    {
+      pos = convert_to_mode (E_SImode, pos, 0);
+
+      create_fixed_operand (&ops[0], temp_target);
+      create_input_operand (&ops[1], value, innermode);
+      create_input_operand (&ops[2], pos, GET_MODE (pos));
+      if (maybe_expand_insn (icode, 3, ops))
+       {
+         emit_move_insn (to_rtx, temp_target);
+         return;
+       }
+    }
+}
+
 static void
 expand_ABNORMAL_DISPATCHER (internal_fn, gcall *)
 {
@@ -3253,6 +3290,7 @@ multi_vector_optab_supported_p (convert_optab optab, 
tree_pair types,
 #define direct_fold_left_optab_supported_p direct_optab_supported_p
 #define direct_mask_fold_left_optab_supported_p direct_optab_supported_p
 #define direct_check_ptrs_optab_supported_p direct_optab_supported_p
+#define direct_vec_set_optab_supported_p direct_optab_supported_p
 
 /* Return the optab used by internal function FN.  */
 
diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
index 13e60828fcf..e6cfe1b6159 100644
--- a/gcc/internal-fn.def
+++ b/gcc/internal-fn.def
@@ -145,6 +145,8 @@ DEF_INTERNAL_OPTAB_FN (VCONDU, 0, vcondu, vec_condu)
 DEF_INTERNAL_OPTAB_FN (VCONDEQ, 0, vcondeq, vec_condeq)
 DEF_INTERNAL_OPTAB_FN (VCOND_MASK, 0, vcond_mask, vec_cond_mask)
 
+DEF_INTERNAL_OPTAB_FN (VEC_SET, 0, vec_set, vec_set)
+
 DEF_INTERNAL_OPTAB_FN (LEN_STORE, 0, len_store, len_store)
 
 DEF_INTERNAL_OPTAB_FN (WHILE_ULT, ECF_CONST | ECF_NOTHROW, while_ult, while)
-- 
2.27.0.90.geebb51ba8c

Reply via email to