Hi, Robin.

I model the scalar value initialization cost accurately with the following patch:

+/* Adjust the vectorization cost after calling
+   targetm.vectorize.builtin_vectorization_cost.  For some statements we
+   would like to fine-tune the cost further on top of the
+   targetm.vectorize.builtin_vectorization_cost handling, which doesn't
+   have any information on statement operation codes etc.  */
+
+static unsigned
+adjust_stmt_cost (class vec_info *vinfo, enum vect_cost_for_stmt kind,
+                 struct _stmt_vec_info *stmt_info, tree vectype, int count,
+                 int stmt_cost)
+{
+  gimple *stmt = stmt_info->stmt;
+  machine_mode smode = TYPE_MODE (TREE_TYPE (vectype));
+  switch (kind)
+    {
+      case scalar_to_vec: {
+       stmt_cost *= count;
+       /* Adjust cost by counting the scalar value initialization.  */
+       for (int i = 0; i < count; i++)
+         {
+           tree arg = gimple_arg (stmt, i);
+           if (poly_int_tree_p (arg))
+             {
+               poly_int64 value = tree_to_poly_int64 (arg);
+               int scalar_cost
+                 = riscv_const_insns (gen_int_mode (value, smode));
+               stmt_cost += scalar_cost;
+             }
+           else
+             stmt_cost += 1;
+         }
+       return stmt_cost;
+      }
+    default:
+      break;
+    }
+  return count * stmt_cost;
+}
+
 unsigned
 costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
                      stmt_vec_info stmt_info, slp_tree, tree vectype,
@@ -1082,9 +1122,13 @@ costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
         as one iteration of the VLA loop.  */
       if (where == vect_body && m_unrolled_vls_niters)
        m_unrolled_vls_stmts += count * m_unrolled_vls_niters;
+
+      if (vectype)
+       stmt_cost = adjust_stmt_cost (m_vinfo, kind, stmt_info, vectype, count,
+                                     stmt_cost);
     }

-  return record_stmt_cost (stmt_info, where, count * stmt_cost);
+  return record_stmt_cost (stmt_info, where, stmt_cost);
 }

32872 >> patt_126 1 times scalar_to_vec costs 3 in prologue

32872 spends 2 scalar instructions plus 1 scalar_to_vec cost:

li a4,-32768
addiw a4,a4,104
vmv.v.x v16,a4
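
To make that arithmetic concrete, here is a standalone sketch (not GCC
code; const_insns_estimate is a hypothetical stand-in for
riscv_const_insns, and the base scalar_to_vec cost of 1 is assumed):

#include <stdio.h>

/* Rough stand-in for riscv_const_insns: a constant in the signed
   12-bit immediate range takes a single li/addi, anything else in
   32 bits takes two instructions (the li + addiw pair above).  */
static int
const_insns_estimate (long value)
{
  return (value >= -2048 && value <= 2047) ? 1 : 2;
}

int
main (void)
{
  int count = 1;     /* one scalar_to_vec statement */
  int stmt_cost = 1; /* base cost from the vectorization hook */
  /* count * stmt_cost + scalar insns for the constant = 1 + 2 = 3,
     matching the "costs 3" dump line above.  */
  printf ("%d\n", count * stmt_cost
		  + const_insns_estimate (-32768 + 104));
  return 0;
}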

It seems reasonable, but it only fixes the test with
-march=rv64gcv_zvl256b and still fails with -march=rv64gcv_zvl4096b.



juzhe.zh...@rivai.ai
 
From: Robin Dapp
Date: 2024-01-11 19:15
To: juzhe.zh...@rivai.ai; Richard Biener
CC: rdapp.gcc; gcc-patches; kito.cheng; Kito.cheng; jeffreyalaw
Subject: Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3
> I think we shouldn't vectorize it with any vlen, since the non-vectorized 
> codegen is much better.
> And also, I have tested -msve-vector-bits=2048, ARM SVE doesn't vectorize it.
> With -zvl65536b, RVV Clang also doesn't vectorize it.
 
Of course I agree that optimizing everything to return 0 is
what should happen (tree-ssa-dom or vrp do that).  Unfortunately
they don't anymore after vectorizing the loop.
 
My point is that the cost comparison only has the scalar loop to
compare against, which is:
 
li a5,1
li a3,19
.L2:
mv a4,a5
addiw a5,a5,1
bne a5,a3,.L2
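
Hypothetically, a scalar loop of that shape could come from C like the
following (a sketch only, not the actual testcase):

/* b counts 1..18 and a trails it by one iteration; the whole
   function is foldable to a constant, which is what DOM/VRP
   would normally do.  */
int
f (void)
{
  int a = 0;
  for (int b = 1; b != 19; b++)
    a = b;	/* mv a4,a5 */
  return a;
}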
 
That's effectively 2 * 18 instructions, more than what we get
when vectorizing - therefore it's not totally outrageous to
vectorize here, and we need to make sure not to go overboard with
costing just for this example.
 
What does aarch64's cost comparison look like?  What's, comparatively,
more expensive with their tuning?  I've seen scalar_to_vec = 4 and
vec_to_scalar = 4, but a regular operation is already 2.  That would
equal scalar_to_vec = 2 for us (and is not sufficient), so something
else must still come into play.
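
Spelled out: scalar_to_vec / regular op = 4 / 2 = 2 there, and twice
our regular-op cost of 1 gives the scalar_to_vec = 2 mentioned above.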
 
Regards
Robin
 
