Re: Re: [PATCH V2] RISC-V: Fix unexpected big LMUL choosing in dynamic LMUL model for non-adjacent load/store

2023-10-16 Thread 钟居哲
>> So we're inserting a dummy vect_perm element (that's live from the start?).
>> Would it make sense to instead increase the number of needed registers for
>> a load/store and handle this similarly to compute_nregs_for_mode?
>> Maybe also do it directly in compute_local_live_ranges and extend live_range
>> by an nregs?

Yeah, we're inserting a dummy vect_perm element whose live range goes from 0 to
the max_point of the bb.
I have tried doing it in 'compute_nregs_for_mode' and 'compute_local_live_ranges',
but at that stage we don't know the maximum point of the bb yet.
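
In plain C++ terms the idea amounts to something like the following standalone
sketch (made-up container types and names; the actual patch uses the
vectorizer's hash_map/pair types shown in the quoted hunk below):

/* Standalone model: one extra "vect_perm" object, live from program point 0
   to the last program point of the bb.  */
#include <cassert>
#include <map>
#include <string>
#include <utility>

using live_range_map = std::map<std::string, std::pair<unsigned, unsigned>>;

static void
add_dummy_perm_live_range (live_range_map &ranges, unsigned max_point)
{
  /* The perm indices are assumed live across the whole block.  */
  auto res = ranges.emplace ("vect_perm", std::make_pair (0u, max_point));
  assert (res.second && "perm indices recorded only once per bb");
  (void) res;
}

int
main ()
{
  live_range_map ranges = { { "_5", { 2, 7 } } };
  add_dummy_perm_live_range (ranges, /*max_point=*/9);
  return ranges.count ("vect_perm") == 1 ? 0 : 1;
}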

The other comments are addressed in V3:
[PATCH V3] RISC-V: Fix unexpected big LMUL choosing in dynamic LMUL model for
non-adjacent load/store (gnu.org)



juzhe.zh...@rivai.ai
 
From: Robin Dapp
Date: 2023-10-16 20:33
To: Juzhe-Zhong; gcc-patches
CC: rdapp.gcc; kito.cheng; kito.cheng; jeffreyalaw
Subject: Re: [PATCH V2] RISC-V: Fix unexpected big LMUL choosing in dynamic 
LMUL model for non-adjacent load/store
Hi Juzhe,
 
> +/* Get STORE value.  */
> +static tree
> +get_store_value (gimple *stmt)
> +{
> +  if (is_gimple_call (stmt) && gimple_call_internal_p (stmt))
> +    {
> +      if (gimple_call_internal_fn (stmt) == IFN_MASK_STORE)
> +        return gimple_call_arg (stmt, 3);
> +      else
> +        gcc_unreachable ();
> +    }
> +  else
> +    return gimple_assign_rhs1 (stmt);
> +}
 
This was something I was about to mention in the review of v1
that I already started.  This is better now.
 
> +   basic_block bb = e->src;
 
Rename to pred or so?  And then keep the original bb.
 
>           if (!live_range)
> -           continue;
> +           {
> +             if (single_succ_p (e->src))
> +               {
> +                 /*
> +                    <bb 3> [local count: 850510900]:
> +                      goto <bb ...>; [100.00%]
> +
> +                    Handle this case, we should extend live range of bb 3.
> +                 */
 
/* If the predecessor is an extended basic block, extend it and look for
   DEF's definition again.  */
 
> +                 bb = single_succ (e->src);
> +                 if (!program_points_per_bb.get (bb))
> +                   continue;
> +                 live_ranges = live_ranges_per_bb.get (bb);
> +                 max_point
> +                   = (*program_points_per_bb.get (bb)).length () - 1;
> +                 live_range = live_ranges->get (def);
> +                 if (!live_range)
> +                   continue;
> +               }
> +             else
> +               continue;
> +           }
 
We're approaching a complexity where reverse postorder would have
helped ;)  Maybe split out the live-range lookup into a separate function
get_live_range_for_bb ()?
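
Such a helper could look roughly like this standalone model (std:: containers
and illustrative names only, not the real hash maps or the actual patch):

#include <map>
#include <string>
#include <utility>

using live_range = std::pair<unsigned, unsigned>;
/* Per-basic-block map of variable name -> live range (simplified model).  */
using bb_live_ranges = std::map<std::string, live_range>;

/* Illustrative only: look up DEF's live range in BB; if nothing is found and
   BB has a single successor (modelled by SINGLE_SUCC, -1 meaning none), retry
   the lookup there.  Returns nullptr when no range exists.  */
static const live_range *
get_live_range_for_bb (const std::map<int, bb_live_ranges> &ranges_per_bb,
                       int bb, int single_succ, const std::string &def)
{
  auto it = ranges_per_bb.find (bb);
  if (it != ranges_per_bb.end ())
    {
      auto r = it->second.find (def);
      if (r != it->second.end ())
        return &r->second;
    }
  if (single_succ < 0)
    return nullptr;
  auto it2 = ranges_per_bb.find (single_succ);
  if (it2 == ranges_per_bb.end ())
    return nullptr;
  auto r2 = it2->second.find (def);
  return r2 != it2->second.end () ? &r2->second : nullptr;
}

int
main ()
{
  std::map<int, bb_live_ranges> ranges = { { 4, { { "_7", { 1, 5 } } } } };
  const live_range *r = get_live_range_for_bb (ranges, 3, 4, "_7");
  return r && r->second == 5 ? 0 : 1;
}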
 
> +  for (si = gsi_start_bb (bbs[i]); !gsi_end_p (si); gsi_next (&si))
> +    {
> +      if (!(is_gimple_assign (gsi_stmt (si))
> +            || is_gimple_call (gsi_stmt (si))))
> +        continue;
> +      stmt_vec_info stmt_info = vinfo->lookup_stmt (gsi_stmt (si));
> +      enum stmt_vec_info_type type
> +        = STMT_VINFO_TYPE (vect_stmt_to_vectorize (stmt_info));
> +      if ((type == load_vec_info_type || type == store_vec_info_type)
> +          && !adjacent_dr_p (STMT_VINFO_DATA_REF (stmt_info)))
> +        {
> +          /* For non-adjacent load/store STMT, we will potentially
> +             convert it into:
> +
> +               1. MASK_LEN_GATHER_LOAD (..., perm indices).
> +               2. Contiguous load/store + VEC_PERM (..., perm indices).
> +
> +             We will likely be using one more vector variable.  */
> +          unsigned int max_point
> +            = (*program_points_per_bb.get (bbs[i])).length () - 1;
> +          auto *live_ranges = live_ranges_per_bb.get (bbs[i]);
> +          bool existed_p = false;
> +          tree var = type == load_vec_info_type
> +                       ? gimple_get_lhs (gsi_stmt (si))
> +                       : get_store_value (gsi_stmt (si));
> +          tree sel_type = build_nonstandard_integer_type (
> +            TYPE_PRECISION (TREE_TYPE (var)), 1);
> +          tree sel = build_decl (UNKNOWN_LOCATION, VAR_DECL,
> +                                 get_identifier ("vect_perm"), sel_type);
> +          pair &live_range = live_ranges->get_or_insert (sel, &existed_p);
> +          gcc_assert (!existed_p);
> +          live_range = pair (0, max_point);
> +          if (dump_enabled_p ())
> +            dump_printf_loc (MSG_NOTE, vect_location,
> +                             "Add perm indice %T, start = 0, end = %d\n",
> +                             sel, max_point);
> +        }
> +    }
>    }
>  }
 
So we're inserting a dummy vect_perm element (that's live from the start?).
Would it make sense to instead increase the number of needed registers for
a load/store and handle this similarly to compute_nregs_for_mode?
Maybe also do it directly in compute_local_live_ranges and extend live_range
by an nregs?
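
As a rough illustration of that alternative, one could charge a non-adjacent
access extra registers along these lines (standalone sketch only;
compute_nregs_for_mode's real interface differs and the factor of two is just
an assumption):

/* Purely illustrative: account for the perm-index vector by charging a
   non-adjacent load/store one extra register group of the same mode, instead
   of inserting a dummy live-range object.  */
static unsigned
nregs_for_access (unsigned nregs_for_mode, bool adjacent_p)
{
  /* A gather/scatter or contiguous-access + VEC_PERM lowering needs the
     loaded/stored value *and* a vector of permute indices.  */
  return adjacent_p ? nregs_for_mode : 2 * nregs_for_mode;
}

int
main ()
{
  /* With LMUL = 4 one vector group is 4 registers; a non-adjacent access is
     charged 8 in this simplified model.  */
  return nregs_for_access (4, /*adjacent_p=*/false) == 8 ? 0 : 1;
}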
 
Regards
Robin
 
 


[PATCH V2] RISC-V: Fix unexpected big LMUL choosing in dynamic LMUL model for non-adjacent load/store

2023-10-16 Thread Juzhe-Zhong
Consider the following case:
int
bar (int *x, int a, int b, int n)
{
  x = __builtin_assume_aligned (x, __BIGGEST_ALIGNMENT__);
  int sum1 = 0;
  int sum2 = 0;
  for (int i = 0; i < n; ++i)
{
  sum1 += x[2*i] - a;
  sum1 += x[2*i+1] * b;
  sum2 += x[2*i] - b;
  sum2 += x[2*i+1] * a;
}
  return sum1 + sum2;
}

Before this patch:

bar:
ble a3,zero,.L5
csrr    t0,vlenb
csrr    a6,vlenb
slli    t1,t0,3
vsetvli a5,zero,e32,m4,ta,ma
sub sp,sp,t1
vid.v   v20
vmv.v.x v12,a1
vand.vi v4,v20,1
vmv.v.x v16,a2
vmseq.vi v4,v4,1
slli    t3,a6,2
vsetvli zero,a5,e32,m4,ta,ma
vmv1r.v v0,v4
viota.m v8,v4
add a7,t3,sp
vsetvli a5,zero,e32,m4,ta,mu
vand.vi v28,v20,-2
vadd.vi v4,v28,1
vs4r.v  v20,0(a7)        <- spill
vrgather.vv v24,v12,v8
vrgather.vv v20,v16,v8
vrgather.vv v24,v16,v8,v0.t
vrgather.vv v20,v12,v8,v0.t
vs4r.v  v4,0(sp)         <- spill
slli    a3,a3,1
addi    t4,a6,-1
neg t1,a6
vmv4r.v v0,v20
vmv.v.i v4,0
j   .L4
.L13:
vsetvli a5,zero,e32,m4,ta,ma
.L4:
mv  a7,a3
mv  a4,a3
bleu    a3,a6,.L3
csrr    a4,vlenb
.L3:
vmv.v.x v8,t4
vl4re32.v   v12,0(sp)    <- spill
vand.vv v20,v28,v8
vand.vv v8,v12,v8
vsetvli zero,a4,e32,m4,ta,ma
vle32.v v16,0(a0)
vsetvli a5,zero,e32,m4,ta,ma
add a3,a3,t1
vrgather.vv v12,v16,v20
add a0,a0,t3
vrgather.vv v20,v16,v8
vsub.vv v12,v12,v0
vsetvli zero,a4,e32,m4,tu,ma
vadd.vv v4,v4,v12
vmacc.vv v4,v24,v20
bgtu    a7,a6,.L13
csrr    a1,vlenb
slli    a1,a1,2
add a1,a1,sp
li  a4,-1
csrr    t0,vlenb
vsetvli a5,zero,e32,m4,ta,ma
vl4re32.v   v12,0(a1)    <- spill
vmv.v.i v8,0
vmul.vx v0,v12,a4
li  a2,0
slli    t1,t0,3
vadd.vi v0,v0,-1
vand.vi v0,v0,1
vmseq.vv v0,v0,v8
vand.vi v12,v12,1
vmerge.vvm  v16,v8,v4,v0
vmseq.vv v12,v12,v8
vmv.s.x v1,a2
vmv1r.v v0,v12
vredsum.vs  v16,v16,v1
vmerge.vvm  v8,v8,v4,v0
vmv.x.s a0,v16
vredsum.vs  v8,v8,v1
vmv.x.s a5,v8
add sp,sp,t1
addw    a0,a0,a5
jr  ra
.L5:
li  a0,0
ret

We can see there are multiple horrible register spillings.
The root cause of this issue is that for a scalar IR load:

_5 = *_4;

we didn't check whether it is a contiguous load/store or a gather/scatter
load/store.

Such a load/store will be translated into either:

   1. MASK_LEN_GATHER_LOAD (..., perm indices).
   2. Contiguous load/store + VEC_PERM (..., perm indices).

It's obvious that no matter which form we get, we end up consuming one more
vector register group (for the perm indices) that we didn't count before.

So for this case we pick LMUL = 4, which is an incorrect choice under the
dynamic LMUL cost model.

The key of this patch is:

  if ((type == load_vec_info_type || type == store_vec_info_type)
  && !adjacent_dr_p (STMT_VINFO_DATA_REF (stmt_info)))
{
   ...
}

Add one more register consumption if it is not an adjacent load/store.
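
To make the accounting concrete, here is a small standalone model (illustrative
only; adjacent_dr_p and the real cost model operate on data-reference info
rather than raw byte steps):

/* Illustrative model: a data reference is "adjacent" when consecutive
   iterations touch consecutive elements, i.e. the step equals the element
   size.  x[2*i] in the example has step == 2 * sizeof (int), so it is not
   adjacent and its vectorized form needs an extra vector group for the
   permute indices.  */
static bool
adjacent_access_p (long step_in_bytes, long element_size)
{
  return step_in_bytes == element_size;
}

static unsigned
vector_groups_for_access (long step_in_bytes, long element_size)
{
  /* One group for the loaded/stored value, plus one for the perm indices
     when the access is not adjacent.  */
  return adjacent_access_p (step_in_bytes, element_size) ? 1u : 2u;
}

int
main ()
{
  /* x[2*i] with int elements: step 8 bytes, element size 4 bytes.  */
  return vector_groups_for_access (8, 4) == 2 ? 0 : 1;
}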

After this patch, it picks LMUL = 2, which is optimal:

bar:
ble a3,zero,.L4
csrr    a6,vlenb
vsetvli a5,zero,e32,m2,ta,ma
vmv.v.x v6,a2
srli    a2,a6,1
vmv.v.x v4,a1
vid.v   v12
slli    a3,a3,1
vand.vi v0,v12,1
addi    t1,a2,-1
vmseq.vi v0,v0,1
slli    a6,a6,1
vsetvli zero,a5,e32,m2,ta,ma
neg a7,a2
viota.m v2,v0
vsetvli a5,zero,e32,m2,ta,mu
vrgather.vv v16,v4,v2
vrgather.vv v14,v6,v2
vrgather.vv v16,v6,v2,v0.t
vrgather.vv v14,v4,v2,v0.t
vand.vi v18,v12,-2
vmv.v.i v2,0
vadd.vi v20,v18,1
.L3:
minu    a4,a3,a2
vsetvli zero,a4,e32,m2,ta,ma
vle32.v v8,0(a0)
vsetvli a5,zero,e32,m2,ta,ma
vmv.v.x v4,t1
vand.vv v10,v18,v4
vrgather.vv v6,v8,v10
vsub.vv v6,v6,v14
vsetvli zero,a4,e32,m2,tu,ma
vadd.vv v2,v2,v6
vsetvli a1,zero,e32,m2,ta,ma
vand.vv v4,v20,v4
vrgather.vv v6,v8,v4
vsetvli zero,a4,e32,m2,tu,ma
mv  a4,a3
add a0,a0,a6
add a3,a3,a7
vmacc.vv v2,v16,v6
bgtu    a4,a2,.L3
vsetvli a1,zero,e32,m2,ta,ma
vand.vi v0,v12,1
vmv.v.i v4,0
li