Re: [PATCH V11] VECT: Add decrement IV support in Loop Vectorizer

2023-05-19 Thread Richard Sandiford via Gcc-patches
"juzhe.zh...@rivai.ai"  writes:
>>> I don't think this is a property of decrementing IVs.  IIUC it's really
>>> a property of rgl->factor == 1 && factor == 1, where factor would need
>>> to be passed in by the caller.  Because of that, it should probably be
>>> a separate patch.
> Is it right that I just post this part of the code as a separate patch and then merge it?

No, not in its current form.  Like I say, the test should be based on
factors rather than TYPE_VECTOR_SUBPARTS.  But a fix for this problem
should come before the changes to IVs.

>>> That is, current LOAD_LEN targets have two properties (IIRC):
>>> (1) all vectors used in a given piece of vector code have the same byte size
>>> (2) lengths are measured in bytes rather than elements
>>> For all cases, including SVE, the number of controls needed for a scalar
>>> statement is equal to the number of vectors needed for that scalar
>>> statement.
>>> Because of (1), on current LOAD_LEN targets, the number of controls
>>> needed for a scalar statement is also proportional to the total number
>>> of bytes occupied by the vectors generated for that scalar statement.
>>> And because of (2), the total number of bytes is the only thing that
>>> matters, so all users of a particular control can use the same control
>>> value.
>>> E.g. on current LOAD_LEN targets, 2xV16QI and 2xV8HI would use the same
>>> control (with no adjustment).  2xV16QI means 32 elements, while 2xV8HI
>>> means 16 elements.  V16QI's nscalars_per_iter would therefore be double
>>> V8HI's, but V8HI's factor would be double V16QI's (2 vs 1), so things
>>> even out.
>>> The code structurally supports targets that count in elements rather
>>> than bytes, so that factor==1 for all element types.  See the
>>> "rgl->factor == 1 && factor == 1" case in:
>>>   if (rgl->max_nscalars_per_iter < nscalars_per_iter)
>>>     {
>>>       /* For now, we only support cases in which all loads and stores fall back
>>>          to VnQI or none do.  */
>>>       gcc_assert (!rgl->max_nscalars_per_iter
>>>                   || (rgl->factor == 1 && factor == 1)
>>>                   || (rgl->max_nscalars_per_iter * rgl->factor
>>>                       == nscalars_per_iter * factor));
>>>       rgl->max_nscalars_per_iter = nscalars_per_iter;
>>>       rgl->type = vectype;
>>>       rgl->factor = factor;
>>>     }
>>> But it hasn't been tested, since no current target uses it.
>>> I think the above part of the patch shows that the current "factor is
>>> always 1" path is in fact broken, and the patch is a correctness fix on
>>> targets that measure in elements rather than bytes.
>>> So I think the above part of the patch should go in ahead of the IV changes.
>>> But the test should be based on factor rather than TYPE_VECTOR_SUBPARTS.
> Since the length control measured in bytes instead of elements is not
> appropriate for RVV, you mean I can't support RVV auto-vectorization in
> upstream GCC middle-end and I can only support it in my downstream, is
> that right?

No.  I haven't said in this or previous reviews that something cannot be
supported in upstream GCC.

I'm saying that the code in theory supports counting in bytes *or*
counting in elements.  But only the first one has actually been tested.
And so, perhaps not surprisingly, the support for counting elements
needs a fix.

The fix in your patch looks like it's on the right lines, but it should be
based on factor rather than TYPE_VECTOR_SUBPARTS.

See get_len_load_store_mode for how this selection happens:

(1) IFN_LOAD_LEN itself always counts in elements rather than bytes.

(2) If a target has instructions that count in elements, it should
define load_len patterns for all vector modes that it supports.

(3) If a target has instructions that count in bytes, it should define
load_len patterns only for byte modes.  The vectoriser will then
use byte loads for all vector types (even things like V8HI).

For (2), the loop controls will always have a factor of 1.
For (3), the loop controls will have a factor equal to the element
size in bytes.  See:

  machine_mode vmode;
  if (get_len_load_store_mode (vecmode, is_load).exists (&vmode))
{
  nvectors = group_memory_nvectors (group_size * vf, nunits);
  vec_loop_lens *lens = &LOOP_VINFO_LENS (loop_vinfo);
  unsigned factor = (vecmode == vmode) ? 1 : GET_MODE_UNIT_SIZE (vecmode);
  vect_record_loop_len (loop_vinfo, lens, nvectors, vectype, factor);
  using_partial_vectors_p = true;
}

This part should work correctly for RVV and any future targets that
measure in elements rather than bytes.  The problem is here:

tree final_len
  = vect_get_loop_len (loop_vinfo, loop_lens,
   vec_num * ncopies,
   vec_num * j + i);
tree ptr = build_int_cst (ref_type,
  align * BITS_PER_UNIT);
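
To make that concrete, here is a rough sketch (illustration only, not the
actual fix) of how the adjustment quoted in the patch could instead be keyed
off the factors, assuming vect_get_loop_len gains a FACTOR parameter supplied
by callers like the one above; the body of the adjustment is kept as in the
quoted patch and only the guard changes:

      else if (rgl->factor == 1 && factor == 1)
        {
          /* The stored control counts elements of RGL->type; scale it
             for VECTYPE, as in the quoted patch.  */
          tree loop_len = rgl->controls[index];
          poly_int64 nunits1 = TYPE_VECTOR_SUBPARTS (rgl->type);
          poly_int64 nunits2 = TYPE_VECTOR_SUBPARTS (vectype);
          if (maybe_ne (nunits1, nunits2))
            {
              gcc_assert (multiple_p (nunits1, nunits2));
              unsigned int ratio = exact_div (nunits1, nunits2).to_constant ();
              gimple_seq seq = NULL;
              loop_len = gimple_build (&seq, RDIV_EXPR, iv_type, loop_len,
                                       build_int_cst (iv_type, ratio));
              if (seq)
                gsi_insert_seq_before (gsi, seq, GSI_SAME_STMT);
            }
          return loop_len;
        }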


Re: [PATCH V11] VECT: Add decrement IV support in Loop Vectorizer

2023-05-19 Thread Richard Sandiford via Gcc-patches
Thanks for the update.  I'll split this review into two pieces.
Second piece to follow (not sure when, but hopefully soon).

juzhe.zh...@rivai.ai writes:
> diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> index ed0166fedab..6f49bdee009 100644
> --- a/gcc/tree-vect-loop.cc
> +++ b/gcc/tree-vect-loop.cc
> @@ -10364,12 +10375,14 @@ vect_record_loop_len (loop_vec_info loop_vinfo, 
> vec_loop_lens *lens,
> rgroup that operates on NVECTORS vectors, where 0 <= INDEX < NVECTORS.  */
>  
>  tree
> -vect_get_loop_len (loop_vec_info loop_vinfo, vec_loop_lens *lens,
> -unsigned int nvectors, unsigned int index)
> +vect_get_loop_len (loop_vec_info loop_vinfo, gimple_stmt_iterator *gsi,
> +vec_loop_lens *lens, unsigned int nvectors, tree vectype,
> +unsigned int index)
>  {
>rgroup_controls *rgl = &(*lens)[nvectors - 1];
>bool use_bias_adjusted_len =
>  LOOP_VINFO_PARTIAL_LOAD_STORE_BIAS (loop_vinfo) != 0;
> +  tree iv_type = LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo);
>  
>/* Populate the rgroup's len array, if this is the first time we've
>   used it.  */
> @@ -10400,6 +10413,26 @@ vect_get_loop_len (loop_vec_info loop_vinfo, 
> vec_loop_lens *lens,
>  
>if (use_bias_adjusted_len)
>  return rgl->bias_adjusted_ctrl;
> +  else if (LOOP_VINFO_USING_DECREMENTING_IV_P (loop_vinfo))
> +{
> +  tree loop_len = rgl->controls[index];
> +  poly_int64 nunits1 = TYPE_VECTOR_SUBPARTS (rgl->type);
> +  poly_int64 nunits2 = TYPE_VECTOR_SUBPARTS (vectype);
> +  if (maybe_ne (nunits1, nunits2))
> + {
> +   /* A loop len for data type X can be reused for data type Y
> +  if X has N times more elements than Y and if Y's elements
> +  are N times bigger than X's.  */
> +   gcc_assert (multiple_p (nunits1, nunits2));
> +   unsigned int factor = exact_div (nunits1, nunits2).to_constant ();
> +   gimple_seq seq = NULL;
> +   loop_len = gimple_build (&seq, RDIV_EXPR, iv_type, loop_len,
> +                            build_int_cst (iv_type, factor));
> +   if (seq)
> + gsi_insert_seq_before (gsi, seq, GSI_SAME_STMT);
> + }
> +  return loop_len;
> +}

I don't think this is a property of decrementing IVs.  IIUC it's really
a property of rgl->factor == 1 && factor == 1, where factor would need
to be passed in by the caller.  Because of that, it should probably be
a separate patch.

That is, current LOAD_LEN targets have two properties (IIRC):

(1) all vectors used in a given piece of vector code have the same byte size
(2) lengths are measured in bytes rather than elements

For all cases, including SVE, the number of controls needed for a scalar
statement is equal to the number of vectors needed for that scalar
statement.

Because of (1), on current LOAD_LEN targets, the number of controls
needed for a scalar statement is also proportional to the total number
of bytes occupied by the vectors generated for that scalar statement.
And because of (2), the total number of bytes is the only thing that
matters, so all users of a particular control can use the same control
value.

E.g. on current LOAD_LEN targets, 2xV16QI and 2xV8HI would use the same
control (with no adjustment).  2xV16QI means 32 elements, while 2xV8HI
means 16 elements.  V16QI's nscalars_per_iter would therefore be double
V8HI's, but V8HI's factor would be double V16QI's (2 vs 1), so things
even out.
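
Putting illustrative numbers on that (not taken from any particular testcase):

  V16QI statement: nscalars_per_iter = 32, factor = 1  ->  32 * 1 = 32 bytes
  V8HI  statement: nscalars_per_iter = 16, factor = 2  ->  16 * 2 = 32 bytes

Both products are the same, so a single byte-counting control serves both.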

The code structurally supports targets that count in elements rather
than bytes, so that factor==1 for all element types.  See the
"rgl->factor == 1 && factor == 1" case in:

  if (rgl->max_nscalars_per_iter < nscalars_per_iter)
{
  /* For now, we only support cases in which all loads and stores fall back
 to VnQI or none do.  */
  gcc_assert (!rgl->max_nscalars_per_iter
  || (rgl->factor == 1 && factor == 1)
  || (rgl->max_nscalars_per_iter * rgl->factor
  == nscalars_per_iter * factor));
  rgl->max_nscalars_per_iter = nscalars_per_iter;
  rgl->type = vectype;
  rgl->factor = factor;
}

But it hasn't been tested, since no current target uses it.

I think the above part of the patch shows that the current "factor is
always 1" path is in fact broken, and the patch is a correctness fix on
targets that measure in elements rather than bytes.

So I think the above part of the patch should go in ahead of the IV changes.
But the test should be based on factor rather than TYPE_VECTOR_SUBPARTS.

Thanks,
Richard


Re: [PATCH] [PR96339] AArch64: Optimise svlast[ab]

2023-05-19 Thread Richard Sandiford via Gcc-patches
Tejas Belagod  writes:
> Am I correct to understand that we still need to check for the case when
> there's a repeating non-zero element in the case of NELTS_PER_PATTERN == 2?
> eg. { 0, 0, 1, 1, 1, 1,} which should be encoded as {0, 0, 1, 1} with
> NPATTERNS = 2 ?

Yeah, that's right.  The current handling for NPATTERNS==2 looked
good to me.  It was the other two cases that I was worried about.

Thanks,
Richard


Re: [PATCH 1/4] Missed opportunity to use [SU]ABD

2023-05-18 Thread Richard Sandiford via Gcc-patches
Thanks for the update.  Some of these comments would have applied
to the first version, so sorry for not catching them first time.

 writes:
> From: oluade01 
>
> This adds a recognition pattern for the non-widening
> absolute difference (ABD).
>
> gcc/ChangeLog:
>
>   * doc/md.texi (sabd, uabd): Document them.
>   * internal-fn.def (ABD): Use new optab.
>   * optabs.def (sabd_optab, uabd_optab): New optabs,
>   * tree-vect-patterns.cc (vect_recog_absolute_difference):
>   Recognize the following idiom abs (a - b).
>   (vect_recog_sad_pattern): Refactor to use
>   vect_recog_absolute_difference.
>   (vect_recog_abd_pattern): Use patterns found by
>   vect_recog_absolute_difference to build a new ABD
>   internal call.
> ---
>  gcc/doc/md.texi   |  10 ++
>  gcc/internal-fn.def   |   3 +
>  gcc/optabs.def|   2 +
>  gcc/tree-vect-patterns.cc | 255 +-
>  4 files changed, 239 insertions(+), 31 deletions(-)
>
> diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
> index 
> 07bf8bdebffb2e523f25a41f2b57e43c0276b745..3e65584d7efcd301f2c96a40edd82d30b84462b8
>  100644
> --- a/gcc/doc/md.texi
> +++ b/gcc/doc/md.texi
> @@ -5778,6 +5778,16 @@ Other shift and rotate instructions, analogous to the
>  Vector shift and rotate instructions that take vectors as operand 2
>  instead of a scalar type.
>  
> +@cindex @code{uabd@var{m}} instruction pattern
> +@cindex @code{sabd@var{m}} instruction pattern
> +@item @samp{uabd@var{m}}, @samp{sabd@var{m}}
> +Signed and unsigned absolute difference instructions.  These
> +instructions find the difference between operands 1 and 2
> +then return the absolute value.  A C code equivalent would be:
> +@smallexample
> +op0 = op0 > op1 ? op0 - op1 : op1 - op0;

Should be:

  op0 = op1 > op2 ? op1 - op2 : op2 - op1;

since op0 is the output.

> +@end smallexample
> +
>  @cindex @code{avg@var{m}3_floor} instruction pattern
>  @cindex @code{uavg@var{m}3_floor} instruction pattern
>  @item @samp{avg@var{m}3_floor}
> diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
> index 
> 7fe742c2ae713e7152ab05cfdfba86e4e0aa3456..0f1724ecf37a31c231572edf90b5577e2d82f468
>  100644
> --- a/gcc/internal-fn.def
> +++ b/gcc/internal-fn.def
> @@ -167,6 +167,9 @@ DEF_INTERNAL_OPTAB_FN (FMS, ECF_CONST, fms, ternary)
>  DEF_INTERNAL_OPTAB_FN (FNMA, ECF_CONST, fnma, ternary)
>  DEF_INTERNAL_OPTAB_FN (FNMS, ECF_CONST, fnms, ternary)
>  
> +DEF_INTERNAL_SIGNED_OPTAB_FN (ABD, ECF_CONST | ECF_NOTHROW, first,
> +   sabd, uabd, binary)
> +
>  DEF_INTERNAL_SIGNED_OPTAB_FN (AVG_FLOOR, ECF_CONST | ECF_NOTHROW, first,
> savg_floor, uavg_floor, binary)
>  DEF_INTERNAL_SIGNED_OPTAB_FN (AVG_CEIL, ECF_CONST | ECF_NOTHROW, first,
> diff --git a/gcc/optabs.def b/gcc/optabs.def
> index 
> 695f5911b300c9ca5737de9be809fa01aabe5e01..29bc92281a2175f898634cbe6af63c18021e5268
>  100644
> --- a/gcc/optabs.def
> +++ b/gcc/optabs.def
> @@ -359,6 +359,8 @@ OPTAB_D (mask_fold_left_plus_optab, 
> "mask_fold_left_plus_$a")
>  OPTAB_D (extract_last_optab, "extract_last_$a")
>  OPTAB_D (fold_extract_last_optab, "fold_extract_last_$a")
>  
> +OPTAB_D (uabd_optab, "uabd$a3")
> +OPTAB_D (sabd_optab, "sabd$a3")
>  OPTAB_D (savg_floor_optab, "avg$a3_floor")
>  OPTAB_D (uavg_floor_optab, "uavg$a3_floor")
>  OPTAB_D (savg_ceil_optab, "avg$a3_ceil")
> diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
> index 
> a49b09539776c0056e77f99b10365d0a8747fbc5..50f1822f220c023027f4b0f777965f3757842fa2
>  100644
> --- a/gcc/tree-vect-patterns.cc
> +++ b/gcc/tree-vect-patterns.cc
> @@ -770,6 +770,93 @@ vect_split_statement (vec_info *vinfo, stmt_vec_info 
> stmt2_info, tree new_rhs,
>  }
>  }
>  
> +/* Look for the following pattern
> + X = x[i]
> + Y = y[i]
> + DIFF = X - Y
> + DAD = ABS_EXPR <DIFF>
> +
> +   ABS_STMT should point to a statement of code ABS_EXPR or ABSU_EXPR.
> +   If REJECT_UNSIGNED is true it aborts if the type of ABS_STMT is unsigned.
> +   HALF_TYPE and UNPROM will be set should the statement be found to
> +   be a widened operation.
> +   DIFF_OPRNDS will be set to the two inputs of the MINUS_EXPR preceding
> +   ABS_STMT, otherwise it will be set the operations found by
> +   vect_widened_op_tree.
> + */
> +static bool
> +vect_recog_absolute_difference (vec_info *vinfo, gassign *abs_stmt,
> + tree *half_type, bool reject_unsigned,
> + vect_unpromoted_value unprom[2],
> + tree diff_oprnds[2])
> +{
> +  if (!abs_stmt)
> +return false;
> +
> +  /* FORNOW.  Can continue analyzing the def-use chain when this stmt in a 
> phi
> + inside the loop (in case we are analyzing an outer-loop).  */
> +  enum tree_code code = gimple_assign_rhs_code (abs_stmt);
> +  if (code != ABS_EXPR && code != ABSU_EXPR)
> +return false;
> +
> +  tree abs_oprnd = gimple_assign_rhs1 (abs_stmt);
> + 

Re: [aarch64] Code-gen for vector initialization involving constants

2023-05-18 Thread Richard Sandiford via Gcc-patches
Prathamesh Kulkarni  writes:
> On Thu, 18 May 2023 at 13:37, Richard Sandiford
>  wrote:
>>
>> Prathamesh Kulkarni  writes:
>> > On Tue, 16 May 2023 at 00:29, Richard Sandiford
>> >  wrote:
>> >>
>> >> Prathamesh Kulkarni  writes:
>> >> > Hi Richard,
>> >> > After committing the interleave+zip1 patch for vector initialization,
>> >> > it seems to regress the s32 case for this patch:
>> >> >
>> >> > int32x4_t f_s32(int32_t x)
>> >> > {
>> >> >   return (int32x4_t) { x, x, x, 1 };
>> >> > }
>> >> >
>> >> > code-gen:
>> >> > f_s32:
>> >> > movi v30.2s, 0x1
>> >> > fmov s31, w0
>> >> > dup v0.2s, v31.s[0]
>> >> > ins v30.s[0], v31.s[0]
>> >> > zip1 v0.4s, v0.4s, v30.4s
>> >> > ret
>> >> >
>> >> > instead of expected code-gen:
>> >> > f_s32:
>> >> > movi v31.2s, 0x1
>> >> > dup v0.4s, w0
>> >> > ins v0.s[3], v31.s[0]
>> >> > ret
>> >> >
>> >> > Cost for fallback sequence: 16
>> >> > Cost for interleave and zip sequence: 12
>> >> >
>> >> > For the above case, the cost for interleave+zip1 sequence is computed 
>> >> > as:
>> >> > halves[0]:
>> >> > (set (reg:V2SI 96)
>> >> > (vec_duplicate:V2SI (reg/v:SI 93 [ x ])))
>> >> > cost = 8
>> >> >
>> >> > halves[1]:
>> >> > (set (reg:V2SI 97)
>> >> > (const_vector:V2SI [
>> >> > (const_int 1 [0x1]) repeated x2
>> >> > ]))
>> >> > (set (reg:V2SI 97)
>> >> > (vec_merge:V2SI (vec_duplicate:V2SI (reg/v:SI 93 [ x ]))
>> >> > (reg:V2SI 97)
>> >> > (const_int 1 [0x1])))
>> >> > cost = 8
>> >> >
>> >> > followed by:
>> >> > (set (reg:V4SI 95)
>> >> > (unspec:V4SI [
>> >> > (subreg:V4SI (reg:V2SI 96) 0)
>> >> > (subreg:V4SI (reg:V2SI 97) 0)
>> >> > ] UNSPEC_ZIP1))
>> >> > cost = 4
>> >> >
>> >> > So the total cost becomes
>> >> > max(costs[0], costs[1]) + zip1_insn_cost
>> >> > = max(8, 8) + 4
>> >> > = 12
>> >> >
>> >> > While the fallback rtl sequence is:
>> >> > (set (reg:V4SI 95)
>> >> > (vec_duplicate:V4SI (reg/v:SI 93 [ x ])))
>> >> > cost = 8
>> >> > (set (reg:SI 98)
>> >> > (const_int 1 [0x1]))
>> >> > cost = 4
>> >> > (set (reg:V4SI 95)
>> >> > (vec_merge:V4SI (vec_duplicate:V4SI (reg:SI 98))
>> >> > (reg:V4SI 95)
>> >> > (const_int 8 [0x8])))
>> >> > cost = 4
>> >> >
>> >> > So total cost = 8 + 4 + 4 = 16, and we choose the interleave+zip1 
>> >> > sequence.
>> >> >
>> >> > I think the issue is probably that for the interleave+zip1 sequence we 
>> >> > take
>> >> > max(costs[0], costs[1]) to reflect that both halves are interleaved,
>> >> > but for the fallback seq we use seq_cost, which assumes serial execution
>> >> > of insns in the sequence.
>> >> > For above fallback sequence,
>> >> > set (reg:V4SI 95)
>> >> > (vec_duplicate:V4SI (reg/v:SI 93 [ x ])))
>> >> > and
>> >> > (set (reg:SI 98)
>> >> > (const_int 1 [0x1]))
>> >> > could be executed in parallel, which would make its cost max(8, 4) + 4 
>> >> > = 12.
>> >>
>> >> Agreed.
>> >>
>> >> A good-enough substitute for this might be to ignore scalar moves
>> >> (for both alternatives) when costing for speed.
>> > Thanks for the suggestions. Just wondering for aarch64, if there's an easy
>> > way we can check if insn is a scalar move, similar to riscv's 
>> > scalar_move_insn_p
>> > that checks if get_attr_type(insn) is TYPE_VIMOVXV or TYPE_VFMOVFV ?
>>
>> It should be enough to check that the pattern is a SET:
>>
>> (a) whose SET_DEST has a scalar mode and
>> (b) whose SET_SRC is an aarch64_mov_operand
> Hi Richard,
> Thanks for the suggestions, the attached patch calls seq_cost to compute
> cost for sequence and then subtracts cost of each scalar move insn from it.
> Does that look OK ?
> The patch is under bootstrap+test on aarch64-linux-gnu.

Yeah, the patch looks reasonable (some comments below).  The testing
for this kind of patch is more than a formality though, so it would
be good to wait to see if the tests pass.

> [...]
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 29dbacfa917..7efd896d364 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -22332,6 +22332,32 @@ aarch64_unzip_vector_init (machine_mode mode, rtx 
> vals, bool even_p)
>return gen_rtx_PARALLEL (new_mode, vec);
>  }
>  
> +/* Return true if INSN is a scalar move.  */
> +
> +static bool
> +scalar_move_insn_p (rtx_insn *insn)
> +{
> +  rtx set = single_set (insn);
> +  if (!set)
> +return false;
> +  rtx src = SET_SRC (set);
> +  rtx dest = SET_DEST (set);
> +  return is_a<scalar_mode>(GET_MODE (dest)) && aarch64_mov_operand_p (src, 
> GET_MODE (src));

Long line.
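
One possible wrapping, just as an example:

  return (is_a<scalar_mode>(GET_MODE (dest))
          && aarch64_mov_operand_p (src, GET_MODE (src)));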

> +}
> +
> +/* Ignore cost for scalar moves from cost of sequence. This function is 
> called
> +   for calculating sequence costs in aarch64_expand_vector_init.  */
> +
> +static unsigned
> +seq_cost_ignore_scalar_moves (rtx_insn *seq, bool speed)

Maybe more readable as "ignoring" rather 

Re: [PATCH] Machine_Mode: Extend machine_mode from 8 to 16 bits

2023-05-18 Thread Richard Sandiford via Gcc-patches
pan2...@intel.com writes:
> diff --git a/gcc/rtl-ssa/accesses.h b/gcc/rtl-ssa/accesses.h
> index c5180b9308a..c2103a5cb5c 100644
> --- a/gcc/rtl-ssa/accesses.h
> +++ b/gcc/rtl-ssa/accesses.h
> @@ -215,7 +215,11 @@ private:
>  
>// The values returned by the accessors above.
>unsigned int m_regno;
> -  access_kind m_kind : 8;
> +
> +  // The value returned by the accessor above.
> +  machine_mode m_mode : MACHINE_MODE_BITSIZE;
> +
> +  access_kind m_kind : 2;

There's no need to repeat the comment.  Just:

  // The values returned by the accessors above.
  unsigned int m_regno;
  machine_mode m_mode : MACHINE_MODE_BITSIZE;
  access_kind m_kind : 2;

would be enough.

OK with that change, thanks.  (There's no need to post the updated patch.)

Richard


Re: [aarch64] Code-gen for vector initialization involving constants

2023-05-18 Thread Richard Sandiford via Gcc-patches
Prathamesh Kulkarni  writes:
> On Tue, 16 May 2023 at 00:29, Richard Sandiford
>  wrote:
>>
>> Prathamesh Kulkarni  writes:
>> > Hi Richard,
>> > After committing the interleave+zip1 patch for vector initialization,
>> > it seems to regress the s32 case for this patch:
>> >
>> > int32x4_t f_s32(int32_t x)
>> > {
>> >   return (int32x4_t) { x, x, x, 1 };
>> > }
>> >
>> > code-gen:
>> > f_s32:
>> > movi v30.2s, 0x1
>> > fmov s31, w0
>> > dup v0.2s, v31.s[0]
>> > ins v30.s[0], v31.s[0]
>> > zip1 v0.4s, v0.4s, v30.4s
>> > ret
>> >
>> > instead of expected code-gen:
>> > f_s32:
>> > movi v31.2s, 0x1
>> > dup v0.4s, w0
>> > ins v0.s[3], v31.s[0]
>> > ret
>> >
>> > Cost for fallback sequence: 16
>> > Cost for interleave and zip sequence: 12
>> >
>> > For the above case, the cost for interleave+zip1 sequence is computed as:
>> > halves[0]:
>> > (set (reg:V2SI 96)
>> > (vec_duplicate:V2SI (reg/v:SI 93 [ x ])))
>> > cost = 8
>> >
>> > halves[1]:
>> > (set (reg:V2SI 97)
>> > (const_vector:V2SI [
>> > (const_int 1 [0x1]) repeated x2
>> > ]))
>> > (set (reg:V2SI 97)
>> > (vec_merge:V2SI (vec_duplicate:V2SI (reg/v:SI 93 [ x ]))
>> > (reg:V2SI 97)
>> > (const_int 1 [0x1])))
>> > cost = 8
>> >
>> > followed by:
>> > (set (reg:V4SI 95)
>> > (unspec:V4SI [
>> > (subreg:V4SI (reg:V2SI 96) 0)
>> > (subreg:V4SI (reg:V2SI 97) 0)
>> > ] UNSPEC_ZIP1))
>> > cost = 4
>> >
>> > So the total cost becomes
>> > max(costs[0], costs[1]) + zip1_insn_cost
>> > = max(8, 8) + 4
>> > = 12
>> >
>> > While the fallback rtl sequence is:
>> > (set (reg:V4SI 95)
>> > (vec_duplicate:V4SI (reg/v:SI 93 [ x ])))
>> > cost = 8
>> > (set (reg:SI 98)
>> > (const_int 1 [0x1]))
>> > cost = 4
>> > (set (reg:V4SI 95)
>> > (vec_merge:V4SI (vec_duplicate:V4SI (reg:SI 98))
>> > (reg:V4SI 95)
>> > (const_int 8 [0x8])))
>> > cost = 4
>> >
>> > So total cost = 8 + 4 + 4 = 16, and we choose the interleave+zip1 sequence.
>> >
>> > I think the issue is probably that for the interleave+zip1 sequence we take
>> > max(costs[0], costs[1]) to reflect that both halves are interleaved,
>> > but for the fallback seq we use seq_cost, which assumes serial execution
>> > of insns in the sequence.
>> > For above fallback sequence,
>> > set (reg:V4SI 95)
>> > (vec_duplicate:V4SI (reg/v:SI 93 [ x ])))
>> > and
>> > (set (reg:SI 98)
>> > (const_int 1 [0x1]))
>> > could be executed in parallel, which would make its cost max(8, 4) + 4 = 
>> > 12.
>>
>> Agreed.
>>
>> A good-enough substitute for this might be to ignore scalar moves
>> (for both alternatives) when costing for speed.
> Thanks for the suggestions. Just wondering for aarch64, if there's an easy
> way we can check if insn is a scalar move, similar to riscv's 
> scalar_move_insn_p
> that checks if get_attr_type(insn) is TYPE_VIMOVXV or TYPE_VFMOVFV ?

It should be enough to check that the pattern is a SET:

(a) whose SET_DEST has a scalar mode and
(b) whose SET_SRC is an aarch64_mov_operand

>> > I was wondering if we should we make cost for interleave+zip1 sequence
>> > more conservative
>> > by not taking max, but summing up costs[0] + costs[1] even for speed ?
>> > For this case,
>> > that would be 8 + 8 + 4 = 20.
>> >
>> > It generates the fallback sequence for other cases (s8, s16, s64) from
>> > the test-case.
>>
>> What does it do for the tests in the interleave+zip1 patch?  If it doesn't
>> make a difference there then it sounds like we don't have enough tests. :)
> Oh right, the tests in interleave+zip1 patch only check for s16 case,
> sorry about that :/
> Looking briefly at the code generated for s8, s32 and s64 case,
> (a) s8, and s16 seem to use same sequence for all cases.
> (b) s64 seems to use fallback sequence.
> (c) For vec-init-21.c, s8 and s16 cases prefer fallback sequence
> because costs are tied,
> while s32 case prefers interleave+zip1:
>
> int32x4_t f_s32(int32_t x, int32_t y)
> {
>   return (int32x4_t) { x, y, 1, 2 };
> }
>
> Code-gen with interleave+zip1 sequence:
> f_s32:
> movi v31.2s, 0x1
> movi v0.2s, 0x2
> ins v31.s[0], w0
> ins v0.s[0], w1
> zip1 v0.4s, v31.4s, v0.4s
> ret
>
> Code-gen with fallback sequence:
> f_s32:
> adrp x2, .LC0
> ldr q0, [x2, #:lo12:.LC0]
> ins v0.s[0], w0
> ins v0.s[1], w1
> ret
>
> Fallback sequence cost = 20
> interleave+zip1 sequence cost = 12
> I assume interleave+zip1 sequence is better in this case (chosen currently) ?
>
> I will send a patch to add cases for s8, s16 and s64 in a follow up patch 
> soon.
>>
>> Summing is only conservative if the fallback sequence is somehow "safer".
>> But I don't think it is.   Building an N-element vector from N scalars
>> can be done using N instructions in the fallback case and N+1 

Re: Taking Over MIPS Maintenance

2023-05-18 Thread Richard Sandiford via Gcc
YunQiang Su  writes:
> Greetings all,
>
> I would like to self-nominate as the new GCC MIPS maintainer. Matthew Fortune 
> is listed in MAINTAINERS as the current maintainer of GCC's MIPS Port. 
> However, it has been years since he left MIPS Technologies and had since been 
> inactive.
>
> I currently work for CIP United Inc., which is the exclusive licensor and 
> operator of MIPS IPs in China, Hong Kong, and Macau. Part of our operations 
> include maintaining open source software solutions for MIPS and we are 
> looking to continue maintaining GCC's MIPS port. As the director of the 
> company's software ecosystem department, I have been working with GCC and 
> contributed code to the upstream repository since 2021. In September 2021, I 
> was given write access to the repository:
>
> https://gcc.gnu.org/git/?p=gcc.git=search=HEAD=author=YunQiang+Su
>
> Please let me know about your thoughts on this proposal.

FWIW, I'd support this.  The MIPS port has been unmaintained for
many years now.  As the previous maintainer before Matthew, I've
tried to cover the area a bit.  But

(a) It's now close to 15 years since I did any meaningful MIPS work,
so I've forgotten a great deal.

(b) Most new work will be specific to MIPSr6, which I have never used.

(c) It's been very difficult to find the time.

It would be more usual to wait a bit longer until someone becomes
maintainer.  But IMO that's only sensible when there's an existing
maintainer to cover the interim.

Thanks,
Richard


Re: [PATCH] rtl: AArch64: New RTL for ABD

2023-05-16 Thread Richard Sandiford via Gcc-patches
Sorry for the slow reply.

Oluwatamilore Adebayo  writes:
> From afa416dab831795f7e1114da2fb9e94ea3b8c519 Mon Sep 17 00:00:00 2001
> From: oluade01 
> Date: Fri, 14 Apr 2023 15:10:07 +0100
> Subject: [PATCH 2/4] AArch64: New RTL for ABD
>
> This patch adds new RTL and tests for sabd and uabd
>
> PR tree-optimization/109156
>
> gcc/ChangeLog:
>
> * config/aarch64/aarch64-simd-builtins.def (sabd, uabd):
> Change the mode to 3.
> * config/aarch64/aarch64-simd.md (aarch64_abd):
> Rename to abd3.
> * config/aarch64/aarch64-sve.md (abd_3): Rename
> to abd3.

Thanks.  These changes look good, once the vectoriser part is sorted,
but I have some comments about the tests:

> diff --git a/gcc/testsuite/gcc.target/aarch64/abd.h 
> b/gcc/testsuite/gcc.target/aarch64/abd.h
> new file mode 100644
> index 
> ..bc38e8508056cf2623cddd6053bf1cec3fa4ece4
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/abd.h
> @@ -0,0 +1,62 @@
> +#ifdef ABD_IDIOM
> +
> +#define TEST1(S, TYPE) \
> +void fn_##S##_##TYPE (S TYPE * restrict a, \
> + S TYPE * restrict b,  \
> + S TYPE * restrict out) {  \
> +  for (int i = 0; i < N; i++) {\
> +signed TYPE diff = b[i] - a[i];\
> +out[i] = diff > 0 ? diff : -diff;  \
> +} }
> +
> +#define TEST2(S, TYPE1, TYPE2) \
> +void fn_##S##_##TYPE1##_##TYPE1##_##TYPE2  \
> +(S TYPE1 * restrict a, \
> + S TYPE1 * restrict b, \
> + S TYPE2 * restrict out) { \
> +  for (int i = 0; i < N; i++) {\
> +signed TYPE2 diff = b[i] - a[i];   \
> +out[i] = diff > 0 ? diff : -diff;  \
> +} }
> +
> +#define TEST3(S, TYPE1, TYPE2, TYPE3)  \
> +void fn_##S##_##TYPE1##_##TYPE2##_##TYPE3  \
> +(S TYPE1 * restrict a, \
> + S TYPE2 * restrict b, \
> + S TYPE3 * restrict out) { \
> +  for (int i = 0; i < N; i++) {\
> +signed TYPE3 diff = b[i] - a[i];   \
> +out[i] = diff > 0 ? diff : -diff;  \
> +} }
> +
> +#endif
> +
> +#ifdef ABD_ABS
> +
> +#define TEST1(S, TYPE) \
> +void fn_##S##_##TYPE (S TYPE * restrict a, \
> + S TYPE * restrict b,  \
> + S TYPE * restrict out) {  \
> +  for (int i = 0; i < N; i++)  \
> +out[i] = __builtin_abs(a[i] - b[i]);   \
> +}
> +
> +#define TEST2(S, TYPE1, TYPE2) \
> +void fn_##S##_##TYPE1##_##TYPE1##_##TYPE2  \
> +(S TYPE1 * restrict a, \
> + S TYPE1 * restrict b, \
> + S TYPE2 * restrict out) { \
> +  for (int i = 0; i < N; i++)  \
> +out[i] = __builtin_abs(a[i] - b[i]);   \
> +}
> +
> +#define TEST3(S, TYPE1, TYPE2, TYPE3)  \
> +void fn_##S##_##TYPE1##_##TYPE2##_##TYPE3  \
> +(S TYPE1 * restrict a, \
> + S TYPE2 * restrict b, \
> + S TYPE3 * restrict out) { \
> +  for (int i = 0; i < N; i++)  \
> +out[i] = __builtin_abs(a[i] - b[i]);   \
> +}
> +
> +#endif

It would be good to mark all of these functions with __attribute__((noipa)),
since I think interprocedural optimisations might otherwise defeat the
runtime test in abd_run_1.c (in the sense that we might end up folding
things at compile time and not testing the vector versions of the functions).
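
For example, the ABD_IDIOM variant of TEST1 could be adjusted along these
lines (sketch only; the other TEST macros would get the same treatment):

  #define TEST1(S, TYPE) \
  __attribute__((noipa)) \
  void fn_##S##_##TYPE (S TYPE * restrict a, \
                        S TYPE * restrict b, \
                        S TYPE * restrict out) { \
    for (int i = 0; i < N; i++) { \
      signed TYPE diff = b[i] - a[i]; \
      out[i] = diff > 0 ? diff : -diff; \
  } }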

> diff --git a/gcc/testsuite/gcc.target/aarch64/abd_2.c 
> b/gcc/testsuite/gcc.target/aarch64/abd_2.c
> new file mode 100644
> index 
> ..45bcfabe05a395f6775f78f28c73eb536ba5654e
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/abd_2.c
> @@ -0,0 +1,34 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3" } */
> +
> +#pragma GCC target "+nosve"
> +#define N 1024
> +
> +#define ABD_ABS
> +#include "abd.h"
> +
> +TEST1(signed, int)
> +TEST1(signed, short)
> +TEST1(signed, char)
> +
> +TEST2(signed, char, int)
> +TEST2(signed, char, short)
> +
> +TEST3(signed, char, int, short)
> +TEST3(signed, char, short, int)
> +
> +TEST1(unsigned, int)
> +TEST1(unsigned, short)
> +TEST1(unsigned, char)
> +
> +TEST2(unsigned, char, int)
> +TEST2(unsigned, char, short)
> +
> +TEST3(unsigned, char, int, short)
> +TEST3(unsigned, char, short, int)
> +
> +/* { dg-final { scan-assembler-times "sabd\\tv\[0-9\]+\.4s, v\[0-9\]+\.4s, 
> v\[0-9\]+\.4s" 2 } } */
> +/* { dg-final { scan-assembler-times "sabd\\tv\[0-9\]+\.8h, v\[0-9\]+\.8h, 
> v\[0-9\]+\.8h" 1 } } */
> +/* { dg-final { scan-assembler-times "sabd\\tv\[0-9\]+\.16b, v\[0-9\]+\.16b, 
> v\[0-9\]+\.16b" 1 } } */
> +/* { dg-final { scan-assembler-times "uabd\\tv\[0-9\]+\.8h, v\[0-9\]+\.8h, 
> 

Re: [PATCH] [PR96339] AArch64: Optimise svlast[ab]

2023-05-16 Thread Richard Sandiford via Gcc-patches
Tejas Belagod  writes:
>>> +   {
>>> + b = build3 (BIT_FIELD_REF, TREE_TYPE (f.lhs), val,
>>> + bitsize_int (step * BITS_PER_UNIT),
>>> + bitsize_int ((16 - step) * BITS_PER_UNIT));
>>> +
>>> + return gimple_build_assign (f.lhs, b);
>>> +   }
>>> +
>>> + /* If VECTOR_CST_NELTS_PER_PATTERN (pred) == 2 and every multiple of
>>> +'step_1' in
>>> +[VECTOR_CST_NPATTERNS .. VECTOR_CST_ENCODED_NELTS - 1]
>>> +is zero, then we can treat the vector as VECTOR_CST_NPATTERNS
>>> +elements followed by all inactive elements.  */
>>> + if (!const_vl && VECTOR_CST_NELTS_PER_PATTERN (pred) == 2)
>>
>> Following on from the above, maybe use:
>>
>>   !VECTOR_CST_NELTS (pred).is_constant ()
>>
>> instead of !const_vl here.
>>
>> I have a horrible suspicion that I'm contradicting our earlier discussion
>> here, sorry, but: I think we have to return null if NELTS_PER_PATTERN != 2.
>>
>> 
>>
>> IIUC, the NPATTERNS .. ENCODED_ELTS represent the repeated part of the encoded
>> constant. This means the repetition occurs if NELTS_PER_PATTERN == 2, IOW the
>> base1 repeats in the encoding. This loop is checking this condition and looks
>> for a 1 in the repeated part of the NELTS_PER_PATTERN == 2 in a VL vector.
>> Please correct me if I’m misunderstanding here.
>
> NELTS_PER_PATTERN == 1 is also a repeating pattern: it means that the
> entire sequence is repeated to fill a vector.  So if an NELTS_PER_PATTERN
> == 1 constant has elements {0, 1, 0, 0}, the vector is:
>
>{0, 1, 0, 0, 0, 1, 0, 0, ...}
>
>
> Wouldn’t the vect_all_same(pred, step) cover this case for a given value of
> step?
>
>
> and the optimisation can't handle that.  NELTS_PER_PATTERN == 3 isn't
> likely to occur for predicates, but in principle it has the same problem.
>
>  
>
> OK, I had misunderstood the encoding to always make base1 the repeating value
> by adjusting the NPATTERNS accordingly – I didn’t know you could also have the
> base2 value and beyond encoding the repeat value. In this case could I just
> remove NELTS_PER_PATTERN == 2 condition and the enclosed loop would check for 
> a
> repeating ‘1’ in the repeated part of the encoded pattern?

But for NELTS_PER_PATTERN==1, the whole encoded sequence repeats.
So you would have to start the check at element 0 rather than
NPATTERNS.  And then (for NELTS_PER_PATTERN==1) the loop would reject
any constant that has a nonzero element.  But all valid zero-vector
cases have been handled by this point, so the effect wouldn't be useful.

It should never be the case that all elements from NPATTERNS
onwards are zero for NELTS_PER_PATTERN==3; that case should be
canonicalised to NELTS_PER_PATTERN==2 instead.

So in practice it's simpler and more obviously correct to punt
when NELTS_PER_PATTERN != 2.
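
In code terms that's just an early-out ahead of the encoding checks, something
like this (sketch, assuming the fold routine declines by returning null as
gimple folders normally do):

  if (VECTOR_CST_NELTS_PER_PATTERN (pred) != 2)
    return NULL;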

Thanks,
Richard


[PATCH] aarch64: Allow moves after tied-register intrinsics (2nd edition)

2023-05-16 Thread Richard Sandiford via Gcc-patches
I missed these two in g:4ff89f10ca0d41f9cfa76 because I was
testing on a system that didn't support big-endian compilation.
Testing on aarch64_be-elf shows no other related failures
(although the overall results are worse than for little-endian).

Tested on aarch64_be-elf & pushed.

Richard


gcc/testsuite/
* gcc.target/aarch64/advsimd-intrinsics/bfdot-2.c: Allow mves
to occur after the intrinsic instruction, rather than requiring
them to happen before.
* gcc.target/aarch64/advsimd-intrinsics/vdot-3-2.c: Likewise.
---
 .../gcc.target/aarch64/advsimd-intrinsics/bfdot-2.c| 10 ++
 .../gcc.target/aarch64/advsimd-intrinsics/vdot-3-2.c   | 10 ++
 2 files changed, 20 insertions(+)

diff --git a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/bfdot-2.c 
b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/bfdot-2.c
index ae0a953f7b4..9975edb8fdb 100644
--- a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/bfdot-2.c
+++ b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/bfdot-2.c
@@ -70,8 +70,13 @@ float32x4_t ufooq_lane(float32x4_t r, bfloat16x8_t x, 
bfloat16x4_t y)
 
 /*
 **ufoo_untied:
+** (
 ** mov v0.8b, v1.8b
 ** bfdot   v0.2s, (v2.4h, v3.4h|v3.4h, v2.4h)
+** |
+** bfdot   v1.2s, (v2.4h, v3.4h|v3.4h, v2.4h)
+** mov v0.8b, v1.8b
+** )
 ** ret
 */
 float32x2_t ufoo_untied(float32x4_t unused, float32x2_t r, bfloat16x4_t x, 
bfloat16x4_t y)
@@ -81,8 +86,13 @@ float32x2_t ufoo_untied(float32x4_t unused, float32x2_t r, 
bfloat16x4_t x, bfloa
 
 /*
 **ufooq_lane_untied:
+** (
 ** mov v0.16b, v1.16b
 ** bfdot   v0.4s, v2.8h, v3.2h\[1\]
+** |
+** bfdot   v1.4s, v2.8h, v3.2h\[1\]
+** mov v0.16b, v1.16b
+** )
 ** ret
 */
 float32x4_t ufooq_lane_untied(float32x4_t unused, float32x4_t r, bfloat16x8_t 
x, bfloat16x4_t y)
diff --git a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vdot-3-2.c 
b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vdot-3-2.c
index 61c7c51f5ec..76787f6bedd 100644
--- a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vdot-3-2.c
+++ b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vdot-3-2.c
@@ -115,8 +115,13 @@ int32x4_t sfooq_laneq (int32x4_t r, int8x16_t x, 
uint8x16_t y)
 
 /*
 **ufoo_untied:
+** (
 ** mov v0\.8b, v1\.8b
 ** usdot   v0\.2s, v2\.8b, v3\.8b
+** |
+** usdot   v1\.2s, v2\.8b, v3\.8b
+** mov v0\.8b, v1\.8b
+** )
 ** ret
 */
 int32x2_t ufoo_untied (int32x2_t unused, int32x2_t r, uint8x8_t x, int8x8_t y)
@@ -126,8 +131,13 @@ int32x2_t ufoo_untied (int32x2_t unused, int32x2_t r, 
uint8x8_t x, int8x8_t y)
 
 /*
 **ufooq_laneq_untied:
+** (
 ** mov v0\.16b, v1\.16b
 ** usdot   v0\.4s, v2\.16b, v3\.4b\[3\]
+** |
+** usdot   v1\.4s, v2\.16b, v3\.4b\[3\]
+** mov v0\.16b, v1\.16b
+** )
 ** ret
 */
 int32x4_t ufooq_laneq_untied (int32x2_t unused, int32x4_t r, uint8x16_t x, 
int8x16_t y)
-- 
2.25.1



Re: [PATCH v3] Machine_Mode: Extend machine_mode from 8 to 16 bits

2023-05-16 Thread Richard Sandiford via Gcc-patches
pan2...@intel.com writes:
> diff --git a/gcc/rtl-ssa/accesses.h b/gcc/rtl-ssa/accesses.h
> index c5180b9308a..38b4d6160c2 100644
> --- a/gcc/rtl-ssa/accesses.h
> +++ b/gcc/rtl-ssa/accesses.h
> @@ -254,7 +254,7 @@ private:
>unsigned int m_spare : 2;
>  
>// The value returned by the accessor above.
> -  machine_mode m_mode : 8;
> +  machine_mode m_mode : MACHINE_MODE_BITSIZE;
>  };
>  
>  // A contiguous array of access_info pointers.  Used to represent a

This structure (access_info) isn't mentioned in the table in the patch
description.  The structure is currently 1 LP64 word and is very
size-sensitive.  I think we should:

- Put the mode after m_regno
- Reduce m_kind to 2 bits
- Remove m_spare

I *think* that will keep the current size, but please check.
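
I.e. roughly this layout (sketch only; the remaining bitfields are omitted):

  // The values returned by the accessors above.
  unsigned int m_regno;
  machine_mode m_mode : MACHINE_MODE_BITSIZE;
  access_kind m_kind : 2;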

LGTM otherwise.

Thanks,
Richard


Re: [PATCH] [PR96339] AArch64: Optimise svlast[ab]

2023-05-16 Thread Richard Sandiford via Gcc-patches
Tejas Belagod  writes:
>> +  {
>> +int i;
>> +int nelts = vector_cst_encoded_nelts (v);
>> +int first_el = 0;
>> +
>> +for (i = first_el; i < nelts; i += step)
>> +  if (VECTOR_CST_ENCODED_ELT (v, i) != VECTOR_CST_ENCODED_ELT (v,
> first_el))
>
> I think this should use !operand_equal_p (..., ..., 0).
>
>
> Oops! I wonder why I thought VECTOR_CST_ENCODED_ELT returned a constant! 
> Thanks
> for spotting that.

It does only return a constant.  But there can be multiple trees with
the same constant value, through things like TREE_OVERFLOW (not sure
where things stand on expunging that from gimple) and the fact that
gimple does not maintain a distinction between different types that
have the same mode and signedness.  (E.g. on ILP32 hosts, gimple does
not maintain a distinction between int and long, even though int 0 and
long 0 are different trees.)

> Also, should the flags here be OEP_ONLY_CONST ?

Nah, just 0 should be fine.

>> + return false;
>> +
>> +return true;
>> +  }
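
Putting those two points together, the loop above might end up as something
like this (sketch only):

    for (i = first_el; i < nelts; i += step)
      if (!operand_equal_p (VECTOR_CST_ENCODED_ELT (v, i),
                            VECTOR_CST_ENCODED_ELT (v, first_el), 0))
        return false;
    return true;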
>> +
>> +  /* Fold a svlast{a/b} call with constant predicate to a BIT_FIELD_REF.
>> + BIT_FIELD_REF lowers to a NEON element extract, so we have to make sure
>> + the index of the element being accessed is in the range of a NEON vector
>> + width.  */
>
> s/NEON/Advanced SIMD/.  Same in later comments
>
>> +  gimple *fold (gimple_folder & f) const override
>> +  {
>> +tree pred = gimple_call_arg (f.call, 0);
>> +tree val = gimple_call_arg (f.call, 1);
>> +
>> +if (TREE_CODE (pred) == VECTOR_CST)
>> +  {
>> + HOST_WIDE_INT pos;
>> + unsigned int const_vg;
>> + int i = 0;
>> + int step = f.type_suffix (0).element_bytes;
>> + int step_1 = gcd (step, VECTOR_CST_NPATTERNS (pred));
>> + int npats = VECTOR_CST_NPATTERNS (pred);
>> + unsigned HOST_WIDE_INT nelts = vector_cst_encoded_nelts (pred);
>> + tree b = NULL_TREE;
>> + bool const_vl = aarch64_sve_vg.is_constant (&const_vg);
>
> I think this might be left over from previous versions, but:
> const_vg isn't used and const_vl is only used once, so I think it
> would be better to remove them.
>
>> +
>> + /* We can optimize 2 cases common to variable and fixed-length cases
>> +without a linear search of the predicate vector:
>> +1.  LASTA if predicate is all true, return element 0.
>> +2.  LASTA if predicate all false, return element 0.  */
>> + if (is_lasta () && vect_all_same (pred, step_1))
>> +   {
>> + b = build3 (BIT_FIELD_REF, TREE_TYPE (f.lhs), val,
>> + bitsize_int (step * BITS_PER_UNIT), bitsize_int (0));
>> + return gimple_build_assign (f.lhs, b);
>> +   }
>> +
>> + /* Handle the all-false case for LASTB where SVE VL == 128b -
>> +return the highest numbered element.  */
>> + if (is_lastb () && known_eq (BYTES_PER_SVE_VECTOR, 16)
>> + && vect_all_same (pred, step_1)
>> + && integer_zerop (VECTOR_CST_ENCODED_ELT (pred, 0)))
>
> Formatting nit: one condition per line once one line isn't enough.
>
>> +   {
>> + b = build3 (BIT_FIELD_REF, TREE_TYPE (f.lhs), val,
>> + bitsize_int (step * BITS_PER_UNIT),
>> + bitsize_int ((16 - step) * BITS_PER_UNIT));
>> +
>> + return gimple_build_assign (f.lhs, b);
>> +   }
>> +
>> + /* If VECTOR_CST_NELTS_PER_PATTERN (pred) == 2 and every multiple of
>> +'step_1' in
>> +[VECTOR_CST_NPATTERNS .. VECTOR_CST_ENCODED_NELTS - 1]
>> +is zero, then we can treat the vector as VECTOR_CST_NPATTERNS
>> +elements followed by all inactive elements.  */
>> + if (!const_vl && VECTOR_CST_NELTS_PER_PATTERN (pred) == 2)
>
> Following on from the above, maybe use:
>
>   !VECTOR_CST_NELTS (pred).is_constant ()
>
> instead of !const_vl here.
>
> I have a horrible suspicion that I'm contradicting our earlier discussion
> here, sorry, but: I think we have to return null if NELTS_PER_PATTERN != 2.
>
>  
>
> IIUC, the NPATTERNS .. ENCODED_ELTS represent the repeated part of the encoded
> constant. This means the repetition occurs if NELTS_PER_PATTERN == 2, IOW the
> base1 repeats in the encoding. This loop is checking this condition and looks
> for a 1 in the repeated part of the NELTS_PER_PATTERN == 2 in a VL vector.
> Please correct me if I’m misunderstanding here.

NELTS_PER_PATTERN == 1 is also a repeating pattern: it means that the
entire sequence is repeated to fill a vector.  So if an NELTS_PER_PATTERN
== 1 constant has elements {0, 1, 0, 0}, the vector is:

   {0, 1, 0, 0, 0, 1, 0, 0, ...}

and the optimisation can't handle that.  NELTS_PER_PATTERN == 3 isn't
likely to occur for predicates, but in principle it has the same problem.

Thanks,
Richard


Re: [PATCH V7] VECT: Add decrement IV support in Loop Vectorizer

2023-05-16 Thread Richard Sandiford via Gcc-patches
"juzhe.zh...@rivai.ai"  writes:
> Hi, Richard.
>
> RVV infrastructure in RISC-V backend status:
> 1. All RVV instructions pattern related to intrinsics are all finished (They 
> will be called not only by intrinsics but also autovec in the future).
> 2. In case of autovec, we finished len_load/len_store (They are temporary 
> used and will be removed after I support len_mask_load/len_mask_store in the 
> middle-end).
>binary integer autovec patterns.
>vec_init pattern.
>That's all we have so far.

Thanks.

> In case of testing of this patch, I have multiple rgroup testcases in local, 
> you mean you want me to post them together with this patch?
> Since I am gonna to put them in RISC-V backend testsuite, I was planning to 
> post them after this patch is finished and merged into trunk.
> What do you suggest ?

It would be useful to include the tests with the patch itself (as a patch
to the testsuite).  It doesn't matter that the tests are riscv-specific.

Obviously it would be more appropriate for the riscv maintainers to
review the riscv tests.  But keeping the tests with the patch helps when
reviewing the code, and also avoids the code being committed but never
actually tested.

Richard


Re: [PATCH V7] VECT: Add decrement IV support in Loop Vectorizer

2023-05-16 Thread Richard Sandiford via Gcc-patches
"juzhe.zh...@rivai.ai"  writes:
> Oh, 
> I am sorry for the typos in the last email; here is a corrected version:
>
> Hi, Richard.
> For case 2, I come up with this idea:
> +  Case 2 (SLP multiple rgroup):
> + ...
> + _38 = (unsigned long) n_12(D);
> + _39 = _38 * 2;
> + _40 = MAX_EXPR <_39, 16>;   ->remove
> + _41 = _40 - 16; ->remove
>
> + ...
> + # ivtmp_42 = PHI   ->remove
>
> + # ivtmp_45 = PHI 
> + ...
> + _44 = MIN_EXPR ;  ->remove
>
> + _47 = MIN_EXPR ;
> + _47_2 = MIN_EXPR <_47, 16>;   -> add
> + _47_3 = _47 - _47_2;   -> add
> + ...
> + .LEN_STORE (_6, 8B, _47_2, ...);
> + ...
> + .LEN_STORE (_25, 8B, _47_3, ...);
> + _33 = _47_2 / 2;
> + ...
> + .LEN_STORE (_8, 16B, _33, ...);
> + _36 = _47_3 / 2;
> + ...
> + .LEN_STORE (_15, 16B, _36, ...);
> + ivtmp_46 = ivtmp_45 - _47;
> + ivtmp_43 = ivtmp_42 - _44;  ->remove
>
> + ...
> + if (ivtmp_46 != 0)
> +   goto ; [83.33%]
> + else
> +   goto ; [16.67%]
> Is it reasonable ? Or you do have better idea for it?

Yeah, this makes sense, and I think it makes case 2 very similar
(equivalent?) to case 3.  If so, it would be nice if they could be
combined.

Of course, this loses the nice property that the original had: that each
IV was independent, and so the dependency chains were shorter.  With the
above approach, the second length parameter instead depends on a
three-instruction chain.  But that might be OK (up to you).

How much of the riscv backend infrastructure is in place now?  The reason
I ask is that it would be good if the patch had some tests.  AIUI, the
patch is an optimisation on top of what the current len_load/store code does,
rather than something that is needed for correctness.  So it seems like
the necessary patterns could be added and tested using the current approach,
then this patch could be applied on top, with its own tests for the new
approach.

Thanks,
Richard


Re: [PATCH v3] Machine_Mode: Extend machine_mode from 8 to 16 bits

2023-05-16 Thread Richard Sandiford via Gcc-patches
"Li, Pan2"  writes:
> Kindly ping for this PATCH v3.

The patch was sent on Saturday, so this is effectively pinging after
one working day in most of Europe and America.  That's too soon and
comes across as aggressive.

I realise you and others are working intensively on this.  But in a
sense that's part of the reason why reviews might seem slow.  The volume
of RVV patches recently has been pretty high, so it's been difficult to
keep up.  There have also been many other non-RVV patches that have
been "unlocked" by stage 1 opening, so there's a high volume from that
as well.

Also, please bear in mind that most people active in the GCC community
have their own work to do and can only dedicate a certain amount of
the day to reviews.  And reviewing patches can be time-consuming in
itself.

So sometimes a patch will get a review within the day.  Sometimes it
will take a bit longer.  The fact that a patch doesn't get a response
within one working day doesn't mean that it's been forgotten.

Thanks,
Richard


Re: [PATCH V7] VECT: Add decrement IV support in Loop Vectorizer

2023-05-16 Thread Richard Sandiford via Gcc-patches
"juzhe.zh...@rivai.ai"  writes:
>>> The examples are good, but this one made me wonder: why is the
>>> adjustment made to the limit (namely 16, the gap between _39 and _41)
>>> different from the limits imposed by the MIN_EXPR (32)?  And I think
>>> the answer is that:
>
>>> - _47 counts the number of elements processed by the loop in total,
>>>   including the vectors under the control of _44
>
>>> - _44 counts the number of elements controlled by _47 in the next
>>>   iteration of the vector loop (if there is one)
>
>>> And that's needed to allow the IVs to be updated independently.
>
>>> The difficulty with this is that the len_load* and len_store*
>>> optabs currently say that the behaviour is undefined if the
>>> length argument is greater than the length of a vector.
>>> So I think using these values of _47 and _44 in the .LEN_STOREs
>>> is relying on undefined behaviour.
>
>>> Haven't had time to think about the consequences of that yet,
>>> but wanted to send something out sooner rather than later.
>
> Hi, Richard. I totally understand your concern now. I think the undefined
> behaviour is acceptable for RVV, since we have the vsetvli instruction to
> guarantee this will not cause issues. However, for some other target, we may
> need an additional MIN_EXPR to guard the length so that it never exceeds VF.
> I think it can be addressed in the future when it is needed.

But we can't generate (vector) gimple that has undefined behaviour from
(scalar) gimple that had defined behaviour.  So something needs to change.
Either we need to generate a different sequence, or we need to define
what the behaviour of len_load/store/etc. are when the length is out of
range (perhaps under a target hook?).

We also need to be consistent.  If case 2 is allowed to use length
parameters that are greater than the vector length, then there's no
reason for case 1 to use the result of the MIN_EXPR as the length
parameter.  It could just use the loop IV directly.  (I realise the
select_vl patch will change case 1 for RVV anyway.  But the principle
still holds.)

What does the riscv backend's implementation of the len_load and
len_store guarantee?  Is any length greater than the vector length
capped to the vector length?  Or is it more complicated than that?

Thanks,
Richard


Re: [PATCH V7] VECT: Add decrement IV support in Loop Vectorizer

2023-05-15 Thread Richard Sandiford via Gcc-patches
juzhe.zh...@rivai.ai writes:
> From: Juzhe-Zhong 
>
> This patch implement decrement IV for length approach in loop control.
>
> Address comment from kewen that incorporate the implementation inside
> "vect_set_loop_controls_directly" instead of a standalone function.
>
> Address comment from Richard using MIN_EXPR to handle these 3 following
> cases
> 1. single rgroup.
> 2. multiple rgroup for SLP.
> 3. multiple rgroup for non-SLP (tested on vec_pack_trunc).

Thanks, this looks pretty reasonable to me FWIW, but some comments below:

> Bootstraped && Regression on x86.
>
> Ok for trunk ?
>
> gcc/ChangeLog:
>
> * tree-vect-loop-manip.cc (vect_adjust_loop_lens): New function.
> (vect_set_loop_controls_directly): Add decrement IV support.
> (vect_set_loop_condition_partial_vectors): Ditto.
> * tree-vect-loop.cc (_loop_vec_info::_loop_vec_info): Add a new 
> variable.
> (vect_get_loop_len): Add decrement IV support.
> * tree-vect-stmts.cc (vectorizable_store): Ditto.
> (vectorizable_load): Ditto.
> * tree-vectorizer.h (LOOP_VINFO_USING_DECREMENTING_IV_P): New macro.
> (vect_get_loop_len): Add decrement IV support.
>
> ---
>  gcc/tree-vect-loop-manip.cc | 177 +++-
>  gcc/tree-vect-loop.cc   |  38 +++-
>  gcc/tree-vect-stmts.cc  |   9 +-
>  gcc/tree-vectorizer.h   |  13 ++-
>  4 files changed, 224 insertions(+), 13 deletions(-)
>
> diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc
> index ff6159e08d5..1baac7b1b52 100644
> --- a/gcc/tree-vect-loop-manip.cc
> +++ b/gcc/tree-vect-loop-manip.cc
> @@ -385,6 +385,58 @@ vect_maybe_permute_loop_masks (gimple_seq *seq, 
> rgroup_controls *dest_rgm,
>return false;
>  }
>  
> +/* Try to use adjust loop lens for non-SLP multiple-rgroups.
> +
> + _36 = MIN_EXPR ;
> +
> + First length (MIN (X, VF/N)):
> +   loop_len_15 = MIN_EXPR <_36, VF/N>;
> +
> + Second length:
> +   tmp = _36 - loop_len_15;
> +   loop_len_16 = MIN (tmp, VF/N);
> +
> + Third length:
> +   tmp2 = tmp - loop_len_16;
> +   loop_len_17 = MIN (tmp2, VF/N);
> +
> + Forth length:
> +   tmp3 = tmp2 - loop_len_17;
> +   loop_len_18 = MIN (tmp3, VF/N);  */
> +
> +static void
> +vect_adjust_loop_lens (tree iv_type, gimple_seq *seq, rgroup_controls 
> *dest_rgm,
> +rgroup_controls *src_rgm)
> +{
> +  tree ctrl_type = dest_rgm->type;
> +  poly_uint64 nitems_per_ctrl
> += TYPE_VECTOR_SUBPARTS (ctrl_type) * dest_rgm->factor;
> +
> +  for (unsigned int i = 0; i < dest_rgm->controls.length (); ++i)
> +{
> +  tree src = src_rgm->controls[i / dest_rgm->controls.length ()];
> +  tree dest = dest_rgm->controls[i];
> +  tree length_limit = build_int_cst (iv_type, nitems_per_ctrl);
> +  gassign *stmt;
> +  if (i == 0)
> + {
> +   /* MIN (X, VF*I/N) capped to the range [0, VF/N].  */
> +   stmt = gimple_build_assign (dest, MIN_EXPR, src, length_limit);
> +   gimple_seq_add_stmt (seq, stmt);
> + }
> +  else
> + {
> +   /* (MIN (remain, VF*I/N)) capped to the range [0, VF/N].  */
> +   tree temp = make_ssa_name (iv_type);
> +   stmt = gimple_build_assign (temp, MINUS_EXPR, src,
> +   dest_rgm->controls[i - 1]);
> +   gimple_seq_add_stmt (seq, stmt);
> +   stmt = gimple_build_assign (dest, MIN_EXPR, temp, length_limit);
> +   gimple_seq_add_stmt (seq, stmt);
> + }
> +}
> +}
> +
>  /* Helper for vect_set_loop_condition_partial_vectors.  Generate definitions
> for all the rgroup controls in RGC and return a control that is nonzero
> when the loop needs to iterate.  Add any new preheader statements to
> @@ -468,9 +520,10 @@ vect_set_loop_controls_directly (class loop *loop, 
> loop_vec_info loop_vinfo,
>gimple_stmt_iterator incr_gsi;
>bool insert_after;
>standard_iv_increment_position (loop, &incr_gsi, &insert_after);
> -  create_iv (build_int_cst (iv_type, 0), PLUS_EXPR, nitems_step, NULL_TREE,
> -  loop, &incr_gsi, insert_after, &index_before_incr,
> -  &index_after_incr);
> +  if (!LOOP_VINFO_USING_DECREMENTING_IV_P (loop_vinfo))
> +    create_iv (build_int_cst (iv_type, 0), PLUS_EXPR, nitems_step, NULL_TREE,
> +loop, &incr_gsi, insert_after, &index_before_incr,
> +&index_after_incr);
>  
>tree zero_index = build_int_cst (compare_type, 0);
>tree test_index, test_limit, first_limit;
> @@ -552,8 +605,13 @@ vect_set_loop_controls_directly (class loop *loop, 
> loop_vec_info loop_vinfo,
>/* Convert the IV value to the comparison type (either a no-op or
>   a demotion).  */
>gimple_seq test_seq = NULL;
> -  test_index = gimple_convert (&test_seq, compare_type, test_index);
> -  gsi_insert_seq_before (test_gsi, test_seq, GSI_SAME_STMT);
> +  if (LOOP_VINFO_USING_DECREMENTING_IV_P (loop_vinfo))
> +test_limit = gimple_convert (preheader_seq, iv_type, nitems_total);
> +  else
> +  

Re: [aarch64] Code-gen for vector initialization involving constants

2023-05-15 Thread Richard Sandiford via Gcc-patches
Prathamesh Kulkarni  writes:
> Hi Richard,
> After committing the interleave+zip1 patch for vector initialization,
> it seems to regress the s32 case for this patch:
>
> int32x4_t f_s32(int32_t x)
> {
>   return (int32x4_t) { x, x, x, 1 };
> }
>
> code-gen:
> f_s32:
> movi v30.2s, 0x1
> fmov s31, w0
> dup v0.2s, v31.s[0]
> ins v30.s[0], v31.s[0]
> zip1 v0.4s, v0.4s, v30.4s
> ret
>
> instead of expected code-gen:
> f_s32:
> movi v31.2s, 0x1
> dup v0.4s, w0
> ins v0.s[3], v31.s[0]
> ret
>
> Cost for fallback sequence: 16
> Cost for interleave and zip sequence: 12
>
> For the above case, the cost for interleave+zip1 sequence is computed as:
> halves[0]:
> (set (reg:V2SI 96)
> (vec_duplicate:V2SI (reg/v:SI 93 [ x ])))
> cost = 8
>
> halves[1]:
> (set (reg:V2SI 97)
> (const_vector:V2SI [
> (const_int 1 [0x1]) repeated x2
> ]))
> (set (reg:V2SI 97)
> (vec_merge:V2SI (vec_duplicate:V2SI (reg/v:SI 93 [ x ]))
> (reg:V2SI 97)
> (const_int 1 [0x1])))
> cost = 8
>
> followed by:
> (set (reg:V4SI 95)
> (unspec:V4SI [
> (subreg:V4SI (reg:V2SI 96) 0)
> (subreg:V4SI (reg:V2SI 97) 0)
> ] UNSPEC_ZIP1))
> cost = 4
>
> So the total cost becomes
> max(costs[0], costs[1]) + zip1_insn_cost
> = max(8, 8) + 4
> = 12
>
> While the fallback rtl sequence is:
> (set (reg:V4SI 95)
> (vec_duplicate:V4SI (reg/v:SI 93 [ x ])))
> cost = 8
> (set (reg:SI 98)
> (const_int 1 [0x1]))
> cost = 4
> (set (reg:V4SI 95)
> (vec_merge:V4SI (vec_duplicate:V4SI (reg:SI 98))
> (reg:V4SI 95)
> (const_int 8 [0x8])))
> cost = 4
>
> So total cost = 8 + 4 + 4 = 16, and we choose the interleave+zip1 sequence.
>
> I think the issue is probably that for the interleave+zip1 sequence we take
> max(costs[0], costs[1]) to reflect that both halves are interleaved,
> but for the fallback seq we use seq_cost, which assumes serial execution
> of insns in the sequence.
> For above fallback sequence,
> set (reg:V4SI 95)
> (vec_duplicate:V4SI (reg/v:SI 93 [ x ])))
> and
> (set (reg:SI 98)
> (const_int 1 [0x1]))
> could be executed in parallel, which would make its cost max(8, 4) + 4 = 12.

Agreed.

A good-enough substitute for this might be to ignore scalar moves
(for both alternatives) when costing for speed.
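
Purely as an illustration (not the actual patch), the idea is something like
the following, where SEQ is the candidate insn sequence and scalar_move_insn_p
is a hypothetical helper that tests for a SET of a scalar-mode destination
from a move operand:

  unsigned int cost = 0;
  for (rtx_insn *insn = seq; insn; insn = NEXT_INSN (insn))
    if (!scalar_move_insn_p (insn))
      cost += insn_cost (insn, /*speed=*/true);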

> I was wondering if we should we make cost for interleave+zip1 sequence
> more conservative
> by not taking max, but summing up costs[0] + costs[1] even for speed ?
> For this case,
> that would be 8 + 8 + 4 = 20.
>
> It generates the fallback sequence for other cases (s8, s16, s64) from
> the test-case.

What does it do for the tests in the interleave+zip1 patch?  If it doesn't
make a difference there then it sounds like we don't have enough tests. :)

Summing is only conservative if the fallback sequence is somehow "safer".
But I don't think it is.   Building an N-element vector from N scalars
can be done using N instructions in the fallback case and N+1 instructions
in the interleave+zip1 case.  But the interleave+zip1 case is still
better (speedwise) for N==16.

Thanks,
Richard


Re: [PATCH 2/6] aarch64: Allow moves after tied-register intrinsics

2023-05-15 Thread Richard Sandiford via Gcc-patches
Kyrylo Tkachov  writes:
>> -Original Message-
>> From: Richard Sandiford 
>> Sent: Monday, May 15, 2023 3:18 PM
>> To: Kyrylo Tkachov 
>> Cc: gcc-patches@gcc.gnu.org
>> Subject: Re: [PATCH 2/6] aarch64: Allow moves after tied-register intrinsics
>> 
>> Kyrylo Tkachov  writes:
>> > Hi Richard,
>> >
>> >> -Original Message-
>> >> From: Gcc-patches > >> bounces+kyrylo.tkachov=arm@gcc.gnu.org> On Behalf Of Richard
>> >> Sandiford via Gcc-patches
>> >> Sent: Tuesday, May 9, 2023 7:48 AM
>> >> To: gcc-patches@gcc.gnu.org
>> >> Cc: Richard Sandiford 
>> >> Subject: [PATCH 2/6] aarch64: Allow moves after tied-register intrinsics
>> >>
>> >> Some ACLE intrinsics map to instructions that tie the output
>> >> operand to an input operand.  If all the operands are allocated
>> >> to different registers, and if MOVPRFX can't be used, we will need
>> >> a move either before the instruction or after it.  Many tests only
>> >> matched the "before" case; this patch makes them accept the "after"
>> >> case too.
>> >>
>> >> gcc/testsuite/
>> >>   * gcc.target/aarch64/advsimd-intrinsics/bfcvtnq2-untied.c: Allow
>> >>   moves to occur after the intrinsic instruction, rather than 
>> >> requiring
>> >>   them to happen before.
>> >>   * gcc.target/aarch64/advsimd-intrinsics/bfdot-1.c: Likewise.
>> >>   * gcc.target/aarch64/advsimd-intrinsics/vdot-3-1.c: Likewise.
>> >
>> > I'm seeing some dot-product intrinsics failures:
>> > FAIL: gcc.target/aarch64/advsimd-intrinsics/bfdot-2.c   -O1   
>> > check-function-
>> bodies ufoo_untied
>> > FAIL: gcc.target/aarch64/advsimd-intrinsics/bfdot-2.c   -O1   
>> > check-function-
>> bodies ufooq_lane_untied
>> > FAIL: gcc.target/aarch64/advsimd-intrinsics/bfdot-2.c   -O2   
>> > check-function-
>> bodies ufoo_untied
>> > FAIL: gcc.target/aarch64/advsimd-intrinsics/bfdot-2.c   -O2   
>> > check-function-
>> bodies ufooq_lane_untied
>> > FAIL: gcc.target/aarch64/advsimd-intrinsics/bfdot-2.c   -O2 -flto -fno-use-
>> linker-plugin -flto-partition=none   check-function-bodies ufoo_untied
>> > FAIL: gcc.target/aarch64/advsimd-intrinsics/bfdot-2.c   -O2 -flto -fno-use-
>> linker-plugin -flto-partition=none   check-function-bodies ufooq_lane_untied
>> > FAIL: gcc.target/aarch64/advsimd-intrinsics/bfdot-2.c   -O3 -g   check-
>> function-bodies ufoo_untied
>> > FAIL: gcc.target/aarch64/advsimd-intrinsics/bfdot-2.c   -O3 -g   check-
>> function-bodies ufooq_lane_untied
>> > FAIL: gcc.target/aarch64/advsimd-intrinsics/bfdot-2.c   -Og -g   check-
>> function-bodies ufoo_untied
>> > FAIL: gcc.target/aarch64/advsimd-intrinsics/bfdot-2.c   -Og -g   check-
>> function-bodies ufooq_lane_untied
>> > FAIL: gcc.target/aarch64/advsimd-intrinsics/bfdot-2.c   -Os   
>> > check-function-
>> bodies ufoo_untied
>> > FAIL: gcc.target/aarch64/advsimd-intrinsics/bfdot-2.c   -Os   
>> > check-function-
>> bodies ufooq_lane_untied
>> > FAIL: gcc.target/aarch64/advsimd-intrinsics/vdot-3-2.c   -O1   check-
>> function-bodies ufoo_untied
>> > FAIL: gcc.target/aarch64/advsimd-intrinsics/vdot-3-2.c   -O1   check-
>> function-bodies ufooq_laneq_untied
>> > FAIL: gcc.target/aarch64/advsimd-intrinsics/vdot-3-2.c   -O2   check-
>> function-bodies ufoo_untied
>> > FAIL: gcc.target/aarch64/advsimd-intrinsics/vdot-3-2.c   -O2   check-
>> function-bodies ufooq_laneq_untied
>> > FAIL: gcc.target/aarch64/advsimd-intrinsics/vdot-3-2.c   -O2 -flto 
>> > -fno-use-
>> linker-plugin -flto-partition=none   check-function-bodies ufoo_untied
>> > FAIL: gcc.target/aarch64/advsimd-intrinsics/vdot-3-2.c   -O2 -flto 
>> > -fno-use-
>> linker-plugin -flto-partition=none   check-function-bodies
>> ufooq_laneq_untied
>> > FAIL: gcc.target/aarch64/advsimd-intrinsics/vdot-3-2.c   -O3 -g   check-
>> function-bodies ufoo_untied
>> > FAIL: gcc.target/aarch64/advsimd-intrinsics/vdot-3-2.c   -O3 -g   check-
>> function-bodies ufooq_laneq_untied
>> > FAIL: gcc.target/aarch64/advsimd-intrinsics/vdot-3-2.c   -Og -g   check-
>> function-bodies ufoo_untied
>> > FAIL: gcc.target/aarch64/advsimd-intrinsics/vdot-3-2.c   -Og -g   check-
>> function-bodies ufooq_laneq_untied
>> > FAIL: gcc.target/aarch64/advsimd-intrinsics/vdot-3-2.c   -Os   
>> > check-function-
>> bodies ufoo_untied
>> > FAIL: gcc.target/aarch64/advsimd-intrinsics/vdot-3-2.c   -Os   
>> > check-function-
>> bodies ufooq_laneq_untied
>> 
>> Ugh.  Big-endian.  Hadn't thought about that being an issue.
>> Was testing natively on little-endian aarch64-linux-gnu and
>> didn't see these.
>
> FWIW this is on a little-endian aarch64-none-elf configuration.

Yeah, but the tests force big-endian, and require a C library that
supports big-endian.  Newlib supports both endiannesses, but a given
glibc installation doesn't.  So the tests will be exercised on *-elf
of any endianness, but will only be exercised on *-linux-gnu for
big-endian.

Richard


Re: [PATCH 2/6] aarch64: Allow moves after tied-register intrinsics

2023-05-15 Thread Richard Sandiford via Gcc-patches
Kyrylo Tkachov  writes:
> Hi Richard,
>
>> -Original Message-
>> From: Gcc-patches > bounces+kyrylo.tkachov=arm@gcc.gnu.org> On Behalf Of Richard
>> Sandiford via Gcc-patches
>> Sent: Tuesday, May 9, 2023 7:48 AM
>> To: gcc-patches@gcc.gnu.org
>> Cc: Richard Sandiford 
>> Subject: [PATCH 2/6] aarch64: Allow moves after tied-register intrinsics
>>
>> Some ACLE intrinsics map to instructions that tie the output
>> operand to an input operand.  If all the operands are allocated
>> to different registers, and if MOVPRFX can't be used, we will need
>> a move either before the instruction or after it.  Many tests only
>> matched the "before" case; this patch makes them accept the "after"
>> case too.
>>
>> gcc/testsuite/
>>   * gcc.target/aarch64/advsimd-intrinsics/bfcvtnq2-untied.c: Allow
>>   moves to occur after the intrinsic instruction, rather than requiring
>>   them to happen before.
>>   * gcc.target/aarch64/advsimd-intrinsics/bfdot-1.c: Likewise.
>>   * gcc.target/aarch64/advsimd-intrinsics/vdot-3-1.c: Likewise.
>
> I'm seeing some dot-product intrinsics failures:
> FAIL: gcc.target/aarch64/advsimd-intrinsics/bfdot-2.c   -O1   
> check-function-bodies ufoo_untied
> FAIL: gcc.target/aarch64/advsimd-intrinsics/bfdot-2.c   -O1   
> check-function-bodies ufooq_lane_untied
> FAIL: gcc.target/aarch64/advsimd-intrinsics/bfdot-2.c   -O2   
> check-function-bodies ufoo_untied
> FAIL: gcc.target/aarch64/advsimd-intrinsics/bfdot-2.c   -O2   
> check-function-bodies ufooq_lane_untied
> FAIL: gcc.target/aarch64/advsimd-intrinsics/bfdot-2.c   -O2 -flto 
> -fno-use-linker-plugin -flto-partition=none   check-function-bodies 
> ufoo_untied
> FAIL: gcc.target/aarch64/advsimd-intrinsics/bfdot-2.c   -O2 -flto 
> -fno-use-linker-plugin -flto-partition=none   check-function-bodies 
> ufooq_lane_untied
> FAIL: gcc.target/aarch64/advsimd-intrinsics/bfdot-2.c   -O3 -g   
> check-function-bodies ufoo_untied
> FAIL: gcc.target/aarch64/advsimd-intrinsics/bfdot-2.c   -O3 -g   
> check-function-bodies ufooq_lane_untied
> FAIL: gcc.target/aarch64/advsimd-intrinsics/bfdot-2.c   -Og -g   
> check-function-bodies ufoo_untied
> FAIL: gcc.target/aarch64/advsimd-intrinsics/bfdot-2.c   -Og -g   
> check-function-bodies ufooq_lane_untied
> FAIL: gcc.target/aarch64/advsimd-intrinsics/bfdot-2.c   -Os   
> check-function-bodies ufoo_untied
> FAIL: gcc.target/aarch64/advsimd-intrinsics/bfdot-2.c   -Os   
> check-function-bodies ufooq_lane_untied
> FAIL: gcc.target/aarch64/advsimd-intrinsics/vdot-3-2.c   -O1   
> check-function-bodies ufoo_untied
> FAIL: gcc.target/aarch64/advsimd-intrinsics/vdot-3-2.c   -O1   
> check-function-bodies ufooq_laneq_untied
> FAIL: gcc.target/aarch64/advsimd-intrinsics/vdot-3-2.c   -O2   
> check-function-bodies ufoo_untied
> FAIL: gcc.target/aarch64/advsimd-intrinsics/vdot-3-2.c   -O2   
> check-function-bodies ufooq_laneq_untied
> FAIL: gcc.target/aarch64/advsimd-intrinsics/vdot-3-2.c   -O2 -flto 
> -fno-use-linker-plugin -flto-partition=none   check-function-bodies 
> ufoo_untied
> FAIL: gcc.target/aarch64/advsimd-intrinsics/vdot-3-2.c   -O2 -flto 
> -fno-use-linker-plugin -flto-partition=none   check-function-bodies 
> ufooq_laneq_untied
> FAIL: gcc.target/aarch64/advsimd-intrinsics/vdot-3-2.c   -O3 -g   
> check-function-bodies ufoo_untied
> FAIL: gcc.target/aarch64/advsimd-intrinsics/vdot-3-2.c   -O3 -g   
> check-function-bodies ufooq_laneq_untied
> FAIL: gcc.target/aarch64/advsimd-intrinsics/vdot-3-2.c   -Og -g   
> check-function-bodies ufoo_untied
> FAIL: gcc.target/aarch64/advsimd-intrinsics/vdot-3-2.c   -Og -g   
> check-function-bodies ufooq_laneq_untied
> FAIL: gcc.target/aarch64/advsimd-intrinsics/vdot-3-2.c   -Os   
> check-function-bodies ufoo_untied
> FAIL: gcc.target/aarch64/advsimd-intrinsics/vdot-3-2.c   -Os   
> check-function-bodies ufooq_laneq_untied

Ugh.  Big-endian.  Hadn't thought about that being an issue.
Was testing natively on little-endian aarch64-linux-gnu and
didn't see these.

> From a quick inspection it looks like it's just an alternative regalloc that 
> moves the mov + dot instructions around, similar to what you fixed in 
> bfdot-2.c and vdot-3-2.c.
> I guess they need a similar adjustment?

Yeah, will fix.

Thanks,
Richard


Re: [PATCH 2/3] Refactor widen_plus as internal_fn

2023-05-15 Thread Richard Sandiford via Gcc-patches
Richard Biener  writes:
> On Mon, 15 May 2023, Richard Sandiford wrote:
>
>> Richard Biener  writes:
>> > But I'm also not sure
>> > how much of that is really needed (it seems to be tied around
>> > optimizing optabs space?)
>> 
>> Not sure what you mean by "this".  Optabs space shouldn't be a problem
>> though.  The optab encoding gives us a full int to play with, and it
>> could easily go up to 64 bits if necessary/convenient.
>> 
>> At least on the internal-fn.* side, the aim is really just to establish
>> a regular structure, so that we don't have arbitrary differences between
>> different widening operations, or too much cut-&-paste.
>
> Hmm, I'm looking at the need for the std::map and 
> internal_fn_hilo_keys_array and internal_fn_hilo_values_array.
> The vectorizer pieces contain
>
> +  if (code.is_fn_code ())
> + {
> +  internal_fn ifn = as_internal_fn ((combined_fn) code);
> +  gcc_assert (decomposes_to_hilo_fn_p (ifn));
> +
> +  internal_fn lo, hi;
> +  lookup_hilo_internal_fn (ifn, &lo, &hi);
> +  *code1 = as_combined_fn (lo);
> +  *code2 = as_combined_fn (hi);
> +  optab1 = lookup_hilo_ifn_optab (lo, !TYPE_UNSIGNED (vectype));
> +  optab2 = lookup_hilo_ifn_optab (hi, !TYPE_UNSIGNED (vectype));
>
> so that tries to automatically associate the scalar widening IFN
> with the set(s) of IFN pairs we can split to.  But then this
> list should be static and there's no need to create a std::map?
> Maybe gencfn-macros.cc can be enhanced to output these static
> cases?  Or the vectorizer could (as it did previously) simply
> open-code the handled cases (I guess since we deal with two
> cases only now I'd prefer that).

Ah, yeah, I pushed back against that too.  I think it should be possible
to do it using the preprocessor, if the macros are defined appropriately.
But if it isn't possible to do it with macros then I agree that a
generator would be better than initialisation within the compiler.
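
For reference, the open-coded option mentioned above could be as
simple as the sketch below.  The IFN names follow the cover letter
(IFN_VEC_WIDEN_MINUS* assumed by analogy), so treat it as purely
illustrative:

  /* Hypothetical helper: map the combined IFN to its hi/lo pair
     without any lookup table or std::map.  */
  static bool
  widen_fn_hilo_parts (internal_fn ifn, internal_fn *lo, internal_fn *hi)
  {
    switch (ifn)
      {
      case IFN_VEC_WIDEN_PLUS:
        *lo = IFN_VEC_WIDEN_PLUS_LO, *hi = IFN_VEC_WIDEN_PLUS_HI;
        return true;
      case IFN_VEC_WIDEN_MINUS:
        *lo = IFN_VEC_WIDEN_MINUS_LO, *hi = IFN_VEC_WIDEN_MINUS_HI;
        return true;
      default:
        return false;
      }
  }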

Thanks,
Richard


Re: [PATCH 2/3] Refactor widen_plus as internal_fn

2023-05-15 Thread Richard Sandiford via Gcc-patches
Richard Biener  writes:
> On Fri, 12 May 2023, Richard Sandiford wrote:
>
>> Richard Biener  writes:
>> > On Fri, 12 May 2023, Andre Vieira (lists) wrote:
>> >
>> >> I have dealt with, I think..., most of your comments. There's quite a few
>> >> changes, I think it's all a bit simpler now. I made some other changes to 
>> >> the
>> >> costing in tree-inline.cc and gimple-range-op.cc in which I try to 
>> >> preserve
>> >> the same behaviour as we had with the tree codes before. Also added some 
>> >> extra
>> >> checks to tree-cfg.cc that made sense to me.
>> >> 
>> >> I am still regression testing the gimple-range-op change, as that was a 
>> >> last
>> >> minute change, but the rest survived a bootstrap and regression test on
>> >> aarch64-unknown-linux-gnu.
>> >> 
>> >> cover letter:
>> >> 
>> >> This patch replaces the existing tree_code widen_plus and widen_minus
>> >> patterns with internal_fn versions.
>> >> 
>> >> DEF_INTERNAL_OPTAB_WIDENING_HILO_FN and 
>> >> DEF_INTERNAL_OPTAB_NARROWING_HILO_FN
>> >> are like DEF_INTERNAL_SIGNED_OPTAB_FN and DEF_INTERNAL_OPTAB_FN 
>> >> respectively
>> >> except they provide convenience wrappers for defining conversions that 
>> >> require
>> >> a hi/lo split.  Each definition for  will require optabs for _hi 
>> >> and _lo
>> >> and each of those will also require a signed and unsigned version in the 
>> >> case
>> >> of widening. The hi/lo pair is necessary because the widening and 
>> >> narrowing
>> >> operations take n narrow elements as inputs and return n/2 wide elements 
>> >> as
>> >> outputs. The 'lo' operation operates on the first n/2 elements of input. 
>> >> The
>> >> 'hi' operation operates on the second n/2 elements of input. Defining an
>> >> internal_fn along with hi/lo variations allows a single internal function 
>> >> to
>> >> be returned from a vect_recog function that will later be expanded to 
>> >> hi/lo.
>> >> 
>> >> 
>> >>  For example:
>> >>  IFN_VEC_WIDEN_PLUS -> IFN_VEC_WIDEN_PLUS_HI, IFN_VEC_WIDEN_PLUS_LO
>> >> for aarch64: IFN_VEC_WIDEN_PLUS_HI   -> vec_widen_add_hi_ ->
>> >> (u/s)addl2
>> >>IFN_VEC_WIDEN_PLUS_LO  -> 
>> >> vec_widen_add_lo_
>> >> -> (u/s)addl
>> >> 
>> >> This gives the same functionality as the previous WIDEN_PLUS/WIDEN_MINUS 
>> >> tree
>> >> codes which are expanded into VEC_WIDEN_PLUS_LO, VEC_WIDEN_PLUS_HI.
>> >
>> > What I still don't understand is how we are so narrowly focused on
>> > HI/LO?  We need a combined scalar IFN for pattern selection (not
>> > sure why that's now called _HILO, I expected no suffix).  Then there's
>> > three possibilities the target can implement this:
>> >
>> >  1) with a widen_[su]add instruction - I _think_ that's what
>> > RISCV is going to offer since it is a target where vector modes
>> > have "padding" (aka you cannot subreg a V2SI to get V4HI).  Instead
>> > RVV can do a V4HI to V4SI widening and widening add/subtract
>> > using vwadd[u] and vwsub[u] (the HI->SI widening is actually
>> > done with a widening add of zero - eh).
>> > IIRC GCN is the same here.
>> 
>> SVE currently does this too, but the addition and widening are
>> separate operations.  E.g. in principle there's no reason why
>> you can't sign-extend one operand, zero-extend the other, and
>> then add the result together.  Or you could extend them from
>> different sizes (QI and HI).  All of those are supported
>> (if the costing allows them).
>
> I see.  So why does the target then expose widen_[su]add at all?

It shouldn't (need to) do that.  I don't think we should have an optab
for the unsplit operation.

At least on SVE, we really want the extensions to be fused with loads
(where possible) rather than with arithmetic.

We can still do the widening arithmetic in one go.  It's just that
fusing with the loads works for the mixed-sign and mixed-size cases,
and can handle more than just doubling the element size.

>> If the target has operations to do combined extending and adding (or
>> whatever), then at the moment we rely on combine to generate them.
>> 
>> So I think this case is separate from Andre's work.  The addition
>> itself is just an ordinary addition, and any widening happens by
>> vectorising a CONVERT/NOP_EXPR.
>> 
>> >  2) with a widen_[su]add{_lo,_hi} combo - that's what the tree
>> > codes currently support (exclusively)
>> >  3) similar, but widen_[su]add{_even,_odd}
>> >
>> > that said, things like decomposes_to_hilo_fn_p look to paint us into
>> > a 2) corner without good reason.
>> 
>> I suppose one question is: how much of the patch is really specific
>> to HI/LO, and how much is just grouping two halves together?
>
> Yep, that I don't know for sure.
>
>>  The nice
>> thing about the internal-fn grouping macros is that, if (3) is
>> implemented in future, the structure will strongly encourage even/odd
>> pairs to be supported for all operations that support hi/lo.  That is,
>> I would expect the grouping macros to be extended to define even/odd
>> ifns alongside hi/lo ones, rather than adding separate definitions
>> for even/odd functions.

Re: [PATCH] aarch64: Add SVE instruction types

2023-05-15 Thread Richard Sandiford via Gcc-patches
Evandro Menezes via Gcc-patches  writes:
> This patch adds the attribute `type` to most SVE1 instructions, as in the 
> other
> instructions.

Thanks for doing this.

Could you say what criteria you used for picking the granularity?  Other
maintainers might disagree, but personally I'd prefer to distinguish two
instructions only if:

(a) a scheduling description really needs to distinguish them or
(b) grouping them together would be very artificial (because they're
logically unrelated)

It's always possible to split types later if new scheduling descriptions
require it.  Because of that, I don't think we should try to predict ahead
of time what future scheduling descriptions will need.

Of course, this depends on having results that show that scheduling
makes a significant difference on an SVE core.  I think one of the
problems here is that, when a different scheduling model changes the
performance of a particular test, it's difficult to tell whether
the gain/loss is caused by the model being more/less accurate than
the previous one, or if it's due to important "secondary" effects
on register live ranges.  Instinctively, I'd have expected these
secondary effects to dominate on OoO cores.

Richard

>
> --
> Evandro Menezes
>
>
>
> From be61df66d1a86bc7ec415eb23504002831c67c51 Mon Sep 17 00:00:00 2001
> From: Evandro Menezes 
> Date: Mon, 8 May 2023 17:39:10 -0500
> Subject: [PATCH 2/3] aarch64: Add SVE instruction types
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64-sve.md: Use the instruction types.
>   * config/arm/types.md: (sve_loop_p, sve_loop_ps, sve_loop_gs,
> sve_loop_end, sve_logic_p, sve_logic_ps, sve_cnt_p,
> sve_cnt_pv, sve_cnt_pvx, sve_rev_p, sve_sel_p, sve_set_p,
> sve_set_ps, sve_trn_p, sve_upk_p, sve_zip_p, sve_arith,
> sve_arith_r, sve_arith_sat, sve_arith_sat_x, sve_arith_x,
> sve_logic, sve_logic_r, sve_logic_x, sve_shift, sve_shift_d,
> sve_shift_dx, sve_shift_x, sve_compare_s, sve_cnt, sve_cnt_x,
> sve_copy, sve_copy_g, sve_move, sve_move_x, sve_move_g,
> sve_permute, sve_splat, sve_splat_m, sve_splat_g, sve_cext,
> sve_cext_x, sve_cext_g, sve_ext, sve_ext_x, sve_sext,
> sve_sext_x, sve_uext, sve_uext_x, sve_index, sve_index_g,
> sve_ins, sve_ins_x, sve_ins_g, sve_ins_gx, sve_rev, sve_rev_x,
> sve_tbl, sve_trn, sve_upk, sve_zip, sve_int_to_fp,
> sve_int_to_fp_x, sve_fp_to_int, sve_fp_to_int_x, sve_fp_to_fp,
> sve_fp_to_fp_x, sve_fp_round, sve_fp_round_x, sve_bf_to_fp,
> sve_bf_to_fp_x, sve_div, sve_div_x, sve_dot, sve_dot_x,
> sve_mla, sve_mla_x, sve_mmla, sve_mmla_x, sve_mul, sve_mul_x,
> sve_prfx, sve_fp_arith, sve_fp_arith_a, sve_fp_arith_c,
> sve_fp_arith_cx, sve_fp_arith_r, sve_fp_arith_x,
> sve_fp_compare, sve_fp_copy, sve_fp_move, sve_fp_move_x,
> sve_fp_div_d, sve_fp_div_dx, sve_fp_div_s, sve_fp_div_sx
> sve_fp_dot, sve_fp_mla, sve_fp_mla_x, sve_fp_mla_c,
> sve_fp_mla_cx, sve_fp_mla_t, sve_fp_mla_tx, sve_fp_mmla,
> sve_fp_mmla_x, sve_fp_mul, sve_fp_mul_x, sve_fp_sqrt_d,
> sve_fp_sqrt_dx, sve_fp_sqrt_s, sve_fp_sqrt_sx, sve_fp_trig,
> sve_fp_trig_x, sve_fp_estimate, sve_fp_step, sve_bf_dot,
> sve_bf_dot_x, sve_bf_mla, sve_bf_mla_x, sve_bf_mmla,
> sve_bf_mmla_x, sve_ldr, sve_ldr_p, sve_load1,
> sve_load1_gather_d, sve_load1_gather_dl, sve_load1_gather_du,
> sve_load1_gather_s, sve_load1_gather_sl, sve_load1_gather_su,
> sve_load2, sve_load3, sve_load4, sve_str, sve_str_p,
> sve_store1, sve_store1_scatter, sve_store2, sve_store3,
> sve_store4, sve_rd_ffr, sve_rd_ffr_p, sve_rd_ffr_ps,
> sve_wr_ffr): New types.
>
> Signed-off-by: Evandro Menezes 
> ---
>  gcc/config/aarch64/aarch64-sve.md | 632 ++
>  gcc/config/arm/types.md   | 342 
>  2 files changed, 819 insertions(+), 155 deletions(-)
>
> diff --git a/gcc/config/aarch64/aarch64-sve.md 
> b/gcc/config/aarch64/aarch64-sve.md
> index 2898b85376b..58c5cb2ddbc 100644
> --- a/gcc/config/aarch64/aarch64-sve.md
> +++ b/gcc/config/aarch64/aarch64-sve.md
> @@ -699,6 +699,7 @@
> str\t%1, %0
> mov\t%0.d, %1.d
> * return aarch64_output_sve_mov_immediate (operands[1]);"
> +  [(set_attr "type" "sve_ldr, sve_str, sve_move, *")] 
>  )
>  
>  ;; Unpredicated moves that cannot use LDR and STR, i.e. partial vectors
> @@ -714,6 +715,7 @@
>"@
> mov\t%0.d, %1.d
> * return aarch64_output_sve_mov_immediate (operands[1]);"
> +  [(set_attr "type" "sve_move, sve_move_x")]
>  )
>  
>  ;; Handle memory reloads for modes that can't use LDR and STR.  We use
> @@ -758,6 +760,8 @@
>"&& register_operand (operands[0], mode)
> && register_operand (operands[2], mode)"
>[(set (match_dup 0) (match_dup 2))]
> +  ""
> +  [(set_attr "type" "sve_load1, sve_store1, *")]
>  )
>  
>  ;; A pattern for optimizing 

Re: [PATCH 2/3] Refactor widen_plus as internal_fn

2023-05-12 Thread Richard Sandiford via Gcc-patches
Richard Biener  writes:
> On Fri, 12 May 2023, Andre Vieira (lists) wrote:
>
>> I have dealt with, I think..., most of your comments. There's quite a few
>> changes, I think it's all a bit simpler now. I made some other changes to the
>> costing in tree-inline.cc and gimple-range-op.cc in which I try to preserve
>> the same behaviour as we had with the tree codes before. Also added some 
>> extra
>> checks to tree-cfg.cc that made sense to me.
>> 
>> I am still regression testing the gimple-range-op change, as that was a last
>> minute change, but the rest survived a bootstrap and regression test on
>> aarch64-unknown-linux-gnu.
>> 
>> cover letter:
>> 
>> This patch replaces the existing tree_code widen_plus and widen_minus
>> patterns with internal_fn versions.
>> 
>> DEF_INTERNAL_OPTAB_WIDENING_HILO_FN and DEF_INTERNAL_OPTAB_NARROWING_HILO_FN
>> are like DEF_INTERNAL_SIGNED_OPTAB_FN and DEF_INTERNAL_OPTAB_FN respectively
>> except they provide convenience wrappers for defining conversions that 
>> require
>> a hi/lo split.  Each definition for  will require optabs for _hi and 
>> _lo
>> and each of those will also require a signed and unsigned version in the case
>> of widening. The hi/lo pair is necessary because the widening and narrowing
>> operations take n narrow elements as inputs and return n/2 wide elements as
>> outputs. The 'lo' operation operates on the first n/2 elements of input. The
>> 'hi' operation operates on the second n/2 elements of input. Defining an
>> internal_fn along with hi/lo variations allows a single internal function to
>> be returned from a vect_recog function that will later be expanded to hi/lo.
>> 
>> 
>>  For example:
>>  IFN_VEC_WIDEN_PLUS -> IFN_VEC_WIDEN_PLUS_HI, IFN_VEC_WIDEN_PLUS_LO
>> for aarch64: IFN_VEC_WIDEN_PLUS_HI   -> vec_widen_add_hi_ ->
>> (u/s)addl2
>>IFN_VEC_WIDEN_PLUS_LO  -> vec_widen_add_lo_
>> -> (u/s)addl
>> 
>> This gives the same functionality as the previous WIDEN_PLUS/WIDEN_MINUS tree
>> codes which are expanded into VEC_WIDEN_PLUS_LO, VEC_WIDEN_PLUS_HI.
>
> What I still don't understand is how we are so narrowly focused on
> HI/LO?  We need a combined scalar IFN for pattern selection (not
> sure why that's now called _HILO, I expected no suffix).  Then there's
> three possibilities the target can implement this:
>
>  1) with a widen_[su]add instruction - I _think_ that's what
> RISCV is going to offer since it is a target where vector modes
> have "padding" (aka you cannot subreg a V2SI to get V4HI).  Instead
> RVV can do a V4HI to V4SI widening and widening add/subtract
> using vwadd[u] and vwsub[u] (the HI->SI widening is actually
> done with a widening add of zero - eh).
> IIRC GCN is the same here.

SVE currently does this too, but the addition and widening are
separate operations.  E.g. in principle there's no reason why
you can't sign-extend one operand, zero-extend the other, and
then add the result together.  Or you could extend them from
different sizes (QI and HI).  All of those are supported
(if the costing allows them).

If the target has operations to do combined extending and adding (or
whatever), then at the moment we rely on combine to generate them.

So I think this case is separate from Andre's work.  The addition
itself is just an ordinary addition, and any widening happens by
vectorising a CONVERT/NOP_EXPR.
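
As a concrete illustration (not taken from the patch), a loop like
this vectorises as two extensions feeding an ordinary vector add,
with any fused widening-add instructions left to combine:

  void
  f (int *restrict c, short *restrict a, short *restrict b, int n)
  {
    for (int i = 0; i < n; i++)
      c[i] = a[i] + b[i];   /* extend a[i] and b[i], then a plain add */
  }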

>  2) with a widen_[su]add{_lo,_hi} combo - that's what the tree
> codes currently support (exclusively)
>  3) similar, but widen_[su]add{_even,_odd}
>
> that said, things like decomposes_to_hilo_fn_p look to paint us into
> a 2) corner without good reason.

I suppose one question is: how much of the patch is really specific
to HI/LO, and how much is just grouping two halves together?  The nice
thing about the internal-fn grouping macros is that, if (3) is
implemented in future, the structure will strongly encourage even/odd
pairs to be supported for all operations that support hi/lo.  That is,
I would expect the grouping macros to be extended to define even/odd
ifns alongside hi/lo ones, rather than adding separate definitions
for even/odd functions.

If so, at least from the internal-fn.* side of things, I think the question
is whether it's OK to stick with hilo names for now, or whether we should
use more forward-looking names.

Thanks,
Richard

>
> Richard.
>
>> gcc/ChangeLog:
>> 
>> 2023-05-12  Andre Vieira  
>> Joel Hutton  
>> Tamar Christina  
>> 
>> * config/aarch64/aarch64-simd.md (vec_widen_addl_lo_):
>> Rename
>> this ...
>> (vec_widen_add_lo_): ... to this.
>> (vec_widen_addl_hi_): Rename this ...
>> (vec_widen_add_hi_): ... to this.
>> (vec_widen_subl_lo_): Rename this ...
>> (vec_widen_sub_lo_): ... to this.
>> (vec_widen_subl_hi_): Rename this ...
>> (vec_widen_sub_hi_): ...to this.
>> * doc/generic.texi: Document new IFN codes.
>>  * 

Re: [PATCH V6] VECT: Add decrement IV support in Loop Vectorizer

2023-05-12 Thread Richard Sandiford via Gcc-patches
"juzhe.zhong"  writes:
> Hi, Richard.  For "can iterate more than once", is it correct use the 
> condition
> "LOOP_LENS ().length >1". 

No, that says whether any LOAD_LENs or STORE_LENs operate on multiple
vectors, rather than just single vectors.

I meant: whether the vector loop body might be executed more than once
(i.e. whether the branch-back condition can be true).

This is true for a scalar loop that goes from 0 to some unbounded
variable n.  It's false for a scalar loop that goes from 0 to 6,
if the vectors are known to have at least 8 elements.
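
In code terms (schematic):

  for (i = 0; i < n; ++i)   /* n unknown: the vector body may run
                               more than once */
    a[i] += 1;

  for (i = 0; i < 6; ++i)   /* known bound: with >= 8-element vectors,
                               the vector body runs at most once */
    a[i] += 1;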

Thanks,
Richard

>  Replied Message 
>
> From  Richard Sandiford
>
> Date  05/12/2023 19:39
>
> Tojuzhe.zhong
>
> Ccgcc-patches@gcc.gnu.org,
>   kito.ch...@gmail.com,
>   pal...@dabbelt.com,
>   richard.guent...@gmail.com
>
> Subject   Re: [PATCH V6] VECT: Add decrement IV support in Loop Vectorizer
>
> "juzhe.zhong"  writes:
>> Thanks Richard.
>>  I will do that as you suggested. I have a question for the first patch. How
> to
>> enable decrement IV? Should I add a target hook or something to let target
>> decide whether enable decrement IV?
>
> At the moment, the only other targets that use IFN_LOAD_LEN and
> IFN_STORE_LEN are PowerPC and s390.  Both targets default to
> --param vect-partial-vector-usage=1 (i.e. use partial vectors
> for epilogues only).
>
> So I think the condition should be that the loop:
>
>  (a) uses length "controls"; and
>  (b) can iterate more than once
>
> No target checks should be needed.
>
> Thanks,
> Richard
>
>>  Replied Message 
>>
>> From  Richard Sandiford
>>
>> Date  05/12/2023 19:08
>>
>> Tojuzhe.zh...@rivai.ai
>>
>> Ccgcc-patches@gcc.gnu.org,
>>   kito.ch...@gmail.com,
>>   pal...@dabbelt.com,
>>   richard.guent...@gmail.com
>>
>> Subject   Re: [PATCH V6] VECT: Add decrement IV support in Loop Vectorizer
>>
>> juzhe.zh...@rivai.ai writes:
>>> From: Ju-Zhe Zhong 
>>>
>>> 1. Fix document description according Jeff && Richard.
>>> 2. Add LOOP_VINFO_USING_SELECT_VL_P for single rgroup.
>>> 3. Add LOOP_VINFO_USING_SLP_ADJUSTED_LEN_P for SLP multiple rgroup.
>>>
>>> Fix bugs for V5 after testing:
>>> https://gcc.gnu.org/pipermail/gcc-patches/2023-May/618209.html
>>>
>>> gcc/ChangeLog:
>>>
>>> * doc/md.texi: Add select_vl pattern.
>>> * internal-fn.def (SELECT_VL): New ifn.
>>> * optabs.def (OPTAB_D): New optab.
>>> * tree-vect-loop-manip.cc (vect_adjust_loop_lens): New function.
>>> (vect_set_loop_controls_by_select_vl): Ditto.
>>> (vect_set_loop_condition_partial_vectors): Add loop control for
>> decrement IV.
>>> * tree-vect-loop.cc (vect_get_loop_len): Adjust loop len for SLP.
>>> * tree-vect-stmts.cc (get_select_vl_data_ref_ptr): New function.
>>> (vectorizable_store): Support data reference IV added by outcome of
>> SELECT_VL.
>>> (vectorizable_load): Ditto.
>>> * tree-vectorizer.h (LOOP_VINFO_USING_SELECT_VL_P): New macro.
>>> (LOOP_VINFO_USING_SLP_ADJUSTED_LEN_P): Ditto.
>>> (vect_get_loop_len): Adjust loop len for SLP.
>>>
>>> ---
>>>  gcc/doc/md.texi |  36 
>>>  gcc/internal-fn.def |   1 +
>>>  gcc/optabs.def  |   1 +
>>>  gcc/tree-vect-loop-manip.cc | 380 +++-
>>>  gcc/tree-vect-loop.cc   |  31 ++-
>>>  gcc/tree-vect-stmts.cc  |  79 +++-
>>>  gcc/tree-vectorizer.h   |  12 +-
>>>  7 files changed, 526 insertions(+), 14 deletions(-)
>>>
>>> diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
>>> index 8ebce31ba78..a94ffc4456d 100644
>>> --- a/gcc/doc/md.texi
>>> +++ b/gcc/doc/md.texi
>>> @@ -4974,6 +4974,42 @@ for (i = 1; i < operand3; i++)
>>>operand0[i] = operand0[i - 1] && (operand1 + i < operand2);
>>>  @end smallexample
>>>  
>>> +@cindex @code{select_vl@var{m}} instruction pattern
>>> +@item @code{select_vl@var{m}}
>>> +Set operand 0 to the number of active elements in a vector to be updated
>>> +in a loop iteration based on the total number of elements to be updated,
>>> +the vectorization factor and vector properties of the target.
>>> +operand 1 is the total elements in the vector to be updated.
>>> +operand 2 is the vectorization factor.
>>> +The value of operand 0 is target dependent and flexible in each iteration.
>>> +The operation of this pattern can be:
>>> +
>>> +@smallexample
>>> +Case 1:
>>> +operand0 = MIN (operand1, operand2);
>>> +operand2 can be const_poly_int or poly_int related to vector mode size.
>>> +Some target like RISC-V has a standalone instruction to get MIN (n, MODE
>> SIZE) so
>>> +that we can reduce a use of general purpose register.
>>> +
>>> +In this case, only the last iteration of the loop is partial iteration.
>>> +@end smallexample
>>> +
>>> +@smallexample
>>> +Case 2:
>>> +if (operand1 <= operand2)
>>> +  operand0 = operand1;
>>> +else if (operand1 < 2 * operand2)
>>> +  

Re: [PATCH V6] VECT: Add decrement IV support in Loop Vectorizer

2023-05-12 Thread Richard Sandiford via Gcc-patches
"juzhe.zhong"  writes:
> Thanks Richard.
>  I will do that as you suggested. I have a question for the first patch. How 
> to
> enable decrement IV? Should I add a target hook or something to let target
> decide whether enable decrement IV?

At the moment, the only other targets that use IFN_LOAD_LEN and
IFN_STORE_LEN are PowerPC and s390.  Both targets default to
--param vect-partial-vector-usage=1 (i.e. use partial vectors
for epilogues only).

So I think the condition should be that the loop:

  (a) uses length "controls"; and
  (b) can iterate more than once

No target checks should be needed.

Thanks,
Richard

>  Replied Message 
>
> From  Richard Sandiford
>
> Date  05/12/2023 19:08
>
> Tojuzhe.zh...@rivai.ai
>
> Ccgcc-patches@gcc.gnu.org,
>   kito.ch...@gmail.com,
>   pal...@dabbelt.com,
>   richard.guent...@gmail.com
>
> Subject   Re: [PATCH V6] VECT: Add decrement IV support in Loop Vectorizer
>
> juzhe.zh...@rivai.ai writes:
>> From: Ju-Zhe Zhong 
>>
>> 1. Fix document description according Jeff && Richard.
>> 2. Add LOOP_VINFO_USING_SELECT_VL_P for single rgroup.
>> 3. Add LOOP_VINFO_USING_SLP_ADJUSTED_LEN_P for SLP multiple rgroup.
>>
>> Fix bugs for V5 after testing:
>> https://gcc.gnu.org/pipermail/gcc-patches/2023-May/618209.html
>>
>> gcc/ChangeLog:
>>
>> * doc/md.texi: Add select_vl pattern.
>> * internal-fn.def (SELECT_VL): New ifn.
>> * optabs.def (OPTAB_D): New optab.
>> * tree-vect-loop-manip.cc (vect_adjust_loop_lens): New function.
>> (vect_set_loop_controls_by_select_vl): Ditto.
>> (vect_set_loop_condition_partial_vectors): Add loop control for
> decrement IV.
>> * tree-vect-loop.cc (vect_get_loop_len): Adjust loop len for SLP.
>> * tree-vect-stmts.cc (get_select_vl_data_ref_ptr): New function.
>> (vectorizable_store): Support data reference IV added by outcome of
> SELECT_VL.
>> (vectorizable_load): Ditto.
>> * tree-vectorizer.h (LOOP_VINFO_USING_SELECT_VL_P): New macro.
>> (LOOP_VINFO_USING_SLP_ADJUSTED_LEN_P): Ditto.
>> (vect_get_loop_len): Adjust loop len for SLP.
>>
>> ---
>>  gcc/doc/md.texi |  36 
>>  gcc/internal-fn.def |   1 +
>>  gcc/optabs.def  |   1 +
>>  gcc/tree-vect-loop-manip.cc | 380 +++-
>>  gcc/tree-vect-loop.cc   |  31 ++-
>>  gcc/tree-vect-stmts.cc  |  79 +++-
>>  gcc/tree-vectorizer.h   |  12 +-
>>  7 files changed, 526 insertions(+), 14 deletions(-)
>>
>> diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
>> index 8ebce31ba78..a94ffc4456d 100644
>> --- a/gcc/doc/md.texi
>> +++ b/gcc/doc/md.texi
>> @@ -4974,6 +4974,42 @@ for (i = 1; i < operand3; i++)
>>operand0[i] = operand0[i - 1] && (operand1 + i < operand2);
>>  @end smallexample
>>  
>> +@cindex @code{select_vl@var{m}} instruction pattern
>> +@item @code{select_vl@var{m}}
>> +Set operand 0 to the number of active elements in a vector to be updated
>> +in a loop iteration based on the total number of elements to be updated,
>> +the vectorization factor and vector properties of the target.
>> +operand 1 is the total elements in the vector to be updated.
>> +operand 2 is the vectorization factor.
>> +The value of operand 0 is target dependent and flexible in each iteration.
>> +The operation of this pattern can be:
>> +
>> +@smallexample
>> +Case 1:
>> +operand0 = MIN (operand1, operand2);
>> +operand2 can be const_poly_int or poly_int related to vector mode size.
>> +Some target like RISC-V has a standalone instruction to get MIN (n, MODE
> SIZE) so
>> +that we can reduce a use of general purpose register.
>> +
>> +In this case, only the last iteration of the loop is partial iteration.
>> +@end smallexample
>> +
>> +@smallexample
>> +Case 2:
>> +if (operand1 <= operand2)
>> +  operand0 = operand1;
>> +else if (operand1 < 2 * operand2)
>> +  operand0 = ceil (operand1 / 2);
>> +else
>> +  operand0 = operand2;
>> +
>> +This case will evenly distribute work over the last 2 iterations of a
> stripmine loop.
>> +@end smallexample
>> +
>> +The output of this pattern is not only used as IV of loop control counter,
> but also
>> +is used as the IV of address calculation with multiply/shift operation. This
> allows
>> +dynamic adjustment of the number of elements processed each loop iteration.
>> +
>
> I don't think we need to restrict the definition to the two RVV cases.
> How about:
>
> ---
> Set operand 0 to the number of scalar iterations that should be handled
> by one iteration of a vector loop.  Operand 1 is the total number of
> scalar iterations that the loop needs to process and operand 2 is a
> maximum bound on the result (also known as the maximum ``vectorization
> factor'').
>
> The maximum value of operand 0 is given by:
> @smallexample
> operand0 = MIN (operand1, operand2)
> @end smallexample
> 

Re: [PATCH] Machine_Mode: Extend machine_mode from 8 to 16 bits

2023-05-12 Thread Richard Sandiford via Gcc-patches
"Li, Pan2 via Gcc-patches"  writes:
> Thanks Richard for comments. In previous, I am not sure it is reasonable to 
> let everywhere consume the same macro in rtl.h (As the includes you 
> mentioned). Thus, make a conservative change in PATCH v1.
>
> I will address the comments and try to align the bit size to the one and the 
> only one macro soon.

Sorry, I should have thought about this earlier, but it would
probably make sense to name the macro MACHINE_MODE_BITSIZE and
define it in machmode.h rather than rtl.h.  (The rtx_code stuff
should stay as-is.)
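
I.e. roughly (sketch):

  /* machmode.h */
  #define MACHINE_MODE_BITSIZE 16

with the bitfields in the patch then referring to that instead of an
rtl.h macro.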

Thanks,
Richard

>
> Pan
>
>
> -Original Message-
> From: Richard Sandiford  
> Sent: Friday, May 12, 2023 4:24 PM
> To: Li, Pan2 
> Cc: gcc-patches@gcc.gnu.org; juzhe.zh...@rivai.ai; kito.ch...@sifive.com; 
> Wang, Yanzhang ; jeffreya...@gmail.com; 
> rguent...@suse.de
> Subject: Re: [PATCH] Machine_Mode: Extend machine_mode from 8 to 16 bits
>
> pan2...@intel.com writes:
>> From: Pan Li 
>>
>> We are running out of the machine_mode(8 bits) in RISC-V backend. Thus 
>> we would like to extend the machine mode bit size from 8 to 16 bits.
>> However, it is sensitive to extend the memory size in common structure 
>> like tree or rtx. This patch would like to extend the machine mode 
>> bits to 16 bits by shrinking, like:
>>
>> * Swap the bit size of code and machine mode in rtx_def.
>> * Reconcile the machine_mode location and spare in tree.
>>
>> The memory impact of this patch for correlated structure looks like below:
>>
>> +---+--+-+--+
>> | struct/bytes  | upstream | patched | diff |
>> +---+--+-+--+
>> | rtx_obj_reference |8 |  12 |   +4 |
>> | ext_modified  |2 |   3 |   +1 |
>> | ira_allocno   |  192 | 200 |   +8 |
>> | qty_table_elem|   40 |  40 |0 |
>> | reg_stat_type |   64 |  64 |0 |
>> | rtx_def   |   40 |  40 |0 |
>> | table_elt |   80 |  80 |0 |
>> | tree_decl_common  |  112 | 112 |0 |
>> | tree_type_common  |  128 | 128 |0 |
>> +---+--+-+--+
>>
>> The tree and rtx related struct has no memory changes after this 
>> patch, and the machine_mode changes to 16 bits already.
>>
>> Signed-off-by: Pan Li 
>> Co-authored-by: Ju-Zhe Zhong 
>> Co-authored-by: Kito Cheng 
>>
>> gcc/ChangeLog:
>>
>>  * combine.cc (struct reg_stat_type): Extended machine mode to 16 bits.
>>  * cse.cc (struct qty_table_elem): Ditto.
>>  (struct table_elt): Ditto.
>>  (struct set): Ditto.
>>  * genopinit.cc (main): Reconciled the machine mode limit.
>>  * ira-int.h (struct ira_allocno): Extended machine mode to 16 bits.
>>  * ree.cc (struct ATTRIBUTE_PACKED): Ditto.
>>  * rtl-ssa/accesses.h: Ditto.
>>  * rtl.h (RTX_CODE_BITSIZE): New macro.
>>  (RTX_MACHINE_MODE_BITSIZE): Ditto.
>>  (struct GTY): Swap bit size between code and machine mode.
>>  (subreg_shape::unique_id): Reconciled the machine mode limit.
>>  * rtlanal.h: Extended machine mode to 16 bits.
>>  * tree-core.h (struct tree_type_common): Ditto.
>>  (struct tree_decl_common): Reconciled the location and extended
>>  bit size of machine mode.
>> ---
>>  gcc/combine.cc |  4 ++--
>>  gcc/cse.cc |  8 
>>  gcc/genopinit.cc   |  3 ++-
>>  gcc/ira-int.h  | 12 
>>  gcc/ree.cc |  2 +-
>>  gcc/rtl-ssa/accesses.h |  6 --
>>  gcc/rtl.h  |  9 ++---
>>  gcc/rtlanal.h  |  5 +++--
>>  gcc/tree-core.h| 11 ---
>>  9 files changed, 38 insertions(+), 22 deletions(-)
>>
>> diff --git a/gcc/combine.cc b/gcc/combine.cc index 
>> 5aa0ec5c45a..bdf6f635c80 100644
>> --- a/gcc/combine.cc
>> +++ b/gcc/combine.cc
>> @@ -200,7 +200,7 @@ struct reg_stat_type {
>>  
>>unsigned HOST_WIDE_INTlast_set_nonzero_bits;
>>char  last_set_sign_bit_copies;
>> -  ENUM_BITFIELD(machine_mode)   last_set_mode : 8;
>> +  ENUM_BITFIELD(machine_mode)   last_set_mode : 
>> RTX_MACHINE_MODE_BITSIZE;
>>  
>>/* Set nonzero if references to register n in expressions should not be
>>   used.  last_set_invalid is set nonzero when this register is 
>> being @@ -235,7 +235,7 @@ struct reg_stat_type {
>>   truncation if we know that value already contains a truncated
>>   value.  */
>>  
>> -  ENUM_BITFIELD(machine_mode)   truncated_to_mode : 8;
>> +  ENUM_BITFIELD(machine_mode)   truncated_to_mode : 
>> RTX_MACHINE_MODE_BITSIZE;
>>  };
>>  
>>  
>> diff --git a/gcc/cse.cc b/gcc/cse.cc
>> index b10c9b0c94d..fe594c1bc3d 100644
>> --- a/gcc/cse.cc
>> +++ b/gcc/cse.cc
>> @@ -250,8 +250,8 @@ struct qty_table_elem
>>unsigned int first_reg, last_reg;
>>/* The sizes of these fields should match the sizes of the
>>   code and mode fields of struct rtx_def (see rtl.h).  */
>
> The 

Re: [PATCH V6] VECT: Add decrement IV support in Loop Vectorizer

2023-05-12 Thread Richard Sandiford via Gcc-patches
juzhe.zh...@rivai.ai writes:
> From: Ju-Zhe Zhong 
>
> 1. Fix document description according Jeff && Richard.
> 2. Add LOOP_VINFO_USING_SELECT_VL_P for single rgroup.
> 3. Add LOOP_VINFO_USING_SLP_ADJUSTED_LEN_P for SLP multiple rgroup.
>
> Fix bugs for V5 after testing:
> https://gcc.gnu.org/pipermail/gcc-patches/2023-May/618209.html
>
> gcc/ChangeLog:
>
> * doc/md.texi: Add select_vl pattern.
> * internal-fn.def (SELECT_VL): New ifn.
> * optabs.def (OPTAB_D): New optab.
> * tree-vect-loop-manip.cc (vect_adjust_loop_lens): New function.
> (vect_set_loop_controls_by_select_vl): Ditto.
> (vect_set_loop_condition_partial_vectors): Add loop control for 
> decrement IV.
> * tree-vect-loop.cc (vect_get_loop_len): Adjust loop len for SLP.
> * tree-vect-stmts.cc (get_select_vl_data_ref_ptr): New function.
> (vectorizable_store): Support data reference IV added by outcome of 
> SELECT_VL.
> (vectorizable_load): Ditto.
> * tree-vectorizer.h (LOOP_VINFO_USING_SELECT_VL_P): New macro.
> (LOOP_VINFO_USING_SLP_ADJUSTED_LEN_P): Ditto.
> (vect_get_loop_len): Adjust loop len for SLP.
>
> ---
>  gcc/doc/md.texi |  36 
>  gcc/internal-fn.def |   1 +
>  gcc/optabs.def  |   1 +
>  gcc/tree-vect-loop-manip.cc | 380 +++-
>  gcc/tree-vect-loop.cc   |  31 ++-
>  gcc/tree-vect-stmts.cc  |  79 +++-
>  gcc/tree-vectorizer.h   |  12 +-
>  7 files changed, 526 insertions(+), 14 deletions(-)
>
> diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
> index 8ebce31ba78..a94ffc4456d 100644
> --- a/gcc/doc/md.texi
> +++ b/gcc/doc/md.texi
> @@ -4974,6 +4974,42 @@ for (i = 1; i < operand3; i++)
>operand0[i] = operand0[i - 1] && (operand1 + i < operand2);
>  @end smallexample
>  
> +@cindex @code{select_vl@var{m}} instruction pattern
> +@item @code{select_vl@var{m}}
> +Set operand 0 to the number of active elements in a vector to be updated 
> +in a loop iteration based on the total number of elements to be updated, 
> +the vectorization factor and vector properties of the target.
> +operand 1 is the total elements in the vector to be updated.
> +operand 2 is the vectorization factor.
> +The value of operand 0 is target dependent and flexible in each iteration.
> +The operation of this pattern can be:
> +
> +@smallexample
> +Case 1:
> +operand0 = MIN (operand1, operand2);
> +operand2 can be const_poly_int or poly_int related to vector mode size.
> +Some target like RISC-V has a standalone instruction to get MIN (n, MODE 
> SIZE) so
> +that we can reduce a use of general purpose register.
> +
> +In this case, only the last iteration of the loop is partial iteration.
> +@end smallexample
> +
> +@smallexample
> +Case 2:
> +if (operand1 <= operand2)
> +  operand0 = operand1;
> +else if (operand1 < 2 * operand2)
> +  operand0 = ceil (operand1 / 2);
> +else
> +  operand0 = operand2;
> +
> +This case will evenly distribute work over the last 2 iterations of a 
> stripmine loop.
> +@end smallexample
> +
> +The output of this pattern is not only used as IV of loop control counter, 
> but also
> +is used as the IV of address calculation with multiply/shift operation. This 
> allows
> +dynamic adjustment of the number of elements processed each loop iteration.
> +

I don't think we need to restrict the definition to the two RVV cases.
How about:

---
Set operand 0 to the number of scalar iterations that should be handled
by one iteration of a vector loop.  Operand 1 is the total number of
scalar iterations that the loop needs to process and operand 2 is a
maximum bound on the result (also known as the maximum ``vectorization
factor'').

The maximum value of operand 0 is given by:
@smallexample
operand0 = MIN (operand1, operand2)
@end smallexample
However, targets might choose a lower value than this, based on
target-specific criteria.  Each iteration of the vector loop might
therefore process a different number of scalar iterations, which in turn
means that induction variables will have a variable step.  Because of
this, it is generally not useful to define this instruction if it will
always calculate the maximum value.

This optab is only useful on targets that implement @samp{len_load_@var{m}}
and/or @samp{len_store_@var{m}}.
---
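
To make the intended use concrete, here is a schematic of the loop
shape the vectorizer would create (not real GIMPLE, and ignoring
details such as the length bias):

  i = 0;
  while (i < n)
    {
      step = SELECT_VL (n - i, VF);    /* 0 < step <= MIN (n - i, VF) */
      data = LEN_LOAD (a + i, step);   /* process STEP scalar iterations */
      LEN_STORE (b + i, step, data);
      i += step;                       /* variable-step IV */
    }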

>  @cindex @code{check_raw_ptrs@var{m}} instruction pattern
>  @item @samp{check_raw_ptrs@var{m}}
>  Check whether, given two pointers @var{a} and @var{b} and a length @var{len},
> diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
> index 7fe742c2ae7..6f6fa7d37f9 100644
> --- a/gcc/internal-fn.def
> +++ b/gcc/internal-fn.def
> @@ -153,6 +153,7 @@ DEF_INTERNAL_OPTAB_FN (VEC_SET, 0, vec_set, vec_set)
>  DEF_INTERNAL_OPTAB_FN (LEN_STORE, 0, len_store, len_store)
>  
>  DEF_INTERNAL_OPTAB_FN (WHILE_ULT, 

Re: [PATCH] Machine_Mode: Extend machine_mode from 8 to 16 bits

2023-05-12 Thread Richard Sandiford via Gcc-patches
pan2...@intel.com writes:
> From: Pan Li 
>
> We are running out of the machine_mode(8 bits) in RISC-V backend. Thus
> we would like to extend the machine mode bit size from 8 to 16 bits.
> However, it is sensitive to extend the memory size in common structure
> like tree or rtx. This patch would like to extend the machine mode bits
> to 16 bits by shrinking, like:
>
> * Swap the bit size of code and machine mode in rtx_def.
> * Reconcile the machine_mode location and spare in tree.
>
> The memory impact of this patch for correlated structure looks like below:
>
> +---+--+-+--+
> | struct/bytes  | upstream | patched | diff |
> +---+--+-+--+
> | rtx_obj_reference |8 |  12 |   +4 |
> | ext_modified  |2 |   3 |   +1 |
> | ira_allocno   |  192 | 200 |   +8 |
> | qty_table_elem|   40 |  40 |0 |
> | reg_stat_type |   64 |  64 |0 |
> | rtx_def   |   40 |  40 |0 |
> | table_elt |   80 |  80 |0 |
> | tree_decl_common  |  112 | 112 |0 |
> | tree_type_common  |  128 | 128 |0 |
> +---+--+-+--+
>
> The tree and rtx related struct has no memory changes after this patch,
> and the machine_mode changes to 16 bits already.
>
> Signed-off-by: Pan Li 
> Co-authored-by: Ju-Zhe Zhong 
> Co-authored-by: Kito Cheng 
>
> gcc/ChangeLog:
>
>   * combine.cc (struct reg_stat_type): Extended machine mode to 16 bits.
>   * cse.cc (struct qty_table_elem): Ditto.
>   (struct table_elt): Ditto.
>   (struct set): Ditto.
>   * genopinit.cc (main): Reconciled the machine mode limit.
>   * ira-int.h (struct ira_allocno): Extended machine mode to 16 bits.
>   * ree.cc (struct ATTRIBUTE_PACKED): Ditto.
>   * rtl-ssa/accesses.h: Ditto.
>   * rtl.h (RTX_CODE_BITSIZE): New macro.
>   (RTX_MACHINE_MODE_BITSIZE): Ditto.
>   (struct GTY): Swap bit size between code and machine mode.
>   (subreg_shape::unique_id): Reconciled the machine mode limit.
>   * rtlanal.h: Extended machine mode to 16 bits.
>   * tree-core.h (struct tree_type_common): Ditto.
>   (struct tree_decl_common): Reconciled the location and extended
>   bit size of machine mode.
> ---
>  gcc/combine.cc |  4 ++--
>  gcc/cse.cc |  8 
>  gcc/genopinit.cc   |  3 ++-
>  gcc/ira-int.h  | 12 
>  gcc/ree.cc |  2 +-
>  gcc/rtl-ssa/accesses.h |  6 --
>  gcc/rtl.h  |  9 ++---
>  gcc/rtlanal.h  |  5 +++--
>  gcc/tree-core.h| 11 ---
>  9 files changed, 38 insertions(+), 22 deletions(-)
>
> diff --git a/gcc/combine.cc b/gcc/combine.cc
> index 5aa0ec5c45a..bdf6f635c80 100644
> --- a/gcc/combine.cc
> +++ b/gcc/combine.cc
> @@ -200,7 +200,7 @@ struct reg_stat_type {
>  
>unsigned HOST_WIDE_INT last_set_nonzero_bits;
>char   last_set_sign_bit_copies;
> -  ENUM_BITFIELD(machine_mode)last_set_mode : 8;
> +  ENUM_BITFIELD(machine_mode)last_set_mode : 
> RTX_MACHINE_MODE_BITSIZE;
>  
>/* Set nonzero if references to register n in expressions should not be
>   used.  last_set_invalid is set nonzero when this register is being
> @@ -235,7 +235,7 @@ struct reg_stat_type {
>   truncation if we know that value already contains a truncated
>   value.  */
>  
> -  ENUM_BITFIELD(machine_mode)truncated_to_mode : 8;
> +  ENUM_BITFIELD(machine_mode)truncated_to_mode : 
> RTX_MACHINE_MODE_BITSIZE;
>  };
>  
>  
> diff --git a/gcc/cse.cc b/gcc/cse.cc
> index b10c9b0c94d..fe594c1bc3d 100644
> --- a/gcc/cse.cc
> +++ b/gcc/cse.cc
> @@ -250,8 +250,8 @@ struct qty_table_elem
>unsigned int first_reg, last_reg;
>/* The sizes of these fields should match the sizes of the
>   code and mode fields of struct rtx_def (see rtl.h).  */

The comment can be removed, since you're now adding macros to ensure
this (thanks).  Same for other instances of the comment.

> -  ENUM_BITFIELD(rtx_code) comparison_code : 16;
> -  ENUM_BITFIELD(machine_mode) mode : 8;
> +  ENUM_BITFIELD(rtx_code) comparison_code : RTX_CODE_BITSIZE;
> +  ENUM_BITFIELD(machine_mode) mode : RTX_MACHINE_MODE_BITSIZE;

Please put the mode first, so that the 16-bit value is aligned
to 16 bits.
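
I.e. something like (sketch):

  ENUM_BITFIELD(machine_mode) mode : RTX_MACHINE_MODE_BITSIZE;
  ENUM_BITFIELD(rtx_code) comparison_code : RTX_CODE_BITSIZE;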

>  };
>  
>  /* The table of all qtys, indexed by qty number.  */
> @@ -406,7 +406,7 @@ struct table_elt
>int regcost;
>/* The size of this field should match the size
>   of the mode field of struct rtx_def (see rtl.h).  */
> -  ENUM_BITFIELD(machine_mode) mode : 8;
> +  ENUM_BITFIELD(machine_mode) mode : RTX_MACHINE_MODE_BITSIZE;
>char in_memory;
>char is_const;
>char flag;
> @@ -4155,7 +4155,7 @@ struct set
>/* Original machine mode, in case it becomes a CONST_INT.
>   The size of this field should match the size of the mode
>   

Re: [PATCH] [PR96339] AArch64: Optimise svlast[ab]

2023-05-11 Thread Richard Sandiford via Gcc-patches
Tejas Belagod  writes:
> From: Tejas Belagod 
>
>   This PR optimizes an SVE intrinsics sequence where
> svlasta (svptrue_pat_b8 (SV_VL1), x)
>   a scalar is selected based on a constant predicate and a variable vector.
>   This sequence is optimized to return the correspoding element of a NEON
>   vector. For eg.
> svlasta (svptrue_pat_b8 (SV_VL1), x)
>   returns
> umovw0, v0.b[1]
>   Likewise,
> svlastb (svptrue_pat_b8 (SV_VL1), x)
>   returns
>  umovw0, v0.b[0]
>   This optimization only works provided the constant predicate maps to a range
>   that is within the bounds of a 128-bit NEON register.
>
> gcc/ChangeLog:
>
>   PR target/96339
>   * config/aarch64/aarch64-sve-builtins-base.cc (svlast_impl::fold): Fold 
> sve
>   calls that have a constant input predicate vector.
>   (svlast_impl::is_lasta): Query to check if intrinsic is svlasta.
>   (svlast_impl::is_lastb): Query to check if intrinsic is svlastb.
>   (svlast_impl::vect_all_same): Check if all vector elements are equal.
>
> gcc/testsuite/ChangeLog:
>
>   PR target/96339
>   * gcc.target/aarch64/sve/acle/general-c/svlast.c: New.
>   * gcc.target/aarch64/sve/acle/general-c/svlast128_run.c: New.
>   * gcc.target/aarch64/sve/acle/general-c/svlast256_run.c: New.
>   * gcc.target/aarch64/sve/pcs/return_4.c (caller_bf16): Fix asm
>   to expect optimized code for function body.
>   * gcc.target/aarch64/sve/pcs/return_4_128.c (caller_bf16): Likewise.
>   * gcc.target/aarch64/sve/pcs/return_4_256.c (caller_bf16): Likewise.
>   * gcc.target/aarch64/sve/pcs/return_4_512.c (caller_bf16): Likewise.
>   * gcc.target/aarch64/sve/pcs/return_4_1024.c (caller_bf16): Likewise.
>   * gcc.target/aarch64/sve/pcs/return_4_2048.c (caller_bf16): Likewise.
>   * gcc.target/aarch64/sve/pcs/return_5.c (caller_bf16): Likewise.
>   * gcc.target/aarch64/sve/pcs/return_5_128.c (caller_bf16): Likewise.
>   * gcc.target/aarch64/sve/pcs/return_5_256.c (caller_bf16): Likewise.
>   * gcc.target/aarch64/sve/pcs/return_5_512.c (caller_bf16): Likewise.
>   * gcc.target/aarch64/sve/pcs/return_5_1024.c (caller_bf16): Likewise.
>   * gcc.target/aarch64/sve/pcs/return_5_2048.c (caller_bf16): Likewise.
> ---
>  .../aarch64/aarch64-sve-builtins-base.cc  | 124 +++
>  .../aarch64/sve/acle/general-c/svlast.c   |  63 
>  .../sve/acle/general-c/svlast128_run.c| 313 +
>  .../sve/acle/general-c/svlast256_run.c| 314 ++
>  .../gcc.target/aarch64/sve/pcs/return_4.c |   2 -
>  .../aarch64/sve/pcs/return_4_1024.c   |   2 -
>  .../gcc.target/aarch64/sve/pcs/return_4_128.c |   2 -
>  .../aarch64/sve/pcs/return_4_2048.c   |   2 -
>  .../gcc.target/aarch64/sve/pcs/return_4_256.c |   2 -
>  .../gcc.target/aarch64/sve/pcs/return_4_512.c |   2 -
>  .../gcc.target/aarch64/sve/pcs/return_5.c |   2 -
>  .../aarch64/sve/pcs/return_5_1024.c   |   2 -
>  .../gcc.target/aarch64/sve/pcs/return_5_128.c |   2 -
>  .../aarch64/sve/pcs/return_5_2048.c   |   2 -
>  .../gcc.target/aarch64/sve/pcs/return_5_256.c |   2 -
>  .../gcc.target/aarch64/sve/pcs/return_5_512.c |   2 -
>  16 files changed, 814 insertions(+), 24 deletions(-)
>  create mode 100644 
> gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/svlast.c
>  create mode 100644 
> gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/svlast128_run.c
>  create mode 100644 
> gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/svlast256_run.c
>
> diff --git a/gcc/config/aarch64/aarch64-sve-builtins-base.cc 
> b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
> index cd9cace3c9b..db2b4dcaac9 100644
> --- a/gcc/config/aarch64/aarch64-sve-builtins-base.cc
> +++ b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
> @@ -1056,6 +1056,130 @@ class svlast_impl : public quiet<function_base>
>  public:
>CONSTEXPR svlast_impl (int unspec) : m_unspec (unspec) {}
>  
> +  bool is_lasta () const { return m_unspec == UNSPEC_LASTA; }
> +  bool is_lastb () const { return m_unspec == UNSPEC_LASTB; }
> +
> +  bool vect_all_same (tree v , int step) const

Nit: stray space after "v".

> +  {
> +int i;
> +int nelts = vector_cst_encoded_nelts (v);
> +int first_el = 0;
> +
> +for (i = first_el; i < nelts; i += step)
> +  if (VECTOR_CST_ENCODED_ELT (v, i) != VECTOR_CST_ENCODED_ELT (v, 
> first_el))

I think this should use !operand_equal_p (..., ..., 0).

> + return false;
> +
> +return true;
> +  }
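
A sketch of the loop with the operand_equal_p form suggested above
(illustrative only):

  for (i = first_el; i < nelts; i += step)
    if (!operand_equal_p (VECTOR_CST_ENCODED_ELT (v, i),
                          VECTOR_CST_ENCODED_ELT (v, first_el), 0))
      return false;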
> +
> +  /* Fold a svlast{a/b} call with constant predicate to a BIT_FIELD_REF.
> + BIT_FIELD_REF lowers to a NEON element extract, so we have to make sure
> + the index of the element being accessed is in the range of a NEON vector
> + width.  */

s/NEON/Advanced SIMD/.  Same in later comments

> +  gimple *fold (gimple_folder & f) const override
> +  {
> +tree pred = gimple_call_arg (f.call, 0);
> +tree val = gimple_call_arg (f.call, 1);
> 

Re: [aarch64] Code-gen for vector initialization involving constants

2023-05-11 Thread Richard Sandiford via Gcc-patches
Prathamesh Kulkarni  writes:

> On Tue, 2 May 2023 at 18:22, Richard Sandiford
>  wrote:
>>
>> Prathamesh Kulkarni  writes:
>> > On Tue, 2 May 2023 at 17:32, Richard Sandiford
>> >  wrote:
>> >>
>> >> Prathamesh Kulkarni  writes:
>> >> > On Tue, 2 May 2023 at 14:56, Richard Sandiford
>> >> >  wrote:
>> >> >> > [aarch64] Improve code-gen for vector initialization with single 
>> >> >> > constant element.
>> >> >> >
>> >> >> > gcc/ChangeLog:
>> >> >> >   * config/aarch64/aarc64.cc (aarch64_expand_vector_init): Tweak 
>> >> >> > condition
>> >> >> >   if (n_var == n_elts && n_elts <= 16) to allow a single 
>> >> >> > constant,
>> >> >> >   and if maxv == 1, use constant element for duplicating into 
>> >> >> > register.
>> >> >> >
>> >> >> > gcc/testsuite/ChangeLog:
>> >> >> >   * gcc.target/aarch64/vec-init-single-const.c: New test.
>> >> >> >
>> >> >> > diff --git a/gcc/config/aarch64/aarch64.cc 
>> >> >> > b/gcc/config/aarch64/aarch64.cc
>> >> >> > index 2b0de7ca038..f46750133a6 100644
>> >> >> > --- a/gcc/config/aarch64/aarch64.cc
>> >> >> > +++ b/gcc/config/aarch64/aarch64.cc
>> >> >> > @@ -22167,7 +22167,7 @@ aarch64_expand_vector_init (rtx target, rtx 
>> >> >> > vals)
>> >> >> >   and matches[X][1] with the count of duplicate elements (if X 
>> >> >> > is the
>> >> >> >   earliest element which has duplicates).  */
>> >> >> >
>> >> >> > -  if (n_var == n_elts && n_elts <= 16)
>> >> >> > +  if ((n_var >= n_elts - 1) && n_elts <= 16)
>> >> >> >  {
>> >> >> >int matches[16][2] = {0};
>> >> >> >for (int i = 0; i < n_elts; i++)
>> >> >> > @@ -7,6 +7,18 @@ aarch64_expand_vector_init (rtx target, rtx 
>> >> >> > vals)
>> >> >> >vector register.  For big-endian we want that position to 
>> >> >> > hold
>> >> >> >the last element of VALS.  */
>> >> >> > maxelement = BYTES_BIG_ENDIAN ? n_elts - 1 : 0;
>> >> >> > +
>> >> >> > +   /* If we have a single constant element, use that for 
>> >> >> > duplicating
>> >> >> > +  instead.  */
>> >> >> > +   if (n_var == n_elts - 1)
>> >> >> > + for (int i = 0; i < n_elts; i++)
>> >> >> > +   if (CONST_INT_P (XVECEXP (vals, 0, i))
>> >> >> > +   || CONST_DOUBLE_P (XVECEXP (vals, 0, i)))
>> >> >> > + {
>> >> >> > +   maxelement = i;
>> >> >> > +   break;
>> >> >> > + }
>> >> >> > +
>> >> >> > rtx x = force_reg (inner_mode, XVECEXP (vals, 0, 
>> >> >> > maxelement));
>> >> >> > aarch64_emit_move (target, lowpart_subreg (mode, x, 
>> >> >> > inner_mode));
>> >> >>
>> >> >> We don't want to force the constant into a register though.
>> >> > OK right, sorry.
>> >> > With the attached patch, for the following test-case:
>> >> > int64x2_t f_s64(int64_t x)
>> >> > {
>> >> >   return (int64x2_t) { x, 1 };
>> >> > }
>> >> >
>> >> > it loads constant from memory (same code-gen as without patch).
>> >> > f_s64:
>> >> > adrp    x1, .LC0
>> >> > ldr q0, [x1, #:lo12:.LC0]
>> >> > ins v0.d[0], x0
>> >> > ret
>> >> >
>> >> > Does the patch look OK ?
>> >> >
>> >> > Thanks,
>> >> > Prathamesh
>> >> > [...]
>> >> > [aarch64] Improve code-gen for vector initialization with single 
>> >> > constant element.
>> >> >
>> >> > gcc/ChangeLog:
>> >> >   * config/aarch64/aarc64.cc (aarch64_expand_vector_init): Tweak 
>> >> > condition
>> >> >   if (n_var == n_elts && n_elts <= 16) to allow a single constant,
>> >> >   and if maxv == 1, use constant element for duplicating into 
>> >> > register.
>> >> >
>> >> > gcc/testsuite/ChangeLog:
>> >> >   * gcc.target/aarch64/vec-init-single-const.c: New test.
>> >> >
>> >> > diff --git a/gcc/config/aarch64/aarch64.cc 
>> >> > b/gcc/config/aarch64/aarch64.cc
>> >> > index 2b0de7ca038..97309ddec4f 100644
>> >> > --- a/gcc/config/aarch64/aarch64.cc
>> >> > +++ b/gcc/config/aarch64/aarch64.cc
>> >> > @@ -22167,7 +22167,7 @@ aarch64_expand_vector_init (rtx target, rtx 
>> >> > vals)
>> >> >   and matches[X][1] with the count of duplicate elements (if X is 
>> >> > the
>> >> >   earliest element which has duplicates).  */
>> >> >
>> >> > -  if (n_var == n_elts && n_elts <= 16)
>> >> > +  if ((n_var >= n_elts - 1) && n_elts <= 16)
>> >>
>> >> No need for the extra brackets.
>> > Adjusted, thanks. Sorry if this sounds like a silly question, but why
>> > do we need the n_elts <= 16 check ?
>> > Won't n_elts be always <= 16 since max number of elements in a vector
>> > would be 16 for V16QI ?
>>
>> Was wondering the same thing :)
>>
>> Let's leave it though.
>>
>> >> >  {
>> >> >int matches[16][2] = {0};
>> >> >for (int i = 0; i < n_elts; i++)
>> >> > @@ -7,8 +7,26 @@ aarch64_expand_vector_init (rtx target, rtx 
>> >> > vals)
>> >> >vector register.  For big-endian we want that position to 
>> >> > hold
>> >> >the last element of VALS.  */
>> >> > maxelement 

Re: [aarch64] Use dup and zip1 for interleaving elements in initializing vector

2023-05-11 Thread Richard Sandiford via Gcc-patches
Prathamesh Kulkarni  writes:
> diff --git a/gcc/testsuite/gcc.target/aarch64/vec-init-18.c 
> b/gcc/testsuite/gcc.target/aarch64/vec-init-18.c
> new file mode 100644
> index 000..598a51f17c6
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/vec-init-18.c
> @@ -0,0 +1,20 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3" } */
> +
> +#include 
> +
> +int16x8_t foo(int16_t x, int y)
> +{
> +  int16x8_t v = (int16x8_t) {x, y, x, y, x, y, x, y}; 
> +  return v;
> +}
> +
> +int16x8_t foo2(int16_t x) 
> +{
> +  int16x8_t v = (int16x8_t) {x, 1, x, 1, x, 1, x, 1}; 
> +  return v;
> +}
> +
> +/* { dg-final { scan-assembler-times {\tdup\tv[0-9]+\.4h, w[0-9]+} 3 } } */
> +/* { dg-final { scan-assembler {\tmovi\tv[0-9]+\.4h, 0x1} } } */
> +/* { dg-final { scan-assembler {\tzip1\tv[0-9]+\.8h, v[0-9]+\.8h, 
> v[0-9]+\.8h} } } */

Would be good to make this a scan-assembler-times ... 2.
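
i.e. something like (sketch of the adjusted directive):

  /* { dg-final { scan-assembler-times {\tzip1\tv[0-9]+\.8h, v[0-9]+\.8h, v[0-9]+\.8h} 2 } } */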

OK with that change.  Thanks for doing this.

Richard


Re: [libgcc PATCH] Add bit reversal functions __bitrev[qhsd]i2.

2023-05-11 Thread Richard Sandiford via Gcc-patches
"Roger Sayle"  writes:
> This patch proposes adding run-time library support for bit reversal,
> by adding a __bitrevsi2 function to libgcc.  Thoughts/opinions?
>
> I'm also tempted to add __popcount[qh]i2 and __parity[qh]i2 to libgcc,
> to allow the RTL optimizers to perform narrowing operations, but I'm
> curious to hear whether QImode and HImode support, though more efficient,
> is frowned by the libgcc maintainers/philosophy.

I don't think RTL optimisers should be in the business of generating new
libcalls.  Wouldn't it have to be done in gimple and/or during expand?

> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32} and
> on nvptx-none, with no new regressions.  Ok for mainline?
>
>
> 2023-05-06  Roger Sayle  
>
> gcc/ChangeLog
> * doc/libgcc.texi (__bitrevqi2): Document bit reversal run-time
> functions; __bitrevqi2, __bitrevhi2, __bitrevsi2 and __bitrevdi2.
>
> libgcc/ChangeLog
> * Makefile.in (lib2funcs): Add __bitrev[qhsd]i2.
> * libgcc-std.ver.in (GCC_14.0.0): Add __bitrev[qhsd]i2.
> * libgcc2.c (__bitrevqi2): New function.
> (__bitrevhi2): Likewise.
> (__bitrevsi2): Likewise.
> (__bitrevdi2): Likewise.
> * libgcc2.h (__bitrevqi2): Prototype here.
> (__bitrevhi2): Likewise.
> (__bitrevsi2): Likewise.
> (__bitrevdi2): Likewise.
>
> Thanks in advance,
> Roger
> --
>
> diff --git a/gcc/doc/libgcc.texi b/gcc/doc/libgcc.texi
> index 73aa803..7611347 100644
> --- a/gcc/doc/libgcc.texi
> +++ b/gcc/doc/libgcc.texi
> @@ -218,6 +218,13 @@ These functions return the number of bits set in @var{a}.
>  These functions return the @var{a} byteswapped.
>  @end deftypefn
>  
> +@deftypefn {Runtime Function} int8_t __bitrevqi2 (int8_t @var{a})
> +@deftypefnx {Runtime Function} int16_t __bitrevhi2 (int16_t @var{a})
> +@deftypefnx {Runtime Function} int32_t __bitrevsi2 (int32_t @var{a})
> +@deftypefnx {Runtime Function} int64_t __bitrevdi2 (int64_t @var{a})
> +These functions return the bit reversed @var{a}.
> +@end deftypefn
> +
>  @node Soft float library routines
>  @section Routines for floating point emulation
>  @cindex soft float library
> diff --git a/libgcc/Makefile.in b/libgcc/Makefile.in
> index 6c4dc79..67c54df 100644
> --- a/libgcc/Makefile.in
> +++ b/libgcc/Makefile.in
> @@ -446,7 +446,7 @@ lib2funcs = _muldi3 _negdi2 _lshrdi3 _ashldi3 _ashrdi3 
> _cmpdi2 _ucmpdi2  \
>   _paritysi2 _paritydi2 _powisf2 _powidf2 _powixf2 _powitf2  \
>   _mulhc3 _mulsc3 _muldc3 _mulxc3 _multc3 _divhc3 _divsc3\
>   _divdc3 _divxc3 _divtc3 _bswapsi2 _bswapdi2 _clrsbsi2  \
> - _clrsbdi2
> + _clrsbdi2 _bitrevqi2 _bitrevhi2 _bitrevsi2 _bitrevdi2
>  
>  # The floating-point conversion routines that involve a single-word integer.
>  # XX stands for the integer mode.
> diff --git a/libgcc/libgcc-std.ver.in b/libgcc/libgcc-std.ver.in
> index c4f87a5..2198b0e 100644
> --- a/libgcc/libgcc-std.ver.in
> +++ b/libgcc/libgcc-std.ver.in
> @@ -1944,3 +1944,12 @@ GCC_7.0.0 {
>__PFX__divmoddi4
>__PFX__divmodti4
>  }
> +
> +%inherit GCC_14.0.0 GCC_7.0.0
> +GCC_14.0.0 {
> +  # bit reversal functions
> +  __PFX__bitrevqi2
> +  __PFX__bitrevhi2
> +  __PFX__bitrevsi2
> +  __PFX__bitrevdi2
> +}
> diff --git a/libgcc/libgcc2.c b/libgcc/libgcc2.c
> index e0017d1..2bef2a1 100644
> --- a/libgcc/libgcc2.c
> +++ b/libgcc/libgcc2.c
> @@ -488,6 +488,54 @@ __bswapdi2 (DItype u)
> | (((u) & 0x00000000000000ffull) << 56));
>  }
>  #endif
> +
> +#ifdef L_bitrevqi2
> +QItype
> +__bitrevqi2 (QItype x)
> +{
> +  UQItype u = x;
> +  u = (((u) >> 1) & 0x55) | (((u) & 0x55) << 1);
> +  u = (((u) >> 2) & 0x33) | (((u) & 0x33) << 2);
> +  return ((u) >> 4) | ((u) << 4);
> +}
> +#endif
> +#ifdef L_bitrevhi2
> +HItype
> +__bitrevhi2 (HItype x)
> +{
> +  UHItype u = x;
> +  u = (((u) >> 1) & 0x5555) | (((u) & 0x5555) << 1);
> +  u = (((u) >> 2) & 0x3333) | (((u) & 0x3333) << 2);
> +  u = (((u) >> 4) & 0x0f0f) | (((u) & 0x0f0f) << 4);
> +  return ((u) >> 8) | ((u) << 8);
> +}
> +#endif
> +#ifdef L_bitrevsi2
> +SItype
> +__bitrevsi2 (SItype x)
> +{
> +  USItype u = x;
> +  u = (((u) >> 1) & 0x55555555) | (((u) & 0x55555555) << 1);
> +  u = (((u) >> 2) & 0x33333333) | (((u) & 0x33333333) << 2);
> +  u = (((u) >> 4) & 0x0f0f0f0f) | (((u) & 0x0f0f0f0f) << 4);
> +  return __bswapsi2 (u);

Would it be better to use __builtin_bswap32 here, so that targets
with bswap but not bitreverse still optimise the bswap part?
Same for the DI version.

Not sure how portable all this is, but the underlying assumptions
seem to be the same as for bswap.
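
For illustration, a standalone sketch of the SImode variant being
suggested (not the patch itself):

  #include <stdint.h>

  /* Reverse the bits within each byte by hand, then let the bswap
     builtin reverse the bytes, so targets with a bswap pattern but no
     bit-reverse instruction still optimise that part.  */
  uint32_t
  bitrev32_sketch (uint32_t u)
  {
    u = ((u >> 1) & 0x55555555u) | ((u & 0x55555555u) << 1);
    u = ((u >> 2) & 0x33333333u) | ((u & 0x33333333u) << 2);
    u = ((u >> 4) & 0x0f0f0f0fu) | ((u & 0x0f0f0f0fu) << 4);
    return __builtin_bswap32 (u);
  }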

Looks OK to me otherwise, but it should wait until something needs it
(and can test it).

Thanks,
Richard

> +}
> +#endif
> +#ifdef L_bitrevdi2
> +DItype
> +__bitrevdi2 (DItype x)
> +{
> +  UDItype u = x;
> +  u = (((u) >> 1) & 0x5555555555555555ll)
> +  | (((u) & 

Re: [PATCH] Add RTX codes for BITREVERSE and COPYSIGN.

2023-05-11 Thread Richard Sandiford via Gcc-patches
"Roger Sayle"  writes:
> An analysis of backend UNSPECs reveals that two of the most common UNSPECs
> across target backends are for copysign and bit reversal.  This patch
> adds RTX codes for these expressions to allow their representation to
> be standardized, and to allow them to be optimized by the middle-end RTL optimizers.
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-32} with
> no new failures.  Ok for mainline?
>
>
> 2023-05-06  Roger Sayle  
>
> gcc/ChangeLog
> * doc/rtl.texi (bitreverse, copysign): Document new RTX codes.
> * rtl.def (BITREVERSE, COPYSIGN): Define new RTX codes.
> * simplify-rtx.cc (simplify_unary_operation_1): Optimize
> NOT (BITREVERSE x) as BITREVERSE (NOT x).
> Optimize POPCOUNT (BITREVERSE x) as POPCOUNT x.
> Optimize PARITY (BITREVERSE x) as PARITY x.
> Optimize BITREVERSE (BITREVERSE x) as x.
> (simplify_const_unary_operation) : Evaluate
> BITREVERSE of a constant integer at compile-time.
> (simplify_binary_operation_1) :  Optimize
> COPY_SIGN (x, x) as x.  Optimize COPYSIGN (x, C) as ABS x
> or NEG (ABS x) for constant C.  Optimize COPYSIGN (ABS x, y)
> and COPYSIGN (NEG x, y) as COPYSIGN (x, y).  Optimize
> COPYSIGN (x, ABS y) as ABS x.
> Optimize COPYSIGN (COPYSIGN (x, y), z) as COPYSIGN (x, z).
> Optimize COPYSIGN (x, COPYSIGN (y, z)) as COPYSIGN (x, z).
> (simplify_const_binary_operation): Evaluate COPYSIGN of constant
> arguments at compile-time.
> * wide-int.cc (wide_int_storage::bitreverse): Provide a
> wide_int implementation, based upon bswap implementation.
> * wide-int.h (wide_int_storage::bitreverse): Prototype here.
>
>
> Thanks in advance,
> Roger
> --
>
> diff --git a/gcc/doc/rtl.texi b/gcc/doc/rtl.texi
> index 1de2494..76aeafb 100644
> --- a/gcc/doc/rtl.texi
> +++ b/gcc/doc/rtl.texi
> @@ -2742,6 +2742,17 @@ integer of mode @var{m}.  The mode of @var{x} must be 
> @var{m} or
>  Represents the value @var{x} with the order of bytes reversed, carried out
>  in mode @var{m}, which must be a fixed-point machine mode.
>  The mode of @var{x} must be @var{m} or @code{VOIDmode}.
> +
> +@findex bitreverse
> +@item (bitreverse:@var{m} @var{x})
> +Represents the value @var{x} with the order of bits reversed, carried out
> +in mode @var{m}, which must be a fixed-point machine mode.
> +The mode of @var{x} must be @var{m} or @code{VOIDmode}.
> +
> +@findex copysign
> +@item (copysign:@var{m} @var{x} @var{y})
> +Represents the value @var{x} with the sign of @var{y}.
> +Both @var{x} and @var{y} must have floating point machine mode @var{m}.
>  @end table
>  
>  @node Comparisons
> diff --git a/gcc/rtl.def b/gcc/rtl.def
> index 6ddbce3..88e2b19 100644
> --- a/gcc/rtl.def
> +++ b/gcc/rtl.def
> @@ -664,6 +664,9 @@ DEF_RTL_EXPR(POPCOUNT, "popcount", "e", RTX_UNARY)
>  /* Population parity (number of 1 bits modulo 2).  */
>  DEF_RTL_EXPR(PARITY, "parity", "e", RTX_UNARY)
>  
> +/* Reverse bits.  */
> +DEF_RTL_EXPR(BITREVERSE, "bitreverse", "e", RTX_UNARY)
> +
>  /* Reference to a signed bit-field of specified size and position.
> Operand 0 is the memory unit (usually SImode or QImode) which
> contains the field's first bit.  Operand 1 is the width, in bits.
> @@ -753,6 +756,9 @@ DEF_RTL_EXPR(US_TRUNCATE, "us_truncate", "e", RTX_UNARY)
>  /* Floating point multiply/add combined instruction.  */
>  DEF_RTL_EXPR(FMA, "fma", "eee", RTX_TERNARY)
>  
> +/* Floating point copysign.  Operand 0 with the sign of operand 1.  */
> +DEF_RTL_EXPR(COPYSIGN, "copysign", "ee", RTX_BIN_ARITH)
> +
>  /* Information about the variable and its location.  */
>  DEF_RTL_EXPR(VAR_LOCATION, "var_location", "te", RTX_EXTRA)
>  
> diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc
> index d4aeebc..26fa2b9 100644
> --- a/gcc/simplify-rtx.cc
> +++ b/gcc/simplify-rtx.cc
> @@ -1040,10 +1040,10 @@ simplify_context::simplify_unary_operation_1 
> (rtx_code code, machine_mode mode,
>   }
>  
>/* (not (bswap x)) -> (bswap (not x)).  */
> -  if (GET_CODE (op) == BSWAP)
> +  if (GET_CODE (op) == BSWAP || GET_CODE (op) == BITREVERSE)
>   {
> rtx x = simplify_gen_unary (NOT, mode, XEXP (op, 0), mode);
> -   return simplify_gen_unary (BSWAP, mode, x, mode);
> +   return simplify_gen_unary (GET_CODE (op), mode, x, mode);
>   }
>break;
>  
> @@ -1419,6 +1419,7 @@ simplify_context::simplify_unary_operation_1 (rtx_code 
> code, machine_mode mode,
>switch (GET_CODE (op))
>   {
>   case BSWAP:
> + case BITREVERSE:
> /* (popcount (bswap )) = (popcount ).  */
> return simplify_gen_unary (POPCOUNT, mode, XEXP (op, 0),
>GET_MODE (XEXP (op, 0)));
> @@ -1448,6 +1449,7 @@ simplify_context::simplify_unary_operation_1 (rtx_code 
> code, machine_mode mode,

Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-11 Thread Richard Sandiford via Gcc-patches
"juzhe.zh...@rivai.ai"  writes:
> Thanks. I have read rgroup descriptions again.
> I still don't fully understand it, so bear with me :)
>
> I don't know how to differentiate Case 2 and Case 3.
>
> Case 2 is multiple rgroup for SLP.
> Case 3 is multiple rgroup for non-SLP (VEC_PACK_TRUNC)
>
> Is it correct:
> case 2: rgc->max_nscalars_per_iter != 1

Yes.

> Case 3 : rgc->max_nscalars_per_iter == 1 but rgc->factor != 1?

For case 3 it's:

rgc->max_nscalars_per_iter == 1 && rgc != &LOOP_VINFO_LENS (loop_vinfo)[0]

rgc->factor is controlled by the target and just says what units
IFN_LOAD_LEN works in.  E.g. if we're loading 16-bit elements,
but the underlying instruction measures bytes, the factor would be 2.
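
A hypothetical worked example (numbers purely illustrative):

  16-bit elements, instruction measures bytes  =>  factor == 2
  8 elements processed in one iteration        =>  length passed to
                                                   IFN_LOAD_LEN is 8 * 2 = 16 bytes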

Thanks,
Richard


[PATCH] aarch64: Remove alignment assertions [PR109661]

2023-05-11 Thread Richard Sandiford via Gcc-patches
The trunk patch for this PR corrected the ABI for enums that have
a defined underlying type.  We shouldn't change the ABI on the branches
though, so this patch just removes the assertions that highlighted
the problem.

I think the same approach makes sense longer-term: keep the assertions
at maximum strength in trunk, and in any new branches that get cut.
Then, if the assertions trip an ABI problem, fix the problem in trunk
and remove the assertions from active branches.

The tests are the same as for the trunk version, but with all Wpsabi
message and expected output checks removed.

Tested on aarch64-linux-gnu & pushed to GCC 13.  I'll do a similar
patch for GCC 12.

Richard


gcc/
PR target/109661
* config/aarch64/aarch64.cc (aarch64_function_arg_alignment): Remove
assertion.
(aarch64_layout_arg): Likewise.

gcc/testsuite/
PR target/109661
* g++.target/aarch64/pr109661-1.C: New test.
* g++.target/aarch64/pr109661-2.C: Likewise.
* g++.target/aarch64/pr109661-3.C: Likewise.
* g++.target/aarch64/pr109661-4.C: Likewise.
* gcc.target/aarch64/pr109661-1.c: Likewise.
---
 gcc/config/aarch64/aarch64.cc |   5 -
 gcc/testsuite/g++.target/aarch64/pr109661-1.C | 122 +
 gcc/testsuite/g++.target/aarch64/pr109661-2.C | 123 ++
 gcc/testsuite/g++.target/aarch64/pr109661-3.C | 123 ++
 gcc/testsuite/g++.target/aarch64/pr109661-4.C | 123 ++
 gcc/testsuite/gcc.target/aarch64/pr109661-1.c |   5 +
 6 files changed, 496 insertions(+), 5 deletions(-)
 create mode 100644 gcc/testsuite/g++.target/aarch64/pr109661-1.C
 create mode 100644 gcc/testsuite/g++.target/aarch64/pr109661-2.C
 create mode 100644 gcc/testsuite/g++.target/aarch64/pr109661-3.C
 create mode 100644 gcc/testsuite/g++.target/aarch64/pr109661-4.C
 create mode 100644 gcc/testsuite/gcc.target/aarch64/pr109661-1.c

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 0f04ab9fba0..f5db5379543 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -7495,7 +7495,6 @@ aarch64_function_arg_alignment (machine_mode mode, 
const_tree type,
  gcc_assert (known_eq (POINTER_SIZE, GET_MODE_BITSIZE (mode)));
  return POINTER_SIZE;
}
-  gcc_assert (!TYPE_USER_ALIGN (type));
   return TYPE_ALIGN (type);
 }
 
@@ -7714,10 +7713,6 @@ aarch64_layout_arg (cumulative_args_t pcum_v, const 
function_arg_info )
 = aarch64_function_arg_alignment (mode, type, _break,
  _break_packed);
 
-  gcc_assert ((allocate_nvrn || alignment <= 16 * BITS_PER_UNIT)
- && (!alignment || abi_break < alignment)
- && (!abi_break_packed || alignment < abi_break_packed));
-
   /* allocate_ncrn may be false-positive, but allocate_nvrn is quite reliable.
  The following code thus handles passing by SIMD/FP registers first.  */
 
diff --git a/gcc/testsuite/g++.target/aarch64/pr109661-1.C 
b/gcc/testsuite/g++.target/aarch64/pr109661-1.C
new file mode 100644
index 000..c579834358b
--- /dev/null
+++ b/gcc/testsuite/g++.target/aarch64/pr109661-1.C
@@ -0,0 +1,122 @@
+/* { dg-options "-O2 -Wpsabi" } */
+
+#include 
+
+#define ALIGN
+
+typedef __uint128_t u128_4 __attribute__((aligned(4)));
+typedef __uint128_t u128_8 __attribute__((aligned(8)));
+typedef __uint128_t u128_16 __attribute__((aligned(16)));
+typedef __uint128_t u128_32 __attribute__((aligned(32)));
+typedef __uint128_t u128;
+
+typedef __UINT64_TYPE__ u64_4 __attribute__((aligned(4)));
+typedef __UINT64_TYPE__ u64_8 __attribute__((aligned(8)));
+typedef __UINT64_TYPE__ u64_16 __attribute__((aligned(16)));
+typedef __UINT64_TYPE__ u64_32 __attribute__((aligned(32)));
+typedef __UINT64_TYPE__ u64;
+
+enum class ALIGN e128_4 : u128_4 { A };
+enum class ALIGN e128_8 : u128_8 { A };
+enum class ALIGN e128_16 : u128_16 { A };
+enum class ALIGN e128_32 : u128_32 { A };
+enum class ALIGN e128 : u128 { A };
+
+enum class ALIGN e64_4 : u64_4 { A };
+enum class ALIGN e64_8 : u64_8 { A };
+enum class ALIGN e64_16 : u64_16 { A };
+enum class ALIGN e64_32 : u64_32 { A };
+enum class ALIGN e64 : u64 { A };
+
+extern "C" {
+
+e128_4 reg_e128_4 (int x, e128_4 y) { return y; }
+
+e128_8 reg_e128_8 (int x, e128_8 y) { return y; }
+
+e128_16 reg_e128_16 (int x, e128_16 y) { return y; }
+
+e128_32 reg_e128_32 (int x, e128_32 y) { return y; }
+
+e128 reg_e128 (int x, e128 y) { return y; }
+
+e64_4 reg_e64_4 (int x, e64_4 y) { return y; }
+
+e64_8 reg_e64_8 (int x, e64_8 y) { return y; }
+
+e64_16 reg_e64_16 (int x, e64_16 y) { return y; }
+
+e64_32 reg_e64_32 (int x, e64_32 y) { return y; }
+
+e64 reg_e64 (int x, e64 y) { return y; }
+
+e128_4 stack_e128_4 (u128 x0, u128 x2, u128 x4, u128 x6, int x, e128_4 y) { 
return y; }
+
+e128_8 stack_e128_8 (u128 x0, u128 x2, u128 x4, u128 x6, int x, e128_8 y) { 
return y; }
+
+e128_16 stack_e128_16 (u128 x0, u128 x2, u128 

Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-11 Thread Richard Sandiford via Gcc-patches
"juzhe.zh...@rivai.ai"  writes:
> Oh, I see. But I saw there is a variable using_partial_vectors_p
> in the loop data structure.
>
> Can I add a variable called using_select_vl_p ?

Yeah.  Please also add a wrapper macro like
LOOP_VINFO_USING_PARTIAL_VECTORS_P.  (I'm not really a fan of the
wrappers, but it's better to be consistent.)

> Since it may increase the size of data structure, I am not sure whether it is 
> appropriate.

The structure is only temporary, and very few of them exist at
a given time.  Besides, there's already a layout hole on LP64 hosts
around those booleans (between slp_unrolling_factor and scalar_loop).
So the new boolean shouldn't grow the size of the structure.

We can convert the booleans to bitfields if size ever becomes a problem.
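
A minimal sketch of the shape being suggested (the structure, field and
macro names here are illustrative, not the final patch):

  struct loop_vec_info_sketch
  {
    /* True if we have decided to use SELECT_VL to control this loop.  */
    bool using_select_vl_p;
  };

  #define LOOP_VINFO_USING_SELECT_VL_P(L) ((L)->using_select_vl_p)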

Thanks,
Richard


Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-11 Thread Richard Sandiford via Gcc-patches
"juzhe.zh...@rivai.ai"  writes:
> Hi, Richard.  Since create_iv has been approved and soon will be committed 
> after
> we bootstrap && regression.
>
> Now, I plan to send patch for "decrement IV".
>
> After reading your comments, I have several questions:
>
> 1. 
>>if (use_bias_adjusted_len)
>>  return rgl->bias_adjusted_ctrl;
>> +  else if (direct_internal_fn_supported_p (IFN_SELECT_VL, iv_type,
>> +OPTIMIZE_FOR_SPEED))
>> +{
>> +  tree loop_len = rgl->controls[index];
>> +  poly_int64 nunits1 = TYPE_VECTOR_SUBPARTS (rgl->type);
>> +  poly_int64 nunits2 = TYPE_VECTOR_SUBPARTS (vectype);
>> +  if (maybe_ne (nunits1, nunits2))
>> + {
>> +   /* A loop len for data type X can be reused for data type Y
>> +  if X has N times more elements than Y and if Y's elements
>> +  are N times bigger than X's.  */
>> +   gcc_assert (multiple_p (nunits1, nunits2));
>> +   unsigned int factor = exact_div (nunits1, nunits2).to_constant ();
>> +   gimple_seq seq = NULL;
>> +   loop_len = gimple_build (&seq, RDIV_EXPR, iv_type, loop_len,
>> +build_int_cst (iv_type, factor));
>> +   if (seq)
>> + gsi_insert_seq_before (gsi, seq, GSI_SAME_STMT);
>> + }
>> +  return loop_len;
>> +}
>>else
>>  return rgl->controls[index];
>>  }
>
>>  ...here.  That is, the key isn't whether SELECT_VL is available,
>>  but instead whether we've decided to use it for this loop (unless
>>  I'm missing something).
>
> Let's me clarify it again:
>
> I do this here is for Case 2 SLP:
>
> Generate for len : _61 = _75 / 2;
> I think it is similar to ARM SVE using VIEW_CONVERT_EXPR to view_convert the 
> mask.
>
> You said we should not let SELECT_VL is available or not to decide it here.
> Could you teach me how to handle this code here? Should I add a target hook 
> like:
> TARGET_SLP_LOOP_LEN_RDIV_BY_FACTOR_P ?

No.  What I mean is: for each vectorised loop, we should make a decision,
in one place only, whether to use SELECT_VL-based control flow or
arithmetic-based control flow for that particular loop.  That decision
depends partly on direct_internal_fn_supported_p (a necessary but not
sufficient condition), partly on whether the loop contains SLP nodes, etc.
We should then record that decision in the loop_vec_info so that it is
available to whichever code needs it.

This is similar to LOOP_VINFO_USING_PARTIAL_VECTORS_P etc.
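
A hypothetical sketch of making and recording that decision (the macro
and the SLP predicate are placeholders, not the final patch):

  /* Decide once per loop, during analysis.  */
  LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo)
    = (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)
       && direct_internal_fn_supported_p (IFN_SELECT_VL, iv_type,
                                          OPTIMIZE_FOR_SPEED)
       && !loop_contains_slp_p /* placeholder for the SLP check */);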

Thanks,
Richard


Re: [PATCH 15/20] arm: [MVE intrinsics] add unary_acc shape

2023-05-11 Thread Richard Sandiford via Gcc-patches
Christophe Lyon  writes:
> On 5/11/23 10:30, Richard Sandiford wrote:
>> Christophe Lyon  writes:
>>> On 5/10/23 16:52, Kyrylo Tkachov wrote:


> -Original Message-
> From: Christophe Lyon 
> Sent: Wednesday, May 10, 2023 2:31 PM
> To: gcc-patches@gcc.gnu.org; Kyrylo Tkachov ;
> Richard Earnshaw ; Richard Sandiford
> 
> Cc: Christophe Lyon 
> Subject: [PATCH 15/20] arm: [MVE intrinsics] add unary_acc shape
>
> This patch adds the unary_acc shape description.
>
> 2022-10-25  Christophe Lyon  
>
>   gcc/
>   * config/arm/arm-mve-builtins-shapes.cc (unary_acc): New.
>   * config/arm/arm-mve-builtins-shapes.h (unary_acc): New.
> ---
>gcc/config/arm/arm-mve-builtins-shapes.cc | 28 +++
>gcc/config/arm/arm-mve-builtins-shapes.h  |  1 +
>2 files changed, 29 insertions(+)
>
> diff --git a/gcc/config/arm/arm-mve-builtins-shapes.cc 
> b/gcc/config/arm/arm-
> mve-builtins-shapes.cc
> index bff1c3e843b..e77a0cc20ac 100644
> --- a/gcc/config/arm/arm-mve-builtins-shapes.cc
> +++ b/gcc/config/arm/arm-mve-builtins-shapes.cc
> @@ -1066,6 +1066,34 @@ struct unary_def : public overloaded_base<0>
>};
>SHAPE (unary)
>
> +/* _t vfoo[_](_t)
> +
> +   i.e. a version of "unary" in which the source elements are half the
> +   size of the destination scalar, but have the same type class.
> +
> +   Example: vaddlvq.
> +   int64_t [__arm_]vaddlvq[_s32](int32x4_t a)
> +   int64_t [__arm_]vaddlvq_p[_s32](int32x4_t a, mve_pred16_t p) */
> +struct unary_acc_def : public overloaded_base<0>
> +{
> +  void
> +  build (function_builder , const function_group_info ,
> +  bool preserve_user_namespace) const override
> +  {
> +b.add_overloaded_functions (group, MODE_none,
> preserve_user_namespace);
> +build_all (b, "sw0,v0", group, MODE_none, preserve_user_namespace);
> +  }
> +
> +  tree
> +  resolve (function_resolver ) const override
> +  {
> +/* FIXME: check that the return value is actually
> +   twice as wide as arg 0.  */

 Any reason why we can't add that check now?
 I'd rather not add new FIXMEs here...
>>>
>>> I understand :-)
>>>
>>> That's because the resolver only knows about the arguments, not the
>>> return value:
>>> /* The arguments to the overloaded function.  */
>>> vec _arglist;
>>>
>>> I kept this like what already exists for AArch64/SVE, but we'll need to
>>> extend it to handle return values too, so that we can support all
>>> overloaded forms of vuninitialized
>>> (see https://gcc.gnu.org/pipermail/gcc-patches/2023-April/616003.html)
>>>
>>> I meant this extension to be a follow-up work when most intrinsics have
>>> been converted and the few remaining ones (eg. vuninitialized) needs an
>>> improved framework.  And that would enable to fix the FIXME.
>> 
>> We can't resolve based on the return type though.  It has to be
>> arguments only.  E.g.:
>> 
>> decltype(foo(a, b))
>> 
>> has to be well-defined, even though decltype (by design) provides no
>> context about "what the caller wants".
>> 
>
> So in fact we can probably get rid of (most of) the remaining 
> definitions of vuninitializedq in arm_mve.h, but not by looking at the 
> return type (re-reading this I'm wondering whether I overlooked this 
> when I started the series)
>
> But for things like vaddlvq, we can't check that the result is actually 
> written in a twice-as-large as the argument location?

No.  All we can/should do is to resolve the typeless builtin to a fully-typed
builtin, based on the argument types.  The return type of that fully-typed
builtin determines the type of the function call expression (the CALL_EXPR).
It's then up to the frontend to do semantic/type checking of the
resolved expression type.

In other words, information only flows in one direction:

  argument types -> function overloading -> function return type
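
For example (illustrative only, compiled for MVE), with the vaddlvq
overloads quoted above:

  #include <arm_mve.h>

  int64_t
  f (int32x4_t a, mve_pred16_t p)
  {
    /* The int32x4_t argument resolves the calls to vaddlvq_s32 and
       vaddlvq_p_s32; their int64_t return type then gives the type of
       each call expression.  Nothing flows back from the context.  */
    return vaddlvq (a) + vaddlvq_p (a, p);
  }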

Thanks,
Richard


Re: [PATCH 15/20] arm: [MVE intrinsics] add unary_acc shape

2023-05-11 Thread Richard Sandiford via Gcc-patches
Christophe Lyon  writes:
> On 5/10/23 16:52, Kyrylo Tkachov wrote:
>> 
>> 
>>> -Original Message-
>>> From: Christophe Lyon 
>>> Sent: Wednesday, May 10, 2023 2:31 PM
>>> To: gcc-patches@gcc.gnu.org; Kyrylo Tkachov ;
>>> Richard Earnshaw ; Richard Sandiford
>>> 
>>> Cc: Christophe Lyon 
>>> Subject: [PATCH 15/20] arm: [MVE intrinsics] add unary_acc shape
>>>
>>> This patch adds the unary_acc shape description.
>>>
>>> 2022-10-25  Christophe Lyon  
>>>
>>> gcc/
>>> * config/arm/arm-mve-builtins-shapes.cc (unary_acc): New.
>>> * config/arm/arm-mve-builtins-shapes.h (unary_acc): New.
>>> ---
>>>   gcc/config/arm/arm-mve-builtins-shapes.cc | 28 +++
>>>   gcc/config/arm/arm-mve-builtins-shapes.h  |  1 +
>>>   2 files changed, 29 insertions(+)
>>>
>>> diff --git a/gcc/config/arm/arm-mve-builtins-shapes.cc b/gcc/config/arm/arm-
>>> mve-builtins-shapes.cc
>>> index bff1c3e843b..e77a0cc20ac 100644
>>> --- a/gcc/config/arm/arm-mve-builtins-shapes.cc
>>> +++ b/gcc/config/arm/arm-mve-builtins-shapes.cc
>>> @@ -1066,6 +1066,34 @@ struct unary_def : public overloaded_base<0>
>>>   };
>>>   SHAPE (unary)
>>>
>>> +/* _t vfoo[_](_t)
>>> +
>>> +   i.e. a version of "unary" in which the source elements are half the
>>> +   size of the destination scalar, but have the same type class.
>>> +
>>> +   Example: vaddlvq.
>>> +   int64_t [__arm_]vaddlvq[_s32](int32x4_t a)
>>> +   int64_t [__arm_]vaddlvq_p[_s32](int32x4_t a, mve_pred16_t p) */
>>> +struct unary_acc_def : public overloaded_base<0>
>>> +{
>>> +  void
>>> +  build (function_builder , const function_group_info ,
>>> +bool preserve_user_namespace) const override
>>> +  {
>>> +b.add_overloaded_functions (group, MODE_none,
>>> preserve_user_namespace);
>>> +build_all (b, "sw0,v0", group, MODE_none, preserve_user_namespace);
>>> +  }
>>> +
>>> +  tree
>>> +  resolve (function_resolver ) const override
>>> +  {
>>> +/* FIXME: check that the return value is actually
>>> +   twice as wide as arg 0.  */
>> 
>> Any reason why we can't add that check now?
>> I'd rather not add new FIXMEs here...
>
> I understand :-)
>
> That's because the resolver only knows about the arguments, not the 
> return value:
>/* The arguments to the overloaded function.  */
>vec _arglist;
>
> I kept this like what already exists for AArch64/SVE, but we'll need to 
> extend it to handle return values too, so that we can support all 
> overloaded forms of vuninitialized
> (see https://gcc.gnu.org/pipermail/gcc-patches/2023-April/616003.html)
>
> I meant this extension to be a follow-up work when most intrinsics have 
> been converted and the few remaining ones (eg. vuninitialized) needs an 
> improved framework.  And that would enable to fix the FIXME.

We can't resolve based on the return type though.  It has to be
arguments only.  E.g.:

   decltype(foo(a, b))

has to be well-defined, even though decltype (by design) provides no
context about "what the caller wants".

Thanks,
Richard


Re: [PATCH V5] VECT: Add tree_code into "creat_iv" and allow it can handle MINUS_EXPR IV.

2023-05-11 Thread Richard Sandiford via Gcc-patches
juzhe.zh...@rivai.ai writes:
> From: Juzhe-Zhong 
>
> This patch is a separate patch preparing for supporting decrement IV.
>
> gcc/ChangeLog:
>
> * cfgloopmanip.cc (create_empty_loop_on_edge): Add PLUS_EXPR.
> * gimple-loop-interchange.cc 
> (tree_loop_interchange::map_inductions_to_loop): Ditto.
> * tree-ssa-loop-ivcanon.cc (create_canonical_iv): Ditto.
> * tree-ssa-loop-ivopts.cc (create_new_iv): Ditto.
> * tree-ssa-loop-manip.cc (create_iv): Ditto.
> (tree_transform_and_unroll_loop): Ditto.
> (canonicalize_loop_ivs): Ditto.
> * tree-ssa-loop-manip.h (create_iv): Ditto.
> * tree-vect-data-refs.cc (vect_create_data_ref_ptr): Ditto.
> * tree-vect-loop-manip.cc (vect_set_loop_controls_directly): Ditto.
> (vect_set_loop_condition_normal): Ditto.
> * tree-vect-loop.cc (vect_create_epilog_for_reduction): Ditto.
> * tree-vect-stmts.cc (vectorizable_store): Ditto.
> (vectorizable_load): Ditto.
>
> ---
>  gcc/cfgloopmanip.cc|  2 +-
>  gcc/gimple-loop-interchange.cc |  2 +-
>  gcc/tree-ssa-loop-ivcanon.cc   |  2 +-
>  gcc/tree-ssa-loop-ivopts.cc|  2 +-
>  gcc/tree-ssa-loop-manip.cc | 18 +-
>  gcc/tree-ssa-loop-manip.h  |  4 ++--
>  gcc/tree-vect-data-refs.cc |  8 
>  gcc/tree-vect-loop-manip.cc|  7 ---
>  gcc/tree-vect-loop.cc  |  2 +-
>  gcc/tree-vect-stmts.cc |  4 ++--
>  10 files changed, 26 insertions(+), 25 deletions(-)
>
> diff --git a/gcc/cfgloopmanip.cc b/gcc/cfgloopmanip.cc
> index 0e3ad8ed742..6e09dcbb0b1 100644
> --- a/gcc/cfgloopmanip.cc
> +++ b/gcc/cfgloopmanip.cc
> @@ -826,7 +826,7 @@ create_empty_loop_on_edge (edge entry_edge,
>  }
>  
>gsi = gsi_last_bb (loop_header);
> -  create_iv (initial_value, stride, iv, loop, , false,
> +  create_iv (initial_value, PLUS_EXPR, stride, iv, loop, , false,
>iv_before, iv_after);
>  
>/* Insert loop exit condition.  */
> diff --git a/gcc/gimple-loop-interchange.cc b/gcc/gimple-loop-interchange.cc
> index 1b77bfd46b2..e5590374e59 100644
> --- a/gcc/gimple-loop-interchange.cc
> +++ b/gcc/gimple-loop-interchange.cc
> @@ -1185,7 +1185,7 @@ tree_loop_interchange::map_inductions_to_loop 
> (loop_cand , loop_cand )
> tree var_before, var_after;
> tree base = unshare_expr (iv->init_expr);
> tree step = unshare_expr (iv->step);
> -   create_iv (base, step, SSA_NAME_VAR (iv->var),
> +   create_iv (base, PLUS_EXPR, step, SSA_NAME_VAR (iv->var),
>tgt.m_loop, _pos, false, _before, _after);
> bitmap_set_bit (m_dce_seeds, SSA_NAME_VERSION (var_before));
> bitmap_set_bit (m_dce_seeds, SSA_NAME_VERSION (var_after));
> diff --git a/gcc/tree-ssa-loop-ivcanon.cc b/gcc/tree-ssa-loop-ivcanon.cc
> index f678de41cb0..6a962a9f503 100644
> --- a/gcc/tree-ssa-loop-ivcanon.cc
> +++ b/gcc/tree-ssa-loop-ivcanon.cc
> @@ -113,7 +113,7 @@ create_canonical_iv (class loop *loop, edge exit, tree 
> niter,
>  niter,
>  build_int_cst (type, 1));
>incr_at = gsi_last_bb (in->src);
> -  create_iv (niter,
> +  create_iv (niter, PLUS_EXPR,
>build_int_cst (type, -1),
>NULL_TREE, loop,
>_at, false, var_before, );
> diff --git a/gcc/tree-ssa-loop-ivopts.cc b/gcc/tree-ssa-loop-ivopts.cc
> index 324703054b5..6fbd2d59318 100644
> --- a/gcc/tree-ssa-loop-ivopts.cc
> +++ b/gcc/tree-ssa-loop-ivopts.cc
> @@ -7267,7 +7267,7 @@ create_new_iv (struct ivopts_data *data, struct iv_cand 
> *cand)
>  
>base = unshare_expr (cand->iv->base);
>  
> -  create_iv (base, unshare_expr (cand->iv->step),
> +  create_iv (base, PLUS_EXPR, unshare_expr (cand->iv->step),
>cand->var_before, data->current_loop,
>_pos, after, >var_before, >var_after);
>  }
> diff --git a/gcc/tree-ssa-loop-manip.cc b/gcc/tree-ssa-loop-manip.cc
> index 598e2189f6c..4a333ddf9e6 100644
> --- a/gcc/tree-ssa-loop-manip.cc
> +++ b/gcc/tree-ssa-loop-manip.cc
> @@ -57,16 +57,16 @@ static bitmap_obstack loop_renamer_obstack;
> VAR_AFTER (unless they are NULL).  */
>  
>  void
> -create_iv (tree base, tree step, tree var, class loop *loop,
> -gimple_stmt_iterator *incr_pos, bool after,
> -tree *var_before, tree *var_after)
> +create_iv (tree base, tree_code incr_op, tree step, tree var, class loop 
> *loop,
> +gimple_stmt_iterator *incr_pos, bool after, tree *var_before,
> +tree *var_after)

Looks like you've lost the hunk that updates the comment.  The one
from the previous patch:

> @@ -47,7 +47,9 @@ along with GCC; see the file COPYING3.  If not see
> so that we can free them all at once.  */
> static bitmap_obstack loop_renamer_obstack;
> -/* Creates an induction variable with value BASE + STEP * iteration in LOOP.
> +/* Creates an induction variable with value BASE (+/-) STEP * iteration in 
> LOOP.
> +   If CODE is PLUS_EXPR, the 

Re: [PATCH v5] Var-Tracking: Typedef pointer_mux as decl_or_value

2023-05-11 Thread Richard Sandiford via Gcc-patches
pan2...@intel.com writes:
> From: Pan Li 
>
> The decl_or_value is defined as void * before this PATCH.  It takes
> care of both tree_node and rtx_def.  Unfortunately, a void pointer
> cannot tell whether the input is a tree_node or an rtx_def.
>
> Then we have an implicit structure layout requirement similar to the
> one below.  Otherwise we will touch unreasonable bits when casting
> void * to tree_node or rtx_def.
>
> +--------+-----------+----------+
> | offset | tree_node | rtx_def  |
> +--------+-----------+----------+
> |      0 | code: 16  | code: 16 | <- require the same location and bitsize
> +--------+-----------+----------+
> |     16 | ...       | mode: 8  |
> +--------+-----------+----------+
> |    ...                        |
> +--------+-----------+----------+
> |     24 | ...       | ...      |
> +--------+-----------+----------+
>
> This behavior blocks the PATCH that extends the rtx_def mode from 8 to
> 16 bits, needed because we are running out of machine modes.  This
> PATCH introduces pointer_mux to tell whether the input is a tree_node
> or an rtx_def, and decouples the above implicit dependency.
>
> Signed-off-by: Pan Li 
> Co-Authored-By: Richard Sandiford 
> Co-Authored-By: Richard Biener 
> Co-Authored-By: Jakub Jelinek 
>
> gcc/ChangeLog:
>
>   * mux-utils.h: Add overload operator == and != for pointer_mux.
>   * var-tracking.cc: Included mux-utils.h for pointer_mux.
>   (decl_or_value): Changed from void * to pointer_mux.
>   (dv_is_decl_p): Reconciled to the new type, aka pointer_mux.
>   (dv_as_decl): Ditto.
>   (dv_as_opaque): Removed as unnecessary.
>   (struct variable_hasher): Take decl_or_value as compare_type.
>   (variable_hasher::equal): Ditto.
>   (dv_from_decl): Reconciled to the new type, aka pointer_mux.
>   (dv_from_value): Ditto.
>   (attrs_list_member):  Ditto.
>   (vars_copy): Ditto.
>   (var_reg_decl_set): Ditto.
>   (var_reg_delete_and_set): Ditto.
>   (find_loc_in_1pdv): Ditto.
>   (canonicalize_values_star): Ditto.
>   (variable_post_merge_new_vals): Ditto.
>   (dump_onepart_variable_differences): Ditto.
>   (variable_different_p): Ditto.
>   (set_slot_part): Ditto.
>   (clobber_slot_part): Ditto.
>   (clobber_variable_part): Ditto.

OK, thanks!

Richard

> ---
>  gcc/mux-utils.h |  4 +++
>  gcc/var-tracking.cc | 85 ++---
>  2 files changed, 37 insertions(+), 52 deletions(-)
>
> diff --git a/gcc/mux-utils.h b/gcc/mux-utils.h
> index a2b6a316899..486d80915b1 100644
> --- a/gcc/mux-utils.h
> +++ b/gcc/mux-utils.h
> @@ -117,6 +117,10 @@ public:
>//  ...use ptr.known_second ()...
>T2 *second_or_null () const;
>  
> +  bool operator == (const pointer_mux &pm) const { return m_ptr == pm.m_ptr; 
> }
> +
> +  bool operator != (const pointer_mux &pm) const { return m_ptr != pm.m_ptr; 
> }
> +
>// Return true if the pointer is a T.
>//
>// This is only valid if T1 and T2 are distinct and if T can be
> diff --git a/gcc/var-tracking.cc b/gcc/var-tracking.cc
> index fae0c73e02f..384084c8b3e 100644
> --- a/gcc/var-tracking.cc
> +++ b/gcc/var-tracking.cc
> @@ -116,6 +116,7 @@
>  #include "fibonacci_heap.h"
>  #include "print-rtl.h"
>  #include "function-abi.h"
> +#include "mux-utils.h"
>  
>  typedef fibonacci_heap  bb_heap_t;
>  
> @@ -197,14 +198,14 @@ struct micro_operation
>  
>  
>  /* A declaration of a variable, or an RTL value being handled like a
> -   declaration.  */
> -typedef void *decl_or_value;
> +   declaration by pointer_mux.  */
> +typedef pointer_mux<tree_node, rtx_def> decl_or_value;
>  
>  /* Return true if a decl_or_value DV is a DECL or NULL.  */
>  static inline bool
>  dv_is_decl_p (decl_or_value dv)
>  {
> -  return !dv || (int) TREE_CODE ((tree) dv) != (int) VALUE;
> +  return dv.is_first ();
>  }
>  
>  /* Return true if a decl_or_value is a VALUE rtl.  */
> @@ -219,7 +220,7 @@ static inline tree
>  dv_as_decl (decl_or_value dv)
>  {
>gcc_checking_assert (dv_is_decl_p (dv));
> -  return (tree) dv;
> +  return dv.known_first ();
>  }
>  
>  /* Return the value in the decl_or_value.  */
> @@ -227,14 +228,7 @@ static inline rtx
>  dv_as_value (decl_or_value dv)
>  {
>gcc_checking_assert (dv_is_value_p (dv));
> -  return (rtx)dv;
> -}
> -
> -/* Return the opaque pointer in the decl_or_value.  */
> -static inline void *
> -dv_as_opaque (decl_or_value dv)
> -{
> -  return dv;
> +  return dv.known_second ();
>  }
>  
>  
> @@ -483,9 +477,9 @@ static void variable_htab_free (void *);
>  
>  struct variable_hasher : pointer_hash 
>  {
> -  typedef void *compare_type;
> +  typedef decl_or_value compare_type;
>static inline hashval_t hash (const variable *);
> -  static inline bool equal (const variable *, const void *);
> +  static inline bool equal (const variable *, const decl_or_value);
>static inline void remove (variable *);
>  };
>  
> @@ -501,11 +495,9 @@ variable_hasher::hash (const variable *v)
>  /* Compare the declaration of variable X with 

Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-10 Thread Richard Sandiford via Gcc-patches
钟居哲  writes:
> I am sorry that I am still confused about that.
>
> Is this what you want ?
>
>   bool use_minus_p = TREE_CODE (step) == INTEGER_CST && ((TYPE_UNSIGNED 
> (TREE_TYPE (step)) && tree_int_cst_lt (step1, step))
>  || (!TYPE_UNSIGNED (TREE_TYPE (step)) && 
> !tree_expr_nonnegative_warnv_p (step, ) && may_negate_without_overflow_p 
> (step)));
>
>   /* For easier readability of the created code, produce MINUS_EXPRs
>  when suitable.  */
>   if (TREE_CODE (step) == INTEGER_CST)
> {
>   if (TYPE_UNSIGNED (TREE_TYPE (step)))
> {
>   step1 = fold_build1 (NEGATE_EXPR, TREE_TYPE (step), step);
>   if (tree_int_cst_lt (step1, step))
> {
>   incr_op = MINUS_EXPR; /* Remove it.  */
>   step = step1;
> }
> }
>   else
> {
>   bool ovf;
>
>   if (!tree_expr_nonnegative_warnv_p (step, )
>   && may_negate_without_overflow_p (step))
> {
>   incr_op = MINUS_EXPR; /* Remove it.  */
>   step = fold_build1 (NEGATE_EXPR, TREE_TYPE (step), step);
> }
> }
> }
>   if (POINTER_TYPE_P (TREE_TYPE (base)))
> {
>   if (TREE_CODE (base) == ADDR_EXPR)
> mark_addressable (TREE_OPERAND (base, 0));
>   step = convert_to_ptrofftype (step);
>   if (incr_op == MINUS_EXPR) /* Change it into if (use_minus_p)  */
> step = fold_build1 (NEGATE_EXPR, TREE_TYPE (step), step);
>   incr_op = POINTER_PLUS_EXPR; /* Remove it.  */
> }
>   /* Gimplify the step if necessary.  We put the computations in front of the
>  loop (i.e. the step should be loop invariant).  */
>   step = force_gimple_operand (step, , true, NULL_TREE);
>   if (stmts)
> gsi_insert_seq_on_edge_immediate (pe, stmts);
>   
>   if (POINTER_TYPE_P (TREE_TYPE (base)))
> stmt = gimple_build_assign (va, POINTER_PLUS_EXPR, vb, step);
>   else if (use_minus_p)
> stmt = gimple_build_assign (va, MINUS_EXPR, vb, step);
>   else
> stmt = gimple_build_assign (va, incr_op, vb, step);
> ...
>
> Since I have no idea to make stmts flips between PLUS_EXPR and MINUS_EXPR.

No, I meant:

- Rename the "code" argument to "incr_op".

- Remove "tree_code incr_op = code;".

- Replace both instances of:

 incr_op = MINUS_EXPR;

  with:

 incr_op = (incr_op == PLUS_EXPR ? MINUS_EXPR : PLUS_EXPR);

The point is that the current code (rightly) assumes that incr_op
always starts out as PLUS_EXPR, i.e. that STEP starts out applying
positively.  Making STEP apply in the opposite direction is then as
simple as changing incr_op to MINUS_EXPR.  But the new interface
allows STEP to start out applying positively or negatively, and so
this code needs to cope with both cases.
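
i.e. a sketch of the intended shape for the unsigned case (not the
final patch):

  if (TYPE_UNSIGNED (TREE_TYPE (step)))
    {
      step1 = fold_build1 (NEGATE_EXPR, TREE_TYPE (step), step);
      if (tree_int_cst_lt (step1, step))
        {
          /* Flip the direction instead of forcing MINUS_EXPR.  */
          incr_op = (incr_op == PLUS_EXPR ? MINUS_EXPR : PLUS_EXPR);
          step = step1;
        }
    }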

Thanks,
Richard


Re: [PATCH v3] Var-Tracking: Typedef pointer_mux as decl_or_value

2023-05-10 Thread Richard Sandiford via Gcc-patches
"Li, Pan2"  writes:
> Thanks Richard Sandiford. Update PATCH v4 here -> 
> https://gcc.gnu.org/pipermail/gcc-patches/2023-May/618099.html.
>
>> -  if (dv_as_opaque (node->dv) != decl || node->offset != offset)
>> +  if (node->dv.first_or_null () != decl || node->offset != 
>> + offset)
>
>> Genuine question, but: is the first_or_null really needed?  I would have 
>> expected node->dv != decl to work, with an implicit conversion on the 
>> argument.
>
> Directly comparing node->dv and decl may require an additional overloaded 
> operator, or it may complain as below. But I am afraid it is unreasonable to 
> add such an operator for one specific type (RTX) in pointer_mux. Thus I think 
> we may need node->dv == (decl_or_val) decl here.
>
> ../../gcc/var-tracking.cc:3233:28: error: no match for 'operator!=' (operand 
> types are 'rtx' {aka 'rtx_def*'} and 'decl_or_value' {aka 
> 'pointer_mux'}).

Yeah, since we're adding operator== and operator!= as member operators,
the decl_or_value has to come first.  Please try the conditions in the
order that I'd written them in the review.
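
A small illustration of the ordering point (decl is a tree, node->dv a
decl_or_value):

  /* OK: the member operator implicitly converts DECL on the right.  */
  if (node->dv != decl || node->offset != offset)
    ...

  /* Not OK without a free (non-member) operator: no implicit
     conversion is applied to the left-hand operand.  */
  if (decl != node->dv)
    ...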

Thanks,
Richard


Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-10 Thread Richard Sandiford via Gcc-patches
钟居哲  writes:
> Thanks Richard.
> I am planning to separate out a patch with only the create_iv changes.
>
> Are you suggesting that I remove "tree_code incr_op = code;"
> Use the argument directly ?
>
> I saw the codes here:
>
>   /* For easier readability of the created code, produce MINUS_EXPRs
>  when suitable.  */
>   if (TREE_CODE (step) == INTEGER_CST)
> {
>   if (TYPE_UNSIGNED (TREE_TYPE (step)))
> {
>   step1 = fold_build1 (NEGATE_EXPR, TREE_TYPE (step), step);
>   if (tree_int_cst_lt (step1, step))
> {
>   incr_op = MINUS_EXPR;
>   step = step1;
> }
> }
>   else
> {
>   bool ovf;
>
>   if (!tree_expr_nonnegative_warnv_p (step, )
>   && may_negate_without_overflow_p (step))
> {
>   incr_op = MINUS_EXPR;
>   step = fold_build1 (NEGATE_EXPR, TREE_TYPE (step), step);
> }
> }
> }
>   if (POINTER_TYPE_P (TREE_TYPE (base)))
> {
>   if (TREE_CODE (base) == ADDR_EXPR)
> mark_addressable (TREE_OPERAND (base, 0));
>   step = convert_to_ptrofftype (step);
>   if (incr_op == MINUS_EXPR)
> step = fold_build1 (NEGATE_EXPR, TREE_TYPE (step), step);
>   incr_op = POINTER_PLUS_EXPR;
> }
>   /* Gimplify the step if necessary.  We put the computations in front of the
>  loop (i.e. the step should be loop invariant).  */
>   step = force_gimple_operand (step, , true, NULL_TREE);
>   if (stmts)
> gsi_insert_seq_on_edge_immediate (pe, stmts);
>
>   stmt = gimple_build_assign (va, incr_op, vb, step);
> ...
>
> It seems that there are complicated conditions here that change the value of 
> the variable "incr_op".
> That's why I define a temporary variable "tree_code incr_op = code;" here and
> let the following codes change the value of "incr_op".
>
> Could you give me some hints of dealing with this piece of code to get rid of 
> "tree_code incr_op = code;" ?

Yeah, but like I said in the review, those later:
 
  incr_op = MINUS_EXPR;
 
stmts need to be updated to something that flips between PLUS_EXPR
and MINUS_EXPR (with updates to the comments).  Just leaving them
as-is is incorrect (in cases where the caller passed MINUS_EXPR
rather than PLUS_EXPR).

The POINTER_PLUS_EXPR handling is fine due to the conditional
negate beforehand.

Thanks,
Richard


Re: [vxworks] [testsuite] [aarch64] use builtin in pred-not-gen-4.c

2023-05-10 Thread Richard Sandiford via Gcc-patches
Alexandre Oliva via Gcc-patches  writes:
> On vxworks, isunordered is defined as a macro that ultimately calls a
> _Fpcomp function, that GCC doesn't recognize as a builtin, so it
> can't optimize accordingly.
>
> Use __builtin_isunordered instead to get the desired code for the
> test.
>
> Regstrapped on x86_64-linux-gnu.  Also tested on aarch64-vx7r2 with
> gcc-12.  Ok to install?
>
>
> for  gcc/testsuite/ChangeLog
>
>   * gcc.target/aarch64/pred-not-gen-4.c: Drop math.h include,
>   call builtin.

OK, thanks.

Richard

> ---
>  .../gcc.target/aarch64/sve/pred-not-gen-4.c|4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
>
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pred-not-gen-4.c 
> b/gcc/testsuite/gcc.target/aarch64/sve/pred-not-gen-4.c
> index 0001dd3fc211f..1845bd3f0f704 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/pred-not-gen-4.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/pred-not-gen-4.c
> @@ -1,12 +1,10 @@
>  /* { dg-do compile } */
>  /* { dg-options "-O3" } */
>  
> -#include 
> -
>  void f13(double * restrict z, double * restrict w, double * restrict x, 
> double * restrict y, int n)
>  {
>  for (int i = 0; i < n; i++) {
> -z[i] = (isunordered(w[i], 0)) ? x[i] + w[i] : y[i] - w[i];
> +z[i] = (__builtin_isunordered(w[i], 0)) ? x[i] + w[i] : y[i] - w[i];
>  }
>  }


Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-10 Thread Richard Sandiford via Gcc-patches
In addition to Jeff's comments:

juzhe.zh...@rivai.ai writes:
> [...]
> diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
> index cc4a93a8763..99cf0cdbdca 100644
> --- a/gcc/doc/md.texi
> +++ b/gcc/doc/md.texi
> @@ -4974,6 +4974,40 @@ for (i = 1; i < operand3; i++)
>operand0[i] = operand0[i - 1] && (operand1 + i < operand2);
>  @end smallexample
>  
> +@cindex @code{select_vl@var{m}} instruction pattern
> +@item @code{select_vl@var{m}}
> +Set operand 0 to the number of active elements in vector will be updated 
> value.
> +operand 1 is the total elements need to be updated value.
> +operand 2 is the vectorization factor.
> +The value of operand 0 is target dependent and flexible in each iteration.
> +The operation of this pattern can be:
> +
> +@smallexample
> +Case 1:
> +operand0 = MIN (operand1, operand2);
> +operand2 can be const_poly_int or poly_int related to vector mode size.
> +Some target like RISC-V has a standalone instruction to get MIN (n, MODE 
> SIZE) so
> +that we can reduce a use of general purpose register.
> +
> +In this case, only the last iteration of the loop is partial iteration.
> +@end smallexample
> +
> +@smallexample
> +Case 2:
> +if (operand1 <= operand2)
> +  operand0 = operand1;
> +else if (operand1 < 2 * operand2)
> +  operand0 = IN_RANGE (ceil (operand1 / 2), operand2);

GCC's IN_RANGE is a predicate, so it would be best to avoid that here.
Why isn't it simply ceil (operand1 / 2), which must be <= operand2?
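
A worked example of the even split (numbers purely illustrative):

  operand2 (the VF)    = 8
  operand1 (remaining) = 10, so operand2 < operand1 < 2 * operand2
  ceil (10 / 2)        = 5, which is <= 8
  => the last two iterations process 5 and 5 elements instead of 8 and 2.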

> +else
> +  operand0 = operand2;
> +
> +This case will evenly distribute work over the last 2 iterations of a 
> stripmine loop.
> +@end smallexample
> +
> +The output of this pattern is not only used as the IV of the loop control
> +counter, but also as the IV of the address calculation (with multiply/shift
> +operations).  This allows us to dynamically adjust the number of elements
> +processed in each iteration of the loop.
> +
>  @cindex @code{check_raw_ptrs@var{m}} instruction pattern
>  @item @samp{check_raw_ptrs@var{m}}
>  Check whether, given two pointers @var{a} and @var{b} and a length @var{len},
> [...]
> diff --git a/gcc/tree-ssa-loop-manip.cc b/gcc/tree-ssa-loop-manip.cc
> index 909b705d00d..5abca64379e 100644
> --- a/gcc/tree-ssa-loop-manip.cc
> +++ b/gcc/tree-ssa-loop-manip.cc
> @@ -47,7 +47,9 @@ along with GCC; see the file COPYING3.  If not see
> so that we can free them all at once.  */
>  static bitmap_obstack loop_renamer_obstack;
>  
> -/* Creates an induction variable with value BASE + STEP * iteration in LOOP.
> +/* Creates an induction variable with value BASE (+/-) STEP * iteration in 
> LOOP.
> +   If CODE is PLUS_EXPR, the induction variable is BASE + STEP * iteration.
> +   If CODE is MINUS_EXPR, the induction variable is BASE - STEP * iteration.
> It is expected that neither BASE nor STEP are shared with other 
> expressions
> (unless the sharing rules allow this).  Use VAR as a base var_decl for it
> (if NULL, a new temporary will be created).  The increment will occur at
> @@ -57,8 +59,8 @@ static bitmap_obstack loop_renamer_obstack;
> VAR_AFTER (unless they are NULL).  */
>  
>  void
> -create_iv (tree base, tree step, tree var, class loop *loop,
> -gimple_stmt_iterator *incr_pos, bool after,
> +create_iv (tree base, tree_code code, tree step, tree var,
> +class loop *loop, gimple_stmt_iterator *incr_pos, bool after,
>  tree *var_before, tree *var_after)
>  {
>gassign *stmt;
> @@ -66,7 +68,9 @@ create_iv (tree base, tree step, tree var, class loop *loop,
>tree initial, step1;
>gimple_seq stmts;
>tree vb, va;
> -  enum tree_code incr_op = PLUS_EXPR;
> +  /* The code can only be PLUS_EXPR or MINUS_EXPR.  */
> +  gcc_assert (code == PLUS_EXPR || code == MINUS_EXPR);
> +  tree_code incr_op = code;

As Richard said, we should be able to get rid of incr_op, probably
by calling the parameter incr_op.

I think the later:

  incr_op = MINUS_EXPR;

stmts need to be updated to something that flips between PLUS_EXPR
and MINUS_EXPR (with updates to the comments).

It would probably make sense to split the create_iv part out as a
separate prepatch.

>edge pe = loop_preheader_edge (loop);
>  
>if (var != NULL_TREE)
> @@ -1365,7 +1369,7 @@ tree_transform_and_unroll_loop (class loop *loop, 
> unsigned factor,
>tree ctr_before, ctr_after;
>gimple_stmt_iterator bsi = gsi_last_nondebug_bb (new_exit->src);
>exit_if = as_a  (gsi_stmt (bsi));
> -  create_iv (exit_base, exit_step, NULL_TREE, loop,
> +  create_iv (exit_base, PLUS_EXPR, exit_step, NULL_TREE, loop,
>, false, _before, _after);
>gimple_cond_set_code (exit_if, exit_cmp);
>gimple_cond_set_lhs (exit_if, ctr_after);
> @@ -1580,8 +1584,8 @@ canonicalize_loop_ivs (class loop *loop, tree *nit, 
> bool bump_in_latch)
>  gsi = gsi_last_bb (loop->latch);
>else
>  gsi = gsi_last_nondebug_bb (loop->header);
> -  create_iv (build_int_cst_type (type, 0), 

Re: [PATCH v3] Var-Tracking: Typedef pointer_mux as decl_or_value

2023-05-10 Thread Richard Sandiford via Gcc-patches
Thanks, mostly looks good to me.  Some minor comments below.

pan2...@intel.com writes:
> From: Pan Li 
>
> The decl_or_value is defined as void * before this PATCH.  It takes
> care of both tree_node and rtx_def.  Unfortunately, a void pointer
> cannot tell whether the input is a tree_node or an rtx_def.
>
> Then we have an implicit structure layout requirement similar to the
> one below.  Otherwise we will touch unreasonable bits when casting
> void * to tree_node or rtx_def.
>
> +--------+-----------+----------+
> | offset | tree_node | rtx_def  |
> +--------+-----------+----------+
> |      0 | code: 16  | code: 16 | <- require the location and bitsize
> +--------+-----------+----------+
> |     16 | ...       | mode: 8  |
> +--------+-----------+----------+
> |    ...                        |
> +--------+-----------+----------+
> |     24 | ...       | ...      |
> +--------+-----------+----------+
>
> This behavior blocks the PATCH that extends the rtx_def mode from 8 to
> 16 bits, needed because we are running out of machine modes.  This
> PATCH introduces pointer_mux to tell whether the input is a tree_node
> or an rtx_def, and decouples the above implicit dependency.
>
> Signed-off-by: Pan Li 
> Co-Authored-By: Richard Sandiford 
> Co-Authored-By: Richard Biener 
> Co-Authored-By: Jakub Jelinek 
>
> gcc/ChangeLog:
>
>   * mux-utils.h: Add overload operator == and != for pointer_mux.
>   * var-tracking.cc: Included mux-utils.h for pointer_mux.
>   (decl_or_value): Changed from void * to pointer_mux.
>   (dv_is_decl_p): Reconciled to the new type, aka pointer_mux.
>   (dv_as_decl): Ditto.
>   (dv_as_opaque): Removed as unnecessary.
>   (struct variable_hasher): Take decl_or_value as compare_type.
>   (variable_hasher::equal): Ditto.
>   (dv_from_decl): Reconciled to the new type, aka pointer_mux.
>   (dv_from_value): Ditto.
>   (attrs_list_member): Ditto.
>   (vars_copy): Ditto.
>   (var_reg_decl_set): Ditto.
>   (var_reg_delete_and_set): Ditto.
>   (find_loc_in_1pdv): Ditto.
>   (canonicalize_values_star): Ditto.
>   (variable_post_merge_new_vals): Ditto.
>   (dump_onepart_variable_differences): Ditto.
>   (variable_different_p): Ditto.
>   (variable_was_changed): Ditto.
>   (set_slot_part): Ditto.
>   (clobber_slot_part): Ditto.
>   (clobber_variable_part): Ditto.
>   (remove_value_from_changed_variables): Ditto.
>   (notify_dependents_of_changed_value): Ditto.
> ---
>  gcc/mux-utils.h | 12 ++
>  gcc/var-tracking.cc | 96 ++---
>  2 files changed, 51 insertions(+), 57 deletions(-)
>
> diff --git a/gcc/mux-utils.h b/gcc/mux-utils.h
> index a2b6a316899..adf3d3b722b 100644
> --- a/gcc/mux-utils.h
> +++ b/gcc/mux-utils.h
> @@ -72,6 +72,18 @@ public:
>// Return true unless the pointer is a null A pointer.
>explicit operator bool () const { return m_ptr; }
>  
> +  // Return true if class has the same m_ptr, or false.
> +  bool operator == (const pointer_mux &other) const
> +{
> +  return this->m_ptr == other.m_ptr;
> +}
> +
> +  // Return true if class has the different m_ptr, or false.
> +  bool operator != (const pointer_mux &other) const
> +{
> +  return this->m_ptr != other.m_ptr;
> +}
> +

The current code tries to follow the coding standard rule that functions
should be defined outside the class if the whole thing doesn't fit on
one line.  Admittedly that's not widely followed, but we might as well
continue to stick to it here.

The comment shouldn't talk about m_ptr, since that's an internal
implementation detail rather than a user-facing thing.  I think it's
OK to leave the functions uncommented, since it's obvious what ==
and != do.
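
For reference, a sketch of the out-of-class form (the parameter name is
illustrative):

  template<typename T1, typename T2>
  inline bool
  pointer_mux<T1, T2>::operator== (const pointer_mux &other) const
  {
    return m_ptr == other.m_ptr;
  }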

>// Assign A and B pointers respectively.
>void set_first (T1 *ptr) { *this = first (ptr); }
>void set_second (T2 *ptr) { *this = second (ptr); }
> diff --git a/gcc/var-tracking.cc b/gcc/var-tracking.cc
> index fae0c73e02f..7a35f49020a 100644
> --- a/gcc/var-tracking.cc
> +++ b/gcc/var-tracking.cc
> @@ -116,6 +116,7 @@
>  #include "fibonacci_heap.h"
>  #include "print-rtl.h"
>  #include "function-abi.h"
> +#include "mux-utils.h"
>  
>  typedef fibonacci_heap  bb_heap_t;
>  
> @@ -197,14 +198,14 @@ struct micro_operation
>  
>  
>  /* A declaration of a variable, or an RTL value being handled like a
> -   declaration.  */
> -typedef void *decl_or_value;
> +   declaration by pointer_mux.  */
> +typedef pointer_mux<tree_node, rtx_def> decl_or_value;
>  
>  /* Return true if a decl_or_value DV is a DECL or NULL.  */
>  static inline bool
>  dv_is_decl_p (decl_or_value dv)
>  {
> -  return !dv || (int) TREE_CODE ((tree) dv) != (int) VALUE;
> +  return dv.is_first ();
>  }
>  
>  /* Return true if a decl_or_value is a VALUE rtl.  */
> @@ -219,7 +220,7 @@ static inline tree
>  dv_as_decl (decl_or_value dv)
>  {
>gcc_checking_assert (dv_is_decl_p (dv));
> -  return (tree) dv;
> +  return dv.known_first ();
>  }
>  
>  /* Return the value in the 

Re: [PATCH] vect: Missed opportunity to use [SU]ABD

2023-05-10 Thread Richard Sandiford via Gcc-patches
Richard Biener  writes:
> On Wed, May 10, 2023 at 11:49 AM Richard Biener
>  wrote:
>>
>> On Wed, May 10, 2023 at 11:01 AM Richard Sandiford
>>  wrote:
>> >
>> > Oluwatamilore Adebayo  writes:
>> > > From 0b5f469171c340ef61a48a31877d495bb77bd35f Mon Sep 17 00:00:00 2001
>> > > From: oluade01 
>> > > Date: Fri, 14 Apr 2023 10:24:43 +0100
>> > > Subject: [PATCH 1/4] Missed opportunity to use [SU]ABD
>> > >
>> > > This adds a recognition pattern for the non-widening
>> > > absolute difference (ABD).
>> > >
>> > > gcc/ChangeLog:
>> > >
>> > > * doc/md.texi (sabd, uabd): Document them.
>> > > * internal-fn.def (ABD): Use new optab.
>> > > * optabs.def (sabd_optab, uabd_optab): New optabs,
>> > > * tree-vect-patterns.cc (vect_recog_absolute_difference):
>> > > Recognize the following idiom abs (a - b).
>> > > (vect_recog_sad_pattern): Refactor to use
>> > > vect_recog_absolute_difference.
>> > > (vect_recog_abd_pattern): Use patterns found by
>> > > vect_recog_absolute_difference to build a new ABD
>> > > internal call.
>> > > ---
>> > >  gcc/doc/md.texi   |  10 ++
>> > >  gcc/internal-fn.def   |   3 +
>> > >  gcc/optabs.def|   2 +
>> > >  gcc/tree-vect-patterns.cc | 250 +-
>> > >  4 files changed, 234 insertions(+), 31 deletions(-)
>> > >
>> > > diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
>> > > index 
>> > > 07bf8bdebffb2e523f25a41f2b57e43c0276b745..0ad546c63a8deebb4b6db894f437d1e21f0245a8
>> > >  100644
>> > > --- a/gcc/doc/md.texi
>> > > +++ b/gcc/doc/md.texi
>> > > @@ -5778,6 +5778,16 @@ Other shift and rotate instructions, analogous to 
>> > > the
>> > >  Vector shift and rotate instructions that take vectors as operand 2
>> > >  instead of a scalar type.
>> > >
>> > > +@cindex @code{uabd@var{m}} instruction pattern
>> > > +@cindex @code{sabd@var{m}} instruction pattern
>> > > +@item @samp{uabd@var{m}}, @samp{sabd@var{m}}
>> > > +Signed and unsigned absolute difference instructions.  These
>> > > +instructions find the difference between operands 1 and 2
>> > > +then return the absolute value.  A C code equivalent would be:
>> > > +@smallexample
>> > > +op0 = abs (op0 - op1)
>> >
>> > op0 = abs (op1 - op2)
>> >
>> > But that isn't the correct calculation for unsigned (where abs doesn't
>> > really work).  It also doesn't handle some cases correctly for signed.
>> >
>> > I think it's more:
>> >
>> >   op0 = op1 > op2 ? (unsigned type) op1 - op2 : (unsigned type) op2 - op1
>> >
>> > or (conceptually) max minus min.
>> >
>> > E.g. for 16-bit values, the absolute difference between signed 0x7fff
>> > and signed -0x8000 is 0xffff (reinterpreted as -1 if you cast back
>> > to signed).  But, ignoring undefined behaviour:
>> >
>> >   0x7fff - 0x8000 = -1
>> >   abs(-1) = 1
>> >
>> > which gives the wrong answer.
>> >
>> > We might still be able to fold C abs(a - b) to abd for signed a and b
>> > by relying on undefined behaviour (TYPE_OVERFLOW_UNDEFINED).  But we
>> > can't do it for -fwrapv.
>> >
>> > Richi knows better than me what would be appropriate here.
>>
>> The question is what does the hardware do?  For the widening [us]sad it's
>> obvious since the difference is computed in a wider signed mode and the
>> absolute value always fits.
>>
>> So what does it actually do, esp. when the difference yields 0x8000?
>
> A "sensible" definition would be that it works like the widening [us]sad
> and applies truncation to the result (modulo-reducing when the result
> isn't always unsigned).

Yeah.  Like Tami says, this is what the instruction does.

I think all three definitions are equivalent: the extend/operate/truncate
one, the ?: one above, and the "max - min" one.  Probably just personal
preference as to which seems more natural.
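
For the unsigned case, a quick standalone illustration of the
equivalence (just a sketch, not part of the patch):

  #include <stdint.h>

  /* extend, subtract, take the absolute value, truncate.  */
  static uint16_t abd_widen (uint16_t a, uint16_t b)
  {
    int32_t d = (int32_t) a - (int32_t) b;
    return (uint16_t) (d < 0 ? -d : d);
  }

  /* the ?: form above.  */
  static uint16_t abd_select (uint16_t a, uint16_t b)
  {
    return a > b ? a - b : b - a;
  }

  /* max minus min.  */
  static uint16_t abd_minmax (uint16_t a, uint16_t b)
  {
    return (a > b ? a : b) - (a > b ? b : a);
  }

All three agree for every pair of uint16_t inputs.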

Reading the patch again, it does check TYPE_OVERFLOW_WRAPS, so -fwrapv
might be handled correctly after all.  Sorry for missing it first time.

On the patch:

> +/* Look for the following pattern
> + X = x[i]
> + Y = y[i]
> + DIFF = X - Y
> + DAD = ABS_EXPR <DIFF>
> + */
> +static bool
> +vect_recog_absolute_difference (vec_info *vinfo, gassign *abs_stmt,
> + tree *half_type, bool reject_unsigned,
> + vect_unpromoted_value unprom[2],
> + tree diff_oprnds[2])

It would be good to document what the parameters mean (except VINFO,
which is obvious).

> +  /* Peel off conversions from the ABS input.  This can involve sign
> + changes (e.g.  from an unsigned subtraction to a signed ABS input)
> + or signed promotion, but it can't include unsigned promotion.
> + (Note that ABS of an unsigned promotion should have been folded
> + away before now anyway.)  */
> +  vect_unpromoted_value unprom_diff;
> +  abs_oprnd = vect_look_through_possible_promotion (vinfo, abs_oprnd,
> + &unprom_diff);
> +  if 

Re: [PATCH v2] Var-Tracking: Typedef pointer_mux as decl_or_value

2023-05-10 Thread Richard Sandiford via Gcc-patches
Jakub Jelinek  writes:
> On Wed, May 10, 2023 at 07:57:05PM +0800, pan2...@intel.com wrote:
>> --- a/gcc/var-tracking.cc
>> +++ b/gcc/var-tracking.cc
>> @@ -116,9 +116,14 @@
>>  #include "fibonacci_heap.h"
>>  #include "print-rtl.h"
>>  #include "function-abi.h"
>> +#include "mux-utils.h"
>>  
>>  typedef fibonacci_heap <long, basic_block_def> bb_heap_t;
>>  
>> +/* A declaration of a variable, or an RTL value being handled like a
>> +   declaration by pointer_mux.  */
>> +typedef pointer_mux<tree_node, rtx_def> decl_or_value;
>> +
>>  /* var-tracking.cc assumes that tree code with the same value as VALUE rtx 
>> code
>> has no chance to appear in REG_EXPR/MEM_EXPRs and isn't a decl.
>> Currently the value is the same as IDENTIFIER_NODE, which has such
>> @@ -196,15 +201,11 @@ struct micro_operation
>>  };
>>  
>>  
>> -/* A declaration of a variable, or an RTL value being handled like a
>> -   declaration.  */
>> -typedef void *decl_or_value;
>
> Why do you move the typedef?
>
>> @@ -503,9 +505,7 @@ variable_hasher::hash (const variable *v)
>>  inline bool
>>  variable_hasher::equal (const variable *v, const void *y)
>>  {
>> -  decl_or_value dv = CONST_CAST2 (decl_or_value, const void *, y);
>> -
>> -  return (dv_as_opaque (v->dv) == dv_as_opaque (dv));
>> +  return dv_as_opaque (v->dv) == y;
>>  }
>
> I'm not convinced this is correct.  I think all the find_slot_with_hash
> etc. pass in a decl_or_value, so I'd expect y to have decl_or_value
> type or something similar.
>
>>  /* Free the element of VARIABLE_HTAB (its type is struct variable_def).  */
>> @@ -1396,8 +1396,7 @@ onepart_pool_allocate (onepart_enum onepart)
>>  static inline decl_or_value
>>  dv_from_decl (tree decl)
>>  {
>> -  decl_or_value dv;
>> -  dv = decl;
>> +  decl_or_value dv = decl_or_value::first (decl);
>
> Can't you just decl_or_value dv = decl; ?  I think pointer_mux has ctors
> from pointers to the template parameter types.
>
>>gcc_checking_assert (dv_is_decl_p (dv));
>>return dv;
>>  }
>> @@ -1406,8 +1405,7 @@ dv_from_decl (tree decl)
>>  static inline decl_or_value
>>  dv_from_value (rtx value)
>>  {
>> -  decl_or_value dv;
>> -  dv = value;
>> +  decl_or_value dv = decl_or_value::second (value);
>
> Ditto.
>
>> @@ -1661,7 +1659,8 @@ shared_hash_find_slot_unshare_1 (shared_hash **pvars, 
>> decl_or_value dv,
>>  {
>>if (shared_hash_shared (*pvars))
>>  *pvars = shared_hash_unshare (*pvars);
>> -  return shared_hash_htab (*pvars)->find_slot_with_hash (dv, dvhash, ins);
>> +  return shared_hash_htab (*pvars)->find_slot_with_hash (dv_as_opaque (dv),
>> + dvhash, ins);
>
> Then you wouldn't need to change all these.

Also, please do try changing variable_hasher::compare_type to
decl_or_value, and changing the type of the second parameter to
variable_hasher::equal accordingly.  I still feel that we should
be able to get rid of dv_as_opaque entirely.
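
Concretely, I was imagining something like this (an untested sketch,
and assuming the pointer_mux equality operators discussed elsewhere in
this series):

  struct variable_hasher : pointer_hash <variable>
  {
    typedef decl_or_value compare_type;
    static inline hashval_t hash (const variable *);
    static inline bool equal (const variable *, const decl_or_value);
    static inline void remove (variable *);
  };

  inline bool
  variable_hasher::equal (const variable *v, const decl_or_value y)
  {
    return v->dv == y;
  }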

Thanks,
Richard


Re: [PATCH] vect: Missed opportunity to use [SU]ABD

2023-05-10 Thread Richard Sandiford via Gcc-patches
Oluwatamilore Adebayo  writes:
> From 0b5f469171c340ef61a48a31877d495bb77bd35f Mon Sep 17 00:00:00 2001
> From: oluade01 
> Date: Fri, 14 Apr 2023 10:24:43 +0100
> Subject: [PATCH 1/4] Missed opportunity to use [SU]ABD
>
> This adds a recognition pattern for the non-widening
> absolute difference (ABD).
>
> gcc/ChangeLog:
>
> * doc/md.texi (sabd, uabd): Document them.
> * internal-fn.def (ABD): Use new optab.
> * optabs.def (sabd_optab, uabd_optab): New optabs,
> * tree-vect-patterns.cc (vect_recog_absolute_difference):
> Recognize the following idiom abs (a - b).
> (vect_recog_sad_pattern): Refactor to use
> vect_recog_absolute_difference.
> (vect_recog_abd_pattern): Use patterns found by
> vect_recog_absolute_difference to build a new ABD
> internal call.
> ---
>  gcc/doc/md.texi   |  10 ++
>  gcc/internal-fn.def   |   3 +
>  gcc/optabs.def|   2 +
>  gcc/tree-vect-patterns.cc | 250 +-
>  4 files changed, 234 insertions(+), 31 deletions(-)
>
> diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
> index 
> 07bf8bdebffb2e523f25a41f2b57e43c0276b745..0ad546c63a8deebb4b6db894f437d1e21f0245a8
>  100644
> --- a/gcc/doc/md.texi
> +++ b/gcc/doc/md.texi
> @@ -5778,6 +5778,16 @@ Other shift and rotate instructions, analogous to the
>  Vector shift and rotate instructions that take vectors as operand 2
>  instead of a scalar type.
>
> +@cindex @code{uabd@var{m}} instruction pattern
> +@cindex @code{sabd@var{m}} instruction pattern
> +@item @samp{uabd@var{m}}, @samp{sabd@var{m}}
> +Signed and unsigned absolute difference instructions.  These
> +instructions find the difference between operands 1 and 2
> +then return the absolute value.  A C code equivalent would be:
> +@smallexample
> +op0 = abs (op0 - op1)

op0 = abs (op1 - op2)

But that isn't the correct calculation for unsigned (where abs doesn't
really work).  It also doesn't handle some cases correctly for signed.

I think it's more:

  op0 = op1 > op2 ? (unsigned type) op1 - op2 : (unsigned type) op2 - op1

or (conceptually) max minus min.

E.g. for 16-bit values, the absolute difference between signed 0x7fff
and signed -0x8000 is 0xffff (reinterpreted as -1 if you cast back
to signed).  But, ignoring undefined behaviour:

  0x7fff - 0x8000 = -1
  abs(-1) = 1

which gives the wrong answer.
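
For concreteness, a quick standalone check of that example (not part of
the patch; the int16_t narrowing relies on the usual two's-complement
behaviour):

  #include <stdio.h>
  #include <stdint.h>

  int
  main (void)
  {
    int16_t a = 0x7fff;
    int16_t b = -0x7fff - 1;                      /* -0x8000 */
    /* max - min: the true absolute difference, 0xffff.  */
    uint16_t abd = a > b ? (uint16_t) (a - b) : (uint16_t) (b - a);
    /* abs of the wrapped 16-bit difference: -1 -> 1.  */
    int16_t wrapped = (int16_t) (uint16_t) (a - b);
    uint16_t naive = wrapped < 0 ? (uint16_t) -wrapped : (uint16_t) wrapped;
    printf ("abd = %u, naive = %u\n", (unsigned) abd, (unsigned) naive);
    return 0;
  }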

We might still be able to fold C abs(a - b) to abd for signed a and b
by relying on undefined behaviour (TYPE_OVERFLOW_UNDEFINED).  But we
can't do it for -fwrapv.

Richi knows better than me what would be appropriate here.

Thanks,
Richard

> +@end smallexample
> +
>  @cindex @code{avg@var{m}3_floor} instruction pattern
>  @cindex @code{uavg@var{m}3_floor} instruction pattern
>  @item @samp{avg@var{m}3_floor}
> diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
> index 
> 7fe742c2ae713e7152ab05cfdfba86e4e0aa3456..0f1724ecf37a31c231572edf90b5577e2d82f468
>  100644
> --- a/gcc/internal-fn.def
> +++ b/gcc/internal-fn.def
> @@ -167,6 +167,9 @@ DEF_INTERNAL_OPTAB_FN (FMS, ECF_CONST, fms, ternary)
>  DEF_INTERNAL_OPTAB_FN (FNMA, ECF_CONST, fnma, ternary)
>  DEF_INTERNAL_OPTAB_FN (FNMS, ECF_CONST, fnms, ternary)
>
> +DEF_INTERNAL_SIGNED_OPTAB_FN (ABD, ECF_CONST | ECF_NOTHROW, first,
> + sabd, uabd, binary)
> +
>  DEF_INTERNAL_SIGNED_OPTAB_FN (AVG_FLOOR, ECF_CONST | ECF_NOTHROW, first,
>   savg_floor, uavg_floor, binary)
>  DEF_INTERNAL_SIGNED_OPTAB_FN (AVG_CEIL, ECF_CONST | ECF_NOTHROW, first,
> diff --git a/gcc/optabs.def b/gcc/optabs.def
> index 
> 695f5911b300c9ca5737de9be809fa01aabe5e01..29bc92281a2175f898634cbe6af63c18021e5268
>  100644
> --- a/gcc/optabs.def
> +++ b/gcc/optabs.def
> @@ -359,6 +359,8 @@ OPTAB_D (mask_fold_left_plus_optab, 
> "mask_fold_left_plus_$a")
>  OPTAB_D (extract_last_optab, "extract_last_$a")
>  OPTAB_D (fold_extract_last_optab, "fold_extract_last_$a")
>
> +OPTAB_D (uabd_optab, "uabd$a3")
> +OPTAB_D (sabd_optab, "sabd$a3")
>  OPTAB_D (savg_floor_optab, "avg$a3_floor")
>  OPTAB_D (uavg_floor_optab, "uavg$a3_floor")
>  OPTAB_D (savg_ceil_optab, "avg$a3_ceil")
> diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
> index 
> a49b09539776c0056e77f99b10365d0a8747fbc5..91e1f9d4b610275dd833ec56dc77f76367ee7886
>  100644
> --- a/gcc/tree-vect-patterns.cc
> +++ b/gcc/tree-vect-patterns.cc
> @@ -770,6 +770,89 @@ vect_split_statement (vec_info *vinfo, stmt_vec_info 
> stmt2_info, tree new_rhs,
>  }
>  }
>
> +/* Look for the following pattern
> +   X = x[i]
> +   Y = y[i]
> +   DIFF = X - Y
> +   DAD = ABS_EXPR <DIFF>
> + */
> +static bool
> +vect_recog_absolute_difference (vec_info *vinfo, gassign *abs_stmt,
> +   tree *half_type, bool reject_unsigned,
> +   vect_unpromoted_value unprom[2],
> +   tree diff_oprnds[2])
> +{
> +  if (!abs_stmt)
> +return 

Re: [PATCH] Var-Tracking: Leverage pointer_mux for decl_or_value

2023-05-10 Thread Richard Sandiford via Gcc-patches
Richard Biener  writes:
> On Wed, 10 May 2023, pan2...@intel.com wrote:
>
>> From: Pan Li 
>> 
>> The decl_or_value type is defined as void * before this PATCH.  It has
>> to take care of both tree_node and rtx_def, but a plain void pointer
>> cannot tell whether the input is a tree_node or an rtx_def.
>> 
>> That leaves us with an implicit structure-layout requirement, similar
>> to the one below; otherwise we would inspect the wrong bits when
>> casting the void * to tree_node or rtx_def.
>> 
>> +--------+-----------+----------+
>> | offset | tree_node | rtx_def  |
>> +--------+-----------+----------+
>> |      0 | code: 16  | code: 16 | <- requires the same location and bit size
>> +--------+-----------+----------+
>> |     16 | ...       | mode: 8  |
>> +--------+-----------+----------+
>> | ...    | ...       | ...      |
>> +--------+-----------+----------+
>> |     24 | ...       | ...      |
>> +--------+-----------+----------+
>> 
>> This behavior blocks the PATCH that extends the rtx_def mode from 8 to
>> 16 bits, which is needed because we are running out of machine modes.
>> This PATCH introduces pointer_mux to tell whether the input is a
>> tree_node or an rtx_def, and decouples the implicit dependency above.
>> 
>> Signed-off-by: Pan Li 
>> Co-Authored-By: Richard Sandiford 
>> Co-Authored-By: Richard Biener 
>> 
>> gcc/ChangeLog:
>> 
>>  * var-tracking.cc (DECL_OR_VALUE_OR_DEFAULT): New macro for
>>clean code.
>>  (dv_is_decl_p): Adjust type changes to pointer_mux.
>>  (dv_as_decl): Likewise.
>>  (dv_as_value): Likewise.
>>  (dv_as_opaque): Likewise.
>>  (variable_hasher::equal): Likewise.
>>  (dv_from_decl): Likewise.
>>  (dv_from_value): Likewise.
>>  (shared_hash_find_slot_unshare_1): Likewise.
>>  (shared_hash_find_slot_1): Likewise.
>>  (shared_hash_find_slot_noinsert_1): Likewise.
>>  (shared_hash_find_1): Likewise.
>>  (unshare_variable): Likewise.
>>  (vars_copy): Likewise.
>>  (find_loc_in_1pdv): Likewise.
>>  (find_mem_expr_in_1pdv): Likewise.
>>  (dataflow_set_different): Likewise.
>>  (variable_from_dropped): Likewise.
>>  (variable_was_changed): Likewise.
>>  (loc_exp_insert_dep): Likewise.
>>  (notify_dependents_of_resolved_value): Likewise.
>>  (vt_expand_loc_callback): Likewise.
>>  (remove_value_from_changed_variables): Likewise.
>>  (notify_dependents_of_changed_value): Likewise.
>>  (emit_notes_for_differences_1): Likewise.
>>  (emit_notes_for_differences_2): Likewise.
>> ---
>>  gcc/var-tracking.cc | 119 +++-
>>  1 file changed, 74 insertions(+), 45 deletions(-)
>> 
>> diff --git a/gcc/var-tracking.cc b/gcc/var-tracking.cc
>> index fae0c73e02f..9bc9d21e5ba 100644
>> --- a/gcc/var-tracking.cc
>> +++ b/gcc/var-tracking.cc
>> @@ -116,9 +116,17 @@
>>  #include "fibonacci_heap.h"
>>  #include "print-rtl.h"
>>  #include "function-abi.h"
>> +#include "mux-utils.h"
>>  
>>  typedef fibonacci_heap <long, basic_block_def> bb_heap_t;
>>  
>> +/* A declaration of a variable, or an RTL value being handled like a
>> +   declaration by pointer_mux.  */
>> +typedef pointer_mux<tree_node, rtx_def> decl_or_value;
>> +
>> +#define DECL_OR_VALUE_OR_DEFAULT(ptr) \
>> +  ((ptr) ? decl_or_value (ptr) : decl_or_value ())
>> +
>>  /* var-tracking.cc assumes that tree code with the same value as VALUE rtx 
>> code
>> has no chance to appear in REG_EXPR/MEM_EXPRs and isn't a decl.
>> Currently the value is the same as IDENTIFIER_NODE, which has such
>> @@ -196,15 +204,21 @@ struct micro_operation
>>  };
>>  
>>  
>> -/* A declaration of a variable, or an RTL value being handled like a
>> -   declaration.  */
>> -typedef void *decl_or_value;
>> -
>>  /* Return true if a decl_or_value DV is a DECL or NULL.  */
>>  static inline bool
>>  dv_is_decl_p (decl_or_value dv)
>>  {
>> -  return !dv || (int) TREE_CODE ((tree) dv) != (int) VALUE;
>> +  bool is_decl = !dv;
>> +
>> +  if (dv)
>> +{
>> +  if (dv.is_first ())
>> +is_decl = (int) TREE_CODE (dv.known_first ()) != (int) VALUE;
>> +  else if (!dv.is_first () && !dv.is_second ())
>> +is_decl = true;
>> +}
>> +
>> +  return is_decl;
>
> This all looks very confused, shouldn't it just be
>
>  return dv.is_first ();
>
> ?  All the keying on VALUE should no longer be necessary.
>
>>  }
>>  
>>  /* Return true if a decl_or_value is a VALUE rtl.  */
>> @@ -219,7 +233,7 @@ static inline tree
>>  dv_as_decl (decl_or_value dv)
>>  {
>>gcc_checking_assert (dv_is_decl_p (dv));
>> -  return (tree) dv;
>> +  return dv.is_first () ? dv.known_first () : NULL_TREE;
>
> and this should be
>
>  return dv.known_first ();
>
> ?  (knowing that ptr-mux will not mutate 'first' and thus preserves
> a nullptr there)
>
>>  }
>>  
>>  /* Return the value in the decl_or_value.  */
>> @@ -227,14 +241,20 @@ static inline rtx
>>  dv_as_value (decl_or_value dv)
>>  {
>>gcc_checking_assert (dv_is_value_p (dv));
>> -  return (rtx)dv;
>> +  return dv.is_second () ? dv.known_second () : NULL_RTX;;
>
> return 

Re: [PATCH 2/2] aarch64: Improve register allocation for lane instructions

2023-05-10 Thread Richard Sandiford via Gcc-patches
Richard Biener  writes:
> On Wed, May 10, 2023 at 12:05 AM Richard Sandiford via Gcc-patches
>  wrote:
>>
>> Andrew Pinski  writes:
>> >  On Tue, May 9, 2023 at 11:02 AM Richard Sandiford via Gcc-patches
>> >  wrote:
>> >>
>> >> REG_ALLOC_ORDER is much less important than it used to be, but it
>> >> is still used as a tie-breaker when multiple registers in a class
>> >> are equally good.
>> >
>> > This was tried before but had to be reverted. I have not looked into
>> > the history on why though.
>> > Anyways it was recorded as https://gcc.gnu.org/PR63521.
>>
>> It looks like that was about the traditional use of REG_ALLOC_ORDER:
>> putting call-clobbered registers first and defining
>> HONOR_REG_ALLOC_ORDER to make order trump IRA's usual costing.
>> We definitely don't want to do that, for the reasons described in the
>> patch and that Richard gave in comment 2.  (IRA already accounts for
>> call-preservedness.  It also accounts for conflicts with argument
>> registers, so putting those in reverse order shouldn't be necessary.)
>>
>> The aim here is different: to keep REG_ALLOC_ORDER as a pure tiebreaker,
>> but to avoid eating into restricted FP register classes when we don't
>> need to.
>
> I wonder if IRA/LRA could do this on its own - when a register belongs
> to multiple
> register classes and there's choice between two being in N and M register
> classes prefer the register that's in fewer register classes?  I mean,
> that's your
> logic - choose a register that leaves maximum freedom of use for the remaining
> registers?

Yeah, I wondered about that.  But the problem is that targets
tend to define classes for special purposes.  E.g. aarch64 has
TAILCALL_ADDR_REGS, which contains just x16 and x17.  But that class
is only used for the address in an indirect sibling call.  Something
that niche shouldn't affect the allocation of ordinary GPRs.

I also think it would be hard for a target-independent algorithm to do
a good job with the x86 register classes.

So in the end it seemed like some target-dependent knowledge was needed
to determine which classes are important enough and which aren't.

Thanks,
Richard


[PATCH 1/2] aarch64: Fix cut-&-pasto in aarch64-sve2-acle-asm.exp

2023-05-09 Thread Richard Sandiford via Gcc-patches
aarch64-sve2-acle-asm.exp tried to prevent --with-cpu/tune
from affecting the results, but it used sve_flags rather than
sve2_flags.  This was a silent failure when running the full
testsuite, but was a fatal error when running the harness
individually.

Tested on aarch64-linux-gnu, pushed to trunk.

Richard


gcc/testsuite/
* gcc.target/aarch64/sve2/acle/aarch64-sve2-acle-asm.exp: Use
sve2_flags instead of sve_flags.
---
 .../gcc.target/aarch64/sve2/acle/aarch64-sve2-acle-asm.exp  | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git 
a/gcc/testsuite/gcc.target/aarch64/sve2/acle/aarch64-sve2-acle-asm.exp 
b/gcc/testsuite/gcc.target/aarch64/sve2/acle/aarch64-sve2-acle-asm.exp
index 2e8d78904c5..0ad6463d832 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve2/acle/aarch64-sve2-acle-asm.exp
+++ b/gcc/testsuite/gcc.target/aarch64/sve2/acle/aarch64-sve2-acle-asm.exp
@@ -39,7 +39,7 @@ if { [check_effective_target_aarch64_sve2] } {
 
 # Turn off any codegen tweaks by default that may affect expected assembly.
 # Tests relying on those should turn them on explicitly.
-set sve_flags "$sve_flags -mtune=generic -moverride=tune=none"
+set sve2_flags "$sve2_flags -mtune=generic -moverride=tune=none"
 
 lappend extra_flags "-fno-ipa-icf"
 
-- 
2.25.1



Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-09 Thread Richard Sandiford via Gcc-patches
钟居哲  writes:
> Hi, Richards.  I would like to give some more information about this patch
> to make it easier for you to review.
>
> Currently, I see 3 situations that we need to handle for the loop
> control IV in auto-vectorization:
> 1. Single-rgroup loop control (ncopies == 1 && vec_num == 1, so
> loop_len.length () == 1 or rgc->length () == 1).
> 2. Multiple rgroups for SLP.
> 3. Multiple rgroups for non-SLP, which Richard Sandiford pointed out
> previously (for example, VEC_PACK_TRUNC).
>
> Before talking about this patch, let me talk about the RVV LLVM
> implementation that inspired me to send it:
> https://reviews.llvm.org/D99750 
>
> In the LLVM implementation, they add a middle-end IR intrinsic called
> "get_vector_length", which has exactly the same functionality as
> "select_vl" in this patch (I called it "while_len" previously; I have
> now renamed it to "select_vl" following Richard's suggestion).
>
> The LLVM implementation only lets "get_vector_length" calculate the
> number of elements in a single-rgroup loop.
> For multiple rgroups, let's take a look at this example:
> https://godbolt.org/z/3GP78efTY 
>
> void
> foo1 (short *__restrict f, int *__restrict d, int n)
> {
>   for (int i = 0; i < n; ++i)
> {
>   f[i * 2 + 0] = 1;
>   f[i * 2 + 1] = 2;
>   d[i] = 3;
> }
> } 
>
> RISC-V Clang:
> foo1:                                   # @foo1
> # %bb.0:
>         blez    a2, .LBB0_8
> # %bb.1:
>         li      a3, 16
>         bgeu    a2, a3, .LBB0_3
> # %bb.2:
>         li      a6, 0
>         j       .LBB0_6
> .LBB0_3:
>         andi    a6, a2, -16
>         lui     a3, 32
>         addiw   a3, a3, 1
>         vsetivli        zero, 8, e32, m2, ta, ma
>         vmv.v.x v8, a3
>         vmv.v.i v10, 3
>         mv      a4, a6
>         mv      a5, a1
>         mv      a3, a0
> .LBB0_4:                                # =>This Inner Loop Header: Depth=1
>         addi    a7, a5, 32
>         addi    t0, a3, 32
>         vsetivli        zero, 16, e16, m2, ta, ma
>         vse16.v v8, (a3)
>         vse16.v v8, (t0)
>         vsetivli        zero, 8, e32, m2, ta, ma
>         vse32.v v10, (a5)
>         vse32.v v10, (a7)
>         addi    a3, a3, 64
>         addi    a4, a4, -16
>         addi    a5, a5, 64
>         bnez    a4, .LBB0_4
> # %bb.5:
>         beq     a6, a2, .LBB0_8
> .LBB0_6:
>         slli    a3, a6, 2
>         add     a0, a0, a3
>         addi    a0, a0, 2
>         add     a1, a1, a3
>         sub     a2, a2, a6
>         li      a3, 1
>         li      a4, 2
>         li      a5, 3
> .LBB0_7:                                # =>This Inner Loop Header: Depth=1
>         sh      a3, -2(a0)
>         sh      a4, 0(a0)
>         sw      a5, 0(a1)
>         addi    a0, a0, 4
>         addi    a2, a2, -1
>         addi    a1, a1, 4
>         bnez    a2, .LBB0_7
> .LBB0_8:
>         ret
>
> ARM GCC:
> foo1:
>         cmp     w2, 0
>         ble     .L1
>         addvl   x4, x0, #1
>         mov     x3, 0
>         cntb    x7
>         cntb    x6, all, mul #2
>         sbfiz   x2, x2, 1, 32
>         ptrue   p0.b, all
>         mov     x5, x2
>         adrp    x8, .LC0
>         uqdech  x5
>         add     x8, x8, :lo12:.LC0
>         whilelo p1.h, xzr, x5
>         ld1rw   z1.s, p0/z, [x8]
>         mov     z0.s, #3
>         whilelo p0.h, xzr, x2
> .L3:
>         st1h    z1.h, p0, [x0, x3, lsl 1]
>         st1h    z1.h, p1, [x4, x3, lsl 1]
>         st1w    z0.s, p1, [x1, #1, mul vl]
>         add     x3, x3, x7
>         whilelo p1.h, x3, x5
>         st1w    z0.s, p0, [x1]
>         add     x1, x1, x6
>         whilelo p0.h, x3, x2
>         b.any   .L3
> .L1:
>         ret
>
> It's very obvious that ARM GCC has much better codegen here, since RVV
> LLVM just uses a SIMD style to handle multi-rgroup SLP auto-vectorization.
>
> Well, I totally agree that we should add length support in
> auto-vectorization not only for a single rgroup but also for multiple
> rgroups.  However, when I tried to implement and test multiple-rgroup
> lengths for both SLP and non-SLP, it turned out to be hard to use
> select_vl: the "select_vl" pattern allows a flexible, non-VF length
> (length <= min (remain, VF)) in any iteration, so it needs many more
> operations to adjust the loop control IV and the data reference
> address IVs than just using "MIN_EXPR".
>
> So for Case 2 && Case 3, I just use MIN_EXPR directly instead of
> SELECT_VL, after several rounds of internal testing.

Could you go into more details about this?  I imagined that for case 3,
there would be a single SELECT_VL that decides how many scalar iterations
will be handled by the current vector iteration, then we would "expand"
the result (using MIN_EXPRs) to the multi-control cases.
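
To make that concrete, a rough C-level sketch of the expansion I have
in mind (illustrative names only, not real gimple):

  /* One select_vl-style decision per iteration, expanded with MIN-style
     expressions into the lengths for a two-control rgroup.  NPER is the
     element capacity of one vector of that rgroup.  */
  static inline void
  expand_two_controls (unsigned len, unsigned factor, unsigned nper,
                       unsigned *len0, unsigned *len1)
  {
    unsigned total = len * factor;          /* scalars * per-scalar factor */
    *len0 = total < nper ? total : nper;    /* MIN_EXPR */
    *len1 = total - *len0;                  /* whatever is left over */
  }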

In a sense that replicates what the SVE code above is doing.  But for SVE,
it's possible to "optimise" the unpacking of a WHILELO result due to the
lack of implementation-defined behaviour.  So conceptually we have a
single WHILELO that is 

Re: [PATCH] machine_mode type size: Extend enum size from 8-bit to 16-bit

2023-05-09 Thread Richard Sandiford via Gcc-patches
"Li, Pan2"  writes:
> After a bits patch like the one below:
>
> rtx_def code 16 => 8 bits.
> rtx_def mode 8 => 16 bits.
> tree_base code unchanged.
>
> The structure layout of rtx_def and tree_base will then be something
> like the following.  As I understand it, the lower 8 bits of tree_base
> will be inspected when 'dv' is a tree but is cast to rtx.
>
> tree_base               rtx_def
> code: 16                code: 8
> side_effects_flag: 1    mode: 16

I think we should try hard to avoid that though.  The 16-bit value should
be aligned to 16 bits if at all possible.  decl_or_value doesn't seem
like something that should be dictating our approach here.

Perhaps we can use pointer_mux for decl_or_value instead?  pointer_mux is
intended to be a standards-compliant (hah!) way of switching between two
pointer types in a reasonably efficient way.
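
For var-tracking, an untested sketch of what I mean:

  #include "mux-utils.h"

  /* A decl, or an RTL VALUE handled like a decl, as a tagged pointer
     rather than a void *.  */
  typedef pointer_mux<tree_node, rtx_def> decl_or_value;

  static inline bool
  dv_is_decl_p (decl_or_value dv)
  {
    return dv.is_first ();
  }

  static inline tree
  dv_as_decl (decl_or_value dv)
  {
    gcc_checking_assert (dv_is_decl_p (dv));
    return dv.known_first ();
  }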

Thanks,
Richard

> constant_flag: 1
> addressable_flag: 1
> volatile_flag: 1
> readonly_flag: 1
> asm_written_flag: 1
> nowarning_flag: 1
> visited: 1
> used_flag: 1
> nothrow_flag: 1
> static_flag: 1
> public_flag: 1
> private_flag: 1
> protected_flag: 1
> deprecated_flag: 1
> default_def_flag: 1
>
> I have tried a similar approach (as below) to the one you mentioned,
> i.e. shrinking tree_code so that it keeps a 1:1 overlap with rtx_code,
> and completed a memory-allocated-bytes test in another email.
>
> rtx_def code 16 => 12 bits.
> rtx_def mode 8 => 12 bits.
> tree_base code 16 => 12 bits.
>
> Pan
>
> -Original Message-
> From: Richard Biener  
> Sent: Monday, May 8, 2023 3:38 PM
> To: Li, Pan2 
> Cc: Jeff Law ; Kito Cheng ; 
> juzhe.zh...@rivai.ai; richard.sandiford ; 
> gcc-patches ; palmer ; jakub 
> 
> Subject: RE: [PATCH] machine_mode type size: Extend enum size from 8-bit to 
> 16-bit
>
> On Mon, 8 May 2023, Li, Pan2 wrote:
>
>>   return !dv || (int) GET_CODE ((rtx) dv) != (int) VALUE; }
>> is able to fix this ICE after mode bits change.
>
> Can you check which bits this will inspect when 'dv' is a tree after your 
> patch?  VALUE is 1 and would map to IDENTIFIER_NODE on the tree side when 
> there was a 1:1 overlap.
>
> I think for all cases but struct loc_exp_dep we could find a bit to record 
> whether we deal with a VALUE or a decl, but for loc_exp_dep it's going to be 
> difficult (unless we start to take bits from pointer representations).
>
> That said, I agree with Jeff that the code is ugly, but a simplistic 
> conversion isn't what we want.
>
> An alternative "solution" might be to also shrink tree_code when we shrink 
> rtx_code and keep the 1:1 overlap.
>
> Richard.
>
>> I will re-run the memory-allocated-bytes test with the changes below
>> for X86.
>> 
>> rtx_def code 16 => 8 bits.
>> rtx_def mode 8 => 16 bits.
>> tree_base code unchanged.
>> 
>> Pan
>> 
>> -Original Message-
>> From: Li, Pan2
>> Sent: Monday, May 8, 2023 2:42 PM
>> To: Richard Biener ; Jeff Law 
>> 
>> Cc: Kito Cheng ; juzhe.zh...@rivai.ai; 
>> richard.sandiford ; gcc-patches 
>> ; palmer ; jakub 
>> 
>> Subject: RE: [PATCH] machine_mode type size: Extend enum size from 
>> 8-bit to 16-bit
>> 
>> Oops.  Actually I am already patching a version like the one you mentioned,
>> with explicit storage allocation.  Thank you Richard, I will try your
>> suggestion and keep you posted.
>> 
>> Pan
>> 
>> -Original Message-
>> From: Richard Biener 
>> Sent: Monday, May 8, 2023 2:30 PM
>> To: Jeff Law 
>> Cc: Li, Pan2 ; Kito Cheng ; 
>> juzhe.zh...@rivai.ai; richard.sandiford ; 
>> gcc-patches ; palmer ; 
>> jakub 
>> Subject: Re: [PATCH] machine_mode type size: Extend enum size from 
>> 8-bit to 16-bit
>> 
>> On Sun, 7 May 2023, Jeff Law wrote:
>> 
>> > 
>> > 
>> > On 5/6/23 19:55, Li, Pan2 wrote:
>> > > It looks like we cannot simply swap the code and mode in rtx_def;
>> > > the code may have to occupy the same bits as the tree_code in
>> > > tree_base, or we will hit an ICE like the one below.
>> > > 
>> > > rtx_def code 16 => 8 bits.
>> > > rtx_def mode 8 => 16 bits.
>> > > 
>> > > static inline decl_or_value
>> > > dv_from_value (rtx value)
>> > > {
>> > >decl_or_value dv;
>> > >dv = value;
>> > >gcc_checking_assert (dv_is_value_p (dv));  <=  ICE
>> > >return dv;
>> > Ugh.  We really just need to fix this code.  It assumes particular 
>> > structure layouts and that's just wrong/dumb.
>> 
>> Well, it's a neat trick ... we just need to adjust it to
>> 
>> static inline bool
>> dv_is_decl_p (decl_or_value dv)
>> {
>>   return !dv || (int) GET_CODE ((rtx) dv) != (int) VALUE; }
>> 
>> I think (and hope) that for the 'decl' case the bits inspected are never
>> 'VALUE'.  Of course the above stinks from a TBAA perspective ...
>> 
>> Any "real" fix would require allocating storage for a discriminator and thus 
>> hurt the resource constrained var-tracking a lot.
>> 
>> Richard.
>> 
>
> --
> Richard Biener 
> SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg, 
> Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman; HRB 
> 36809 (AG Nuernberg)


[PATCH 6/6] aarch64: Avoid hard-coding specific register allocations

2023-05-09 Thread Richard Sandiford via Gcc-patches
Some tests hard-coded specific allocations for temporary registers,
whereas the RA should be free to pick anything that doesn't force
unnecessary moves or spills.

gcc/testsuite/
* gcc.target/aarch64/asimd-mul-to-shl-sub.c: Allow any register
allocation for temporary results, rather than requiring specific
registers.
* gcc.target/aarch64/auto-init-padding-1.c: Likewise.
* gcc.target/aarch64/auto-init-padding-2.c: Likewise.
* gcc.target/aarch64/auto-init-padding-3.c: Likewise.
* gcc.target/aarch64/auto-init-padding-4.c: Likewise.
* gcc.target/aarch64/auto-init-padding-9.c: Likewise.
* gcc.target/aarch64/memset-corner-cases.c: Likewise.
* gcc.target/aarch64/memset-q-reg.c: Likewise.
* gcc.target/aarch64/simd/vaddlv_1.c: Likewise.
* gcc.target/aarch64/sve-neon-modes_1.c: Likewise.
* gcc.target/aarch64/sve-neon-modes_3.c: Likewise.
* gcc.target/aarch64/sve/load_scalar_offset_1.c: Likewise.
* gcc.target/aarch64/sve/pcs/return_6_256.c: Likewise.
* gcc.target/aarch64/sve/pcs/return_6_512.c: Likewise.
* gcc.target/aarch64/sve/pcs/return_6_1024.c: Likewise.
* gcc.target/aarch64/sve/pcs/return_6_2048.c: Likewise.
* gcc.target/aarch64/sve/pr89007-1.c: Likewise.
* gcc.target/aarch64/sve/pr89007-2.c: Likewise.
* gcc.target/aarch64/sve/store_scalar_offset_1.c: Likewise.
* gcc.target/aarch64/vadd_reduc-1.c: Likewise.
* gcc.target/aarch64/vadd_reduc-2.c: Likewise.
* gcc.target/aarch64/sve/pcs/args_5_be_bf16.c: Allow the temporary
predicate register to be any of p4-p7, rather than requiring p4
specifically.
* gcc.target/aarch64/sve/pcs/args_5_be_f16.c: Likewise.
* gcc.target/aarch64/sve/pcs/args_5_be_f32.c: Likewise.
* gcc.target/aarch64/sve/pcs/args_5_be_f64.c: Likewise.
* gcc.target/aarch64/sve/pcs/args_5_be_s8.c: Likewise.
* gcc.target/aarch64/sve/pcs/args_5_be_s16.c: Likewise.
* gcc.target/aarch64/sve/pcs/args_5_be_s32.c: Likewise.
* gcc.target/aarch64/sve/pcs/args_5_be_s64.c: Likewise.
* gcc.target/aarch64/sve/pcs/args_5_be_u8.c: Likewise.
* gcc.target/aarch64/sve/pcs/args_5_be_u16.c: Likewise.
* gcc.target/aarch64/sve/pcs/args_5_be_u32.c: Likewise.
* gcc.target/aarch64/sve/pcs/args_5_be_u64.c: Likewise.
---
 .../gcc.target/aarch64/asimd-mul-to-shl-sub.c |  4 +-
 .../gcc.target/aarch64/auto-init-padding-1.c  |  2 +-
 .../gcc.target/aarch64/auto-init-padding-2.c  |  3 +-
 .../gcc.target/aarch64/auto-init-padding-3.c  |  3 +-
 .../gcc.target/aarch64/auto-init-padding-4.c  |  3 +-
 .../gcc.target/aarch64/auto-init-padding-9.c  |  2 +-
 .../gcc.target/aarch64/memset-corner-cases.c  | 22 -
 .../gcc.target/aarch64/memset-q-reg.c | 22 -
 .../gcc.target/aarch64/simd/vaddlv_1.c| 24 +-
 .../gcc.target/aarch64/sve-neon-modes_1.c |  4 +-
 .../gcc.target/aarch64/sve-neon-modes_3.c | 16 +++
 .../aarch64/sve/load_scalar_offset_1.c|  8 ++--
 .../aarch64/sve/pcs/args_5_be_bf16.c  | 18 +++
 .../aarch64/sve/pcs/args_5_be_f16.c   | 18 +++
 .../aarch64/sve/pcs/args_5_be_f32.c   | 18 +++
 .../aarch64/sve/pcs/args_5_be_f64.c   | 18 +++
 .../aarch64/sve/pcs/args_5_be_s16.c   | 18 +++
 .../aarch64/sve/pcs/args_5_be_s32.c   | 18 +++
 .../aarch64/sve/pcs/args_5_be_s64.c   | 18 +++
 .../gcc.target/aarch64/sve/pcs/args_5_be_s8.c | 18 +++
 .../aarch64/sve/pcs/args_5_be_u16.c   | 18 +++
 .../aarch64/sve/pcs/args_5_be_u32.c   | 18 +++
 .../aarch64/sve/pcs/args_5_be_u64.c   | 18 +++
 .../gcc.target/aarch64/sve/pcs/args_5_be_u8.c | 18 +++
 .../aarch64/sve/pcs/return_6_1024.c   | 48 +--
 .../aarch64/sve/pcs/return_6_2048.c   | 48 +--
 .../gcc.target/aarch64/sve/pcs/return_6_256.c | 48 +--
 .../gcc.target/aarch64/sve/pcs/return_6_512.c | 48 +--
 .../gcc.target/aarch64/sve/pr89007-1.c|  2 +-
 .../gcc.target/aarch64/sve/pr89007-2.c|  2 +-
 .../aarch64/sve/store_scalar_offset_1.c   |  8 ++--
 .../gcc.target/aarch64/vadd_reduc-1.c |  4 +-
 .../gcc.target/aarch64/vadd_reduc-2.c |  4 +-
 33 files changed, 269 insertions(+), 272 deletions(-)

diff --git a/gcc/testsuite/gcc.target/aarch64/asimd-mul-to-shl-sub.c 
b/gcc/testsuite/gcc.target/aarch64/asimd-mul-to-shl-sub.c
index d7c5e5f341b..28dbe81a37d 100644
--- a/gcc/testsuite/gcc.target/aarch64/asimd-mul-to-shl-sub.c
+++ b/gcc/testsuite/gcc.target/aarch64/asimd-mul-to-shl-sub.c
@@ -4,8 +4,8 @@
 
 /*
 **foo:
-** shl v1.4s, v0.4s, 16
-** sub v0.4s, v1.4s, v0.4s
+** shl (v[0-9]+.4s), v0.4s, 16
+** sub v0.4s, \1, v0.4s
 ** ret
 */
 #include 
diff --git 

[PATCH 5/6] aarch64: Relax FP/vector register matches

2023-05-09 Thread Richard Sandiford via Gcc-patches
There were many tests that used [0-9] to match an FP or vector register,
but that should allow any of 0-31 instead.

asm-x-constraint-1.c required s0-s7, but that's the range for "y"
rather than "x".  "x" allows s0-s15.

sve/pcs/return_9.c required z2-z7 (the initial set of available
call-clobbered registers), but z24-z31 are OK too.

gcc/testsuite/
* gcc.target/aarch64/advsimd-intrinsics/vshl-opt-6.c: Allow any
FP/vector register, not just register 0-9.
* gcc.target/aarch64/fmul_fcvt_2.c: Likewise.
* gcc.target/aarch64/ldp_stp_8.c: Likewise.
* gcc.target/aarch64/ldp_stp_17.c: Likewise.
* gcc.target/aarch64/ldp_stp_21.c: Likewise.
* gcc.target/aarch64/simd/vpaddd_f64.c: Likewise.
* gcc.target/aarch64/simd/vpaddd_s64.c: Likewise.
* gcc.target/aarch64/simd/vpaddd_u64.c: Likewise.
* gcc.target/aarch64/sve/adr_1.c: Likewise.
* gcc.target/aarch64/sve/adr_2.c: Likewise.
* gcc.target/aarch64/sve/adr_3.c: Likewise.
* gcc.target/aarch64/sve/adr_4.c: Likewise.
* gcc.target/aarch64/sve/adr_5.c: Likewise.
* gcc.target/aarch64/sve/extract_1.c: Likewise.
* gcc.target/aarch64/sve/extract_2.c: Likewise.
* gcc.target/aarch64/sve/extract_3.c: Likewise.
* gcc.target/aarch64/sve/extract_4.c: Likewise.
* gcc.target/aarch64/sve/slp_4.c: Likewise.
* gcc.target/aarch64/sve/spill_3.c: Likewise.
* gcc.target/aarch64/vfp-1.c: Likewise.
* gcc.target/aarch64/asm-x-constraint-1.c: Allow s0-s15, not just
s0-s7.
* gcc.target/aarch64/sve/pcs/return_9.c: Allow z24-z31 as well as
z2-z7.
---
 .../aarch64/advsimd-intrinsics/vshl-opt-6.c   |  2 +-
 .../gcc.target/aarch64/asm-x-constraint-1.c   |  4 ++--
 .../gcc.target/aarch64/fmul_fcvt_2.c  |  6 ++---
 gcc/testsuite/gcc.target/aarch64/ldp_stp_17.c |  2 +-
 gcc/testsuite/gcc.target/aarch64/ldp_stp_21.c |  2 +-
 gcc/testsuite/gcc.target/aarch64/ldp_stp_8.c  |  2 +-
 .../gcc.target/aarch64/simd/vpaddd_f64.c  |  2 +-
 .../gcc.target/aarch64/simd/vpaddd_s64.c  |  2 +-
 .../gcc.target/aarch64/simd/vpaddd_u64.c  |  2 +-
 gcc/testsuite/gcc.target/aarch64/sve/adr_1.c  | 24 +--
 gcc/testsuite/gcc.target/aarch64/sve/adr_2.c  | 24 +--
 gcc/testsuite/gcc.target/aarch64/sve/adr_3.c  | 24 +--
 gcc/testsuite/gcc.target/aarch64/sve/adr_4.c  |  6 ++---
 gcc/testsuite/gcc.target/aarch64/sve/adr_5.c  | 16 ++---
 .../gcc.target/aarch64/sve/extract_1.c|  4 ++--
 .../gcc.target/aarch64/sve/extract_2.c|  4 ++--
 .../gcc.target/aarch64/sve/extract_3.c|  4 ++--
 .../gcc.target/aarch64/sve/extract_4.c|  4 ++--
 .../gcc.target/aarch64/sve/pcs/return_9.c | 16 ++---
 gcc/testsuite/gcc.target/aarch64/sve/slp_4.c  |  2 +-
 .../gcc.target/aarch64/sve/spill_3.c  |  8 +++
 gcc/testsuite/gcc.target/aarch64/vfp-1.c  |  4 ++--
 22 files changed, 82 insertions(+), 82 deletions(-)

diff --git a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vshl-opt-6.c 
b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vshl-opt-6.c
index 442e3163237..3eff71b53fa 100644
--- a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vshl-opt-6.c
+++ b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vshl-opt-6.c
@@ -7,4 +7,4 @@ int32x4_t foo (int32x4_t x) {
   return vshlq_s32(x, vdupq_n_s32(256));
 }
 
-/* { dg-final { scan-assembler-times {\tsshl\t.+, v[0-9].4s} 1 } } */
+/* { dg-final { scan-assembler-times {\tsshl\t.+, v[0-9]+.4s} 1 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/asm-x-constraint-1.c 
b/gcc/testsuite/gcc.target/aarch64/asm-x-constraint-1.c
index a71043be504..ecfb01d247e 100644
--- a/gcc/testsuite/gcc.target/aarch64/asm-x-constraint-1.c
+++ b/gcc/testsuite/gcc.target/aarch64/asm-x-constraint-1.c
@@ -28,7 +28,7 @@ f (void)
 /* { dg-final { scan-assembler {\t// s7 out: s7\n.*[/]/ s7 in: s7\n} } } */
 /* { dg-final { scan-assembler {\t// s8 out: s8\n.*[/]/ s8 in: s8\n} } } */
 /* { dg-final { scan-assembler {\t// s15 out: s15\n.*[/]/ s15 in: s15\n} } } */
-/* { dg-final { scan-assembler {\t// s16 out: s16\n.*\tfmov\t(s[0-7]), 
s16\n.*[/]/ s16 in: \1\n} } } */
-/* { dg-final { scan-assembler {\t// s31 out: s31\n.*\tfmov\t(s[0-7]), 
s31\n.*[/]/ s31 in: \1\n} } } */
+/* { dg-final { scan-assembler {\t// s16 out: s16\n.*\tfmov\t(s[0-9]|s1[0-5]), 
s16\n.*[/]/ s16 in: \1\n} } } */
+/* { dg-final { scan-assembler {\t// s31 out: s31\n.*\tfmov\t(s[0-9]|s1[0-5]), 
s31\n.*[/]/ s31 in: \1\n} } } */
 /* { dg-final { scan-assembler-not {\t// s16 in: s16\n} } } */
 /* { dg-final { scan-assembler-not {\t// s31 in: s31\n} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/fmul_fcvt_2.c 
b/gcc/testsuite/gcc.target/aarch64/fmul_fcvt_2.c
index 8f0240bf5f7..6cb269cf7ae 100644
--- a/gcc/testsuite/gcc.target/aarch64/fmul_fcvt_2.c
+++ b/gcc/testsuite/gcc.target/aarch64/fmul_fcvt_2.c
@@ -64,6 +64,6 @@ main (void)

[PATCH 2/6] aarch64: Allow moves after tied-register intrinsics

2023-05-09 Thread Richard Sandiford via Gcc-patches
Some ACLE intrinsics map to instructions that tie the output
operand to an input operand.  If all the operands are allocated
to different registers, and if MOVPRFX can't be used, we will need
a move either before the instruction or after it.  Many tests only
matched the "before" case; this patch makes them accept the "after"
case too.
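
For example, a tied-register test that previously required the move to
come first now also accepts the move-after form.  Schematically (this is
just the shape of the change, following the aesd_u8.c pattern quoted in
patch 1/6, not a literal hunk from this patch):

** (
**	mov	z0\.d, z1\.d
**	<insn>	z0\.b, z0\.b, z2\.b
** |
**	<insn>	z1\.b, z1\.b, z2\.b
**	mov	z0\.d, z1\.d
** )
**	ret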

gcc/testsuite/
* gcc.target/aarch64/advsimd-intrinsics/bfcvtnq2-untied.c: Allow
moves to occur after the intrinsic instruction, rather than requiring
them to happen before.
* gcc.target/aarch64/advsimd-intrinsics/bfdot-1.c: Likewise.
* gcc.target/aarch64/advsimd-intrinsics/vdot-3-1.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/adda_f16.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/adda_f32.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/adda_f64.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/brka_b.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/brkb_b.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/brkn_b.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/clasta_bf16.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/clasta_f16.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/clasta_f32.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/clasta_f64.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/clastb_bf16.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/clastb_f16.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/clastb_f32.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/clastb_f64.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/pfirst_b.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/pnext_b16.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/pnext_b32.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/pnext_b64.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/pnext_b8.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/sli_s16.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/sli_s32.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/sli_s64.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/sli_s8.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/sli_u16.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/sli_u32.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/sli_u64.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/sli_u8.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/sri_s16.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/sri_s32.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/sri_s64.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/sri_s8.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/sri_u16.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/sri_u32.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/sri_u64.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/sri_u8.c: Likewise.
---
 .../aarch64/advsimd-intrinsics/bfcvtnq2-untied.c  |  5 +
 .../aarch64/advsimd-intrinsics/bfdot-1.c  | 10 ++
 .../aarch64/advsimd-intrinsics/vdot-3-1.c | 10 ++
 .../gcc.target/aarch64/sve/acle/asm/adda_f16.c|  5 +
 .../gcc.target/aarch64/sve/acle/asm/adda_f32.c|  5 +
 .../gcc.target/aarch64/sve/acle/asm/adda_f64.c|  5 +
 .../gcc.target/aarch64/sve/acle/asm/brka_b.c  |  5 +
 .../gcc.target/aarch64/sve/acle/asm/brkb_b.c  |  5 +
 .../gcc.target/aarch64/sve/acle/asm/brkn_b.c  |  5 +
 .../gcc.target/aarch64/sve/acle/asm/clasta_bf16.c |  5 +
 .../gcc.target/aarch64/sve/acle/asm/clasta_f16.c  |  5 +
 .../gcc.target/aarch64/sve/acle/asm/clasta_f32.c  |  5 +
 .../gcc.target/aarch64/sve/acle/asm/clasta_f64.c  |  5 +
 .../gcc.target/aarch64/sve/acle/asm/clastb_bf16.c |  5 +
 .../gcc.target/aarch64/sve/acle/asm/clastb_f16.c  |  5 +
 .../gcc.target/aarch64/sve/acle/asm/clastb_f32.c  |  5 +
 .../gcc.target/aarch64/sve/acle/asm/clastb_f64.c  |  5 +
 .../gcc.target/aarch64/sve/acle/asm/pfirst_b.c|  5 +
 .../gcc.target/aarch64/sve/acle/asm/pnext_b16.c   |  5 +
 .../gcc.target/aarch64/sve/acle/asm/pnext_b32.c   |  5 +
 .../gcc.target/aarch64/sve/acle/asm/pnext_b64.c   |  5 +
 .../gcc.target/aarch64/sve/acle/asm/pnext_b8.c|  5 +
 .../gcc.target/aarch64/sve2/acle/asm/sli_s16.c| 15 +++
 .../gcc.target/aarch64/sve2/acle/asm/sli_s32.c| 15 +++
 .../gcc.target/aarch64/sve2/acle/asm/sli_s64.c| 15 +++
 .../gcc.target/aarch64/sve2/acle/asm/sli_s8.c | 15 +++
 .../gcc.target/aarch64/sve2/acle/asm/sli_u16.c| 15 +++
 .../gcc.target/aarch64/sve2/acle/asm/sli_u32.c| 15 +++
 .../gcc.target/aarch64/sve2/acle/asm/sli_u64.c| 15 +++
 .../gcc.target/aarch64/sve2/acle/asm/sli_u8.c | 15 +++
 .../gcc.target/aarch64/sve2/acle/asm/sri_s16.c| 15 +++
 .../gcc.target/aarch64/sve2/acle/asm/sri_s32.c 

[PATCH 4/6] aarch64: Relax predicate register matches

2023-05-09 Thread Richard Sandiford via Gcc-patches
Most governing predicate operands require p0-p7, but some
instructions also allow p8-p15.  Non-gp uses of predicates
often also allow all of p0-p15.

This patch fixes up cases where we required p0-p7 unnecessarily.
In some cases we match the definition (typically a comparison,
PFALSE or PTRUE), sometimes we match the use (like a logic
instruction, MOV or SEL), and sometimes we match both.

gcc/testsuite/
* g++.target/aarch64/sve/vcond_1.C: Allow any predicate
register for the temporary results, not just p0-p7.
* gcc.target/aarch64/sve/acle/asm/dupq_b8.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/dupq_b16.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/dupq_b32.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/dupq_b64.c: Likewise.
* gcc.target/aarch64/sve/acle/general/whilele_5.c: Likewise.
* gcc.target/aarch64/sve/acle/general/whilele_6.c: Likewise.
* gcc.target/aarch64/sve/acle/general/whilele_7.c: Likewise.
* gcc.target/aarch64/sve/acle/general/whilele_9.c: Likewise.
* gcc.target/aarch64/sve/acle/general/whilele_10.c: Likewise.
* gcc.target/aarch64/sve/acle/general/whilelt_1.c: Likewise.
* gcc.target/aarch64/sve/acle/general/whilelt_2.c: Likewise.
* gcc.target/aarch64/sve/acle/general/whilelt_3.c: Likewise.
* gcc.target/aarch64/sve/pcs/varargs_1.c: Likewise.
* gcc.target/aarch64/sve/peel_ind_2.c: Likewise.
* gcc.target/aarch64/sve/mask_gather_load_6.c: Likewise.
* gcc.target/aarch64/sve/vcond_2.c: Likewise.
* gcc.target/aarch64/sve/vcond_3.c: Likewise.
* gcc.target/aarch64/sve/vcond_7.c: Likewise.
* gcc.target/aarch64/sve/vcond_18.c: Likewise.
* gcc.target/aarch64/sve/vcond_19.c: Likewise.
* gcc.target/aarch64/sve/vcond_20.c: Likewise.
---
 .../g++.target/aarch64/sve/vcond_1.C  | 258 +-
 .../aarch64/sve/acle/asm/dupq_b16.c   |  86 +++---
 .../aarch64/sve/acle/asm/dupq_b32.c   |  48 ++--
 .../aarch64/sve/acle/asm/dupq_b64.c   |  16 +-
 .../gcc.target/aarch64/sve/acle/asm/dupq_b8.c | 136 -
 .../aarch64/sve/acle/general/whilele_10.c |   2 +-
 .../aarch64/sve/acle/general/whilele_5.c  |  10 +-
 .../aarch64/sve/acle/general/whilele_6.c  |   2 +-
 .../aarch64/sve/acle/general/whilele_7.c  |   6 +-
 .../aarch64/sve/acle/general/whilele_9.c  |   6 +-
 .../aarch64/sve/acle/general/whilelt_1.c  |  10 +-
 .../aarch64/sve/acle/general/whilelt_2.c  |   2 +-
 .../aarch64/sve/acle/general/whilelt_3.c  |   6 +-
 .../aarch64/sve/mask_gather_load_6.c  |   4 +-
 .../gcc.target/aarch64/sve/pcs/varargs_1.c|   8 +-
 .../gcc.target/aarch64/sve/peel_ind_2.c   |   2 +-
 .../gcc.target/aarch64/sve/vcond_18.c |  14 +-
 .../gcc.target/aarch64/sve/vcond_19.c |  34 +--
 .../gcc.target/aarch64/sve/vcond_2.c  | 248 -
 .../gcc.target/aarch64/sve/vcond_20.c |  34 +--
 .../gcc.target/aarch64/sve/vcond_3.c  |  26 +-
 .../gcc.target/aarch64/sve/vcond_7.c  | 198 +++---
 22 files changed, 578 insertions(+), 578 deletions(-)

diff --git a/gcc/testsuite/g++.target/aarch64/sve/vcond_1.C 
b/gcc/testsuite/g++.target/aarch64/sve/vcond_1.C
index da52c4c1359..3e7de9b455a 100644
--- a/gcc/testsuite/g++.target/aarch64/sve/vcond_1.C
+++ b/gcc/testsuite/g++.target/aarch64/sve/vcond_1.C
@@ -112,132 +112,132 @@ TYPE vcond_imm_##TYPE##_##SUFFIX (TYPE x, TYPE y, TYPE 
a)   \
 TEST_VAR_ALL (DEF_VCOND_VAR)
 TEST_IMM_ALL (DEF_VCOND_IMM)
 
-/* { dg-final { scan-assembler {\tsel\tz[0-9]+\.b, p[0-7], z[0-9]+\.b, 
z[0-9]+\.b\n} } } */
-/* { dg-final { scan-assembler {\tsel\tz[0-9]+\.h, p[0-7], z[0-9]+\.h, 
z[0-9]+\.h\n} } } */
-/* { dg-final { scan-assembler {\tsel\tz[0-9]+\.s, p[0-7], z[0-9]+\.s, 
z[0-9]+\.s\n} } } */
-/* { dg-final { scan-assembler {\tsel\tz[0-9]+\.d, p[0-7], z[0-9]+\.d, 
z[0-9]+\.d\n} } } */
-
-/* { dg-final { scan-assembler {\tcmpgt\tp[0-7]\.b, p[0-7]/z, z[0-9]+\.b, 
z[0-9]+\.b\n} } } */
-/* { dg-final { scan-assembler {\tcmpgt\tp[0-7]\.h, p[0-7]/z, z[0-9]+\.h, 
z[0-9]+\.h\n} } } */
-/* { dg-final { scan-assembler {\tcmpgt\tp[0-7]\.s, p[0-7]/z, z[0-9]+\.s, 
z[0-9]+\.s\n} } } */
-/* { dg-final { scan-assembler {\tcmpgt\tp[0-7]\.d, p[0-7]/z, z[0-9]+\.d, 
z[0-9]+\.d\n} } } */
-
-/* { dg-final { scan-assembler {\tcmphi\tp[0-7]\.b, p[0-7]/z, z[0-9]+\.b, 
z[0-9]+\.b\n} } } */
-/* { dg-final { scan-assembler {\tcmphi\tp[0-7]\.h, p[0-7]/z, z[0-9]+\.h, 
z[0-9]+\.h\n} } } */
-/* { dg-final { scan-assembler {\tcmphi\tp[0-7]\.s, p[0-7]/z, z[0-9]+\.s, 
z[0-9]+\.s\n} } } */
-/* { dg-final { scan-assembler {\tcmphi\tp[0-7]\.d, p[0-7]/z, z[0-9]+\.d, 
z[0-9]+\.d\n} } } */
-
-/* { dg-final { scan-assembler {\tcmphs\tp[0-7]\.b, p[0-7]/z, z[0-9]+\.b, 
z[0-9]+\.b\n} } } */
-/* { dg-final { scan-assembler {\tcmphs\tp[0-7]\.h, p[0-7]/z, z[0-9]+\.h, 
z[0-9]+\.h\n} } } */
-/* { dg-final { scan-assembler 

[PATCH 3/6] aarch64: Relax ordering requirements in SVE dup tests

2023-05-09 Thread Richard Sandiford via Gcc-patches
Some of the svdup tests expand to a SEL between two constant vectors.
This patch allows the constants to be formed in either order.

gcc/testsuite/
* gcc.target/aarch64/sve/acle/asm/dup_s16.c: When using SEL to select
between two constant vectors, allow the constant moves to appear in
either order.
* gcc.target/aarch64/sve/acle/asm/dup_s32.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/dup_s64.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/dup_u16.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/dup_u32.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/dup_u64.c: Likewise.
---
 .../gcc.target/aarch64/sve/acle/asm/dup_s16.c | 72 +++
 .../gcc.target/aarch64/sve/acle/asm/dup_s32.c | 60 
 .../gcc.target/aarch64/sve/acle/asm/dup_s64.c | 60 
 .../gcc.target/aarch64/sve/acle/asm/dup_u16.c | 72 +++
 .../gcc.target/aarch64/sve/acle/asm/dup_u32.c | 60 
 .../gcc.target/aarch64/sve/acle/asm/dup_u64.c | 60 
 6 files changed, 384 insertions(+)

diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/dup_s16.c 
b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/dup_s16.c
index 21ab6f63e37..9c91a5bbad9 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/dup_s16.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/dup_s16.c
@@ -611,9 +611,15 @@ TEST_UNIFORM_Z (dup_127_s16_z, svint16_t,
 
 /*
 ** dup_128_s16_z:
+** (
 ** mov (z[0-9]+)\.b, #0
 ** mov (z[0-9]+\.h), #128
 ** sel z0\.h, p0, \2, \1\.h
+** |
+** mov (z[0-9]+\.h), #128
+** mov (z[0-9]+)\.b, #0
+** sel z0\.h, p0, \3, \4\.h
+** )
 ** ret
 */
 TEST_UNIFORM_Z (dup_128_s16_z, svint16_t,
@@ -632,9 +638,15 @@ TEST_UNIFORM_Z (dup_253_s16_z, svint16_t,
 
 /*
 ** dup_254_s16_z:
+** (
 ** mov (z[0-9]+)\.b, #0
 ** mov (z[0-9]+\.h), #254
 ** sel z0\.h, p0, \2, \1\.h
+** |
+** mov (z[0-9]+\.h), #254
+** mov (z[0-9]+)\.b, #0
+** sel z0\.h, p0, \3, \4\.h
+** )
 ** ret
 */
 TEST_UNIFORM_Z (dup_254_s16_z, svint16_t,
@@ -643,9 +655,15 @@ TEST_UNIFORM_Z (dup_254_s16_z, svint16_t,
 
 /*
 ** dup_255_s16_z:
+** (
 ** mov (z[0-9]+)\.b, #0
 ** mov (z[0-9]+\.h), #255
 ** sel z0\.h, p0, \2, \1\.h
+** |
+** mov (z[0-9]+\.h), #255
+** mov (z[0-9]+)\.b, #0
+** sel z0\.h, p0, \3, \4\.h
+** )
 ** ret
 */
 TEST_UNIFORM_Z (dup_255_s16_z, svint16_t,
@@ -663,9 +681,15 @@ TEST_UNIFORM_Z (dup_256_s16_z, svint16_t,
 
 /*
 ** dup_257_s16_z:
+** (
 ** mov (z[0-9]+)\.b, #0
 ** mov (z[0-9]+)\.b, #1
 ** sel z0\.h, p0, \2\.h, \1\.h
+** |
+** mov (z[0-9]+)\.b, #1
+** mov (z[0-9]+)\.b, #0
+** sel z0\.h, p0, \3\.h, \4\.h
+** )
 ** ret
 */
 TEST_UNIFORM_Z (dup_257_s16_z, svint16_t,
@@ -702,9 +726,15 @@ TEST_UNIFORM_Z (dup_7ffd_s16_z, svint16_t,
 
 /*
 ** dup_7ffe_s16_z:
+** (
 ** mov (z[0-9]+)\.b, #0
 ** mov (z[0-9]+\.h), #32766
 ** sel z0\.h, p0, \2, \1\.h
+** |
+** mov (z[0-9]+\.h), #32766
+** mov (z[0-9]+)\.b, #0
+** sel z0\.h, p0, \3, \4\.h
+** )
 ** ret
 */
 TEST_UNIFORM_Z (dup_7ffe_s16_z, svint16_t,
@@ -713,9 +743,15 @@ TEST_UNIFORM_Z (dup_7ffe_s16_z, svint16_t,
 
 /*
 ** dup_7fff_s16_z:
+** (
 ** mov (z[0-9]+)\.b, #0
 ** mov (z[0-9]+\.h), #32767
 ** sel z0\.h, p0, \2, \1\.h
+** |
+** mov (z[0-9]+\.h), #32767
+** mov (z[0-9]+)\.b, #0
+** sel z0\.h, p0, \3, \4\.h
+** )
 ** ret
 */
 TEST_UNIFORM_Z (dup_7fff_s16_z, svint16_t,
@@ -742,9 +778,15 @@ TEST_UNIFORM_Z (dup_m128_s16_z, svint16_t,
 
 /*
 ** dup_m129_s16_z:
+** (
 ** mov (z[0-9]+)\.b, #0
 ** mov (z[0-9]+\.h), #-129
 ** sel z0\.h, p0, \2, \1\.h
+** |
+** mov (z[0-9]+\.h), #-129
+** mov (z[0-9]+)\.b, #0
+** sel z0\.h, p0, \3, \4\.h
+** )
 ** ret
 */
 TEST_UNIFORM_Z (dup_m129_s16_z, svint16_t,
@@ -763,9 +805,15 @@ TEST_UNIFORM_Z (dup_m254_s16_z, svint16_t,
 
 /*
 ** dup_m255_s16_z:
+** (
 ** mov (z[0-9]+)\.b, #0
 ** mov (z[0-9]+\.h), #-255
 ** sel z0\.h, p0, \2, \1\.h
+** |
+** mov (z[0-9]+\.h), #-255
+** mov (z[0-9]+)\.b, #0
+** sel z0\.h, p0, \3, \4\.h
+** )
 ** ret
 */
 TEST_UNIFORM_Z (dup_m255_s16_z, svint16_t,
@@ -783,9 +831,15 @@ TEST_UNIFORM_Z (dup_m256_s16_z, svint16_t,
 
 /*
 ** dup_m257_s16_z:
+** (
 ** mov (z[0-9]+)\.b, #0
 ** mov (z[0-9]+\.h), #-257
 ** sel z0\.h, p0, \2, \1\.h
+** |
+** mov (z[0-9]+\.h), #-257
+** mov (z[0-9]+)\.b, #0
+** sel z0\.h, p0, \3, \4\.h
+** )
 ** ret
 */
 TEST_UNIFORM_Z (dup_m257_s16_z, svint16_t,
@@ -794,9 +848,15 @@ TEST_UNIFORM_Z (dup_m257_s16_z, svint16_t,
 
 /*
 ** dup_m258_s16_z:
+** (
 ** mov (z[0-9]+)\.b, #0
 ** mov (z[0-9]+)\.b, #-2
 ** sel z0\.h, p0, \2\.h, \1\.h

[PATCH 1/6] aarch64: Fix move-after-intrinsic function-body tests

2023-05-09 Thread Richard Sandiford via Gcc-patches
Some of the SVE ACLE asm tests tried to be agnostic about the
instruction order, but only one of the alternatives was exercised
in practice.  This patch fixes latent typos in the other versions.

gcc/testsuite/
* gcc.target/aarch64/sve2/acle/asm/aesd_u8.c: Fix expected register
allocation in the case where a move occurs after the intrinsic
instruction.
* gcc.target/aarch64/sve2/acle/asm/aese_u8.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/aesimc_u8.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/aesmc_u8.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/sm4e_u32.c: Likewise.
---
 gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/aesd_u8.c   | 4 ++--
 gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/aese_u8.c   | 4 ++--
 gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/aesimc_u8.c | 2 +-
 gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/aesmc_u8.c  | 2 +-
 gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/sm4e_u32.c  | 2 +-
 5 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/aesd_u8.c 
b/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/aesd_u8.c
index 622f5cf4609..384b6ffc9aa 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/aesd_u8.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/aesd_u8.c
@@ -28,13 +28,13 @@ TEST_UNIFORM_Z (aesd_u8_tied2, svuint8_t,
 ** mov z0\.d, z1\.d
** aesd z0\.b, z0\.b, z2\.b
** |
-** aesd z1\.b, z0\.b, z2\.b
+** aesd z1\.b, z1\.b, z2\.b
** mov z0\.d, z1\.d
** |
** mov z0\.d, z2\.d
** aesd z0\.b, z0\.b, z1\.b
** |
-** aesd z2\.b, z0\.b, z1\.b
+** aesd z2\.b, z2\.b, z1\.b
 ** mov z0\.d, z2\.d
 ** )
 ** ret
diff --git a/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/aese_u8.c 
b/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/aese_u8.c
index 6555bbb1de7..6381bce1661 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/aese_u8.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/aese_u8.c
@@ -28,13 +28,13 @@ TEST_UNIFORM_Z (aese_u8_tied2, svuint8_t,
 ** mov z0\.d, z1\.d
** aese z0\.b, z0\.b, z2\.b
** |
-** aese z1\.b, z0\.b, z2\.b
+** aese z1\.b, z1\.b, z2\.b
** mov z0\.d, z1\.d
** |
** mov z0\.d, z2\.d
** aese z0\.b, z0\.b, z1\.b
** |
-** aese z2\.b, z0\.b, z1\.b
+** aese z2\.b, z2\.b, z1\.b
 ** mov z0\.d, z2\.d
 ** )
 ** ret
diff --git a/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/aesimc_u8.c 
b/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/aesimc_u8.c
index 4630595ff20..76259326467 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/aesimc_u8.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/aesimc_u8.c
@@ -19,7 +19,7 @@ TEST_UNIFORM_Z (aesimc_u8_tied1, svuint8_t,
 ** mov z0\.d, z1\.d
 ** aesimc  z0\.b, z0\.b
 ** |
-** aesimc  z1\.b, z0\.b
+** aesimc  z1\.b, z1\.b
 ** mov z0\.d, z1\.d
 ** )
 ** ret
diff --git a/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/aesmc_u8.c 
b/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/aesmc_u8.c
index 6e8acf48f2a..30e83d381dc 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/aesmc_u8.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/aesmc_u8.c
@@ -19,7 +19,7 @@ TEST_UNIFORM_Z (aesmc_u8_tied1, svuint8_t,
 ** mov z0\.d, z1\.d
 ** aesmc   z0\.b, z0\.b
 ** |
-** aesmc   z1\.b, z0\.b
+** aesmc   z1\.b, z1\.b
 ** mov z0\.d, z1\.d
 ** )
 ** ret
diff --git a/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/sm4e_u32.c 
b/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/sm4e_u32.c
index 0ff5746d814..cf6a2a95235 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/sm4e_u32.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/sm4e_u32.c
@@ -24,7 +24,7 @@ TEST_UNIFORM_Z (sm4e_u32_tied2, svuint32_t,
 ** mov z0\.d, z1\.d
** sm4e z0\.s, z0\.s, z2\.s
** |
-** sm4e z1\.s, z0\.s, z2\.s
+** sm4e z1\.s, z1\.s, z2\.s
 ** mov z0\.d, z1\.d
 ** )
 ** ret
-- 
2.25.1



[PATCH 0/6] aarch64: Avoid hard-coding specific register allocations

2023-05-09 Thread Richard Sandiford via Gcc-patches
I have a patch that seems to improve register allocation for SIMD
lane operations, and for similar instructions that require a reduced
register range.  However, it showed that a lot of asm tests are
sensitive to the current register allocation.  This patch series
tries to correct the affected cases.  Putting it in first is an
attempt to “prove” that the new tests work both ways.

Tested on aarch64-linux-gnu and pushed.

Richard


Richard Sandiford (6):
  aarch64: Fix move-after-intrinsic function-body tests
  aarch64: Allow moves after tied-register intrinsics
  aarch64: Relax ordering requirements in SVE dup tests
  aarch64: Relax predicate register matches
  aarch64: Relax FP/vector register matches
  aarch64: Avoid hard-coding specific register allocations

 .../g++.target/aarch64/sve/vcond_1.C  | 258 +-
 .../advsimd-intrinsics/bfcvtnq2-untied.c  |   5 +
 .../aarch64/advsimd-intrinsics/bfdot-1.c  |  10 +
 .../aarch64/advsimd-intrinsics/vdot-3-1.c |  10 +
 .../aarch64/advsimd-intrinsics/vshl-opt-6.c   |   2 +-
 .../gcc.target/aarch64/asimd-mul-to-shl-sub.c |   4 +-
 .../gcc.target/aarch64/asm-x-constraint-1.c   |   4 +-
 .../gcc.target/aarch64/auto-init-padding-1.c  |   2 +-
 .../gcc.target/aarch64/auto-init-padding-2.c  |   3 +-
 .../gcc.target/aarch64/auto-init-padding-3.c  |   3 +-
 .../gcc.target/aarch64/auto-init-padding-4.c  |   3 +-
 .../gcc.target/aarch64/auto-init-padding-9.c  |   2 +-
 .../gcc.target/aarch64/fmul_fcvt_2.c  |   6 +-
 gcc/testsuite/gcc.target/aarch64/ldp_stp_17.c |   2 +-
 gcc/testsuite/gcc.target/aarch64/ldp_stp_21.c |   2 +-
 gcc/testsuite/gcc.target/aarch64/ldp_stp_8.c  |   2 +-
 .../gcc.target/aarch64/memset-corner-cases.c  |  22 +-
 .../gcc.target/aarch64/memset-q-reg.c |  22 +-
 .../gcc.target/aarch64/simd/vaddlv_1.c|  24 +-
 .../gcc.target/aarch64/simd/vpaddd_f64.c  |   2 +-
 .../gcc.target/aarch64/simd/vpaddd_s64.c  |   2 +-
 .../gcc.target/aarch64/simd/vpaddd_u64.c  |   2 +-
 .../gcc.target/aarch64/sve-neon-modes_1.c |   4 +-
 .../gcc.target/aarch64/sve-neon-modes_3.c |  16 +-
 .../aarch64/sve/acle/asm/adda_f16.c   |   5 +
 .../aarch64/sve/acle/asm/adda_f32.c   |   5 +
 .../aarch64/sve/acle/asm/adda_f64.c   |   5 +
 .../gcc.target/aarch64/sve/acle/asm/brka_b.c  |   5 +
 .../gcc.target/aarch64/sve/acle/asm/brkb_b.c  |   5 +
 .../gcc.target/aarch64/sve/acle/asm/brkn_b.c  |   5 +
 .../aarch64/sve/acle/asm/clasta_bf16.c|   5 +
 .../aarch64/sve/acle/asm/clasta_f16.c |   5 +
 .../aarch64/sve/acle/asm/clasta_f32.c |   5 +
 .../aarch64/sve/acle/asm/clasta_f64.c |   5 +
 .../aarch64/sve/acle/asm/clastb_bf16.c|   5 +
 .../aarch64/sve/acle/asm/clastb_f16.c |   5 +
 .../aarch64/sve/acle/asm/clastb_f32.c |   5 +
 .../aarch64/sve/acle/asm/clastb_f64.c |   5 +
 .../gcc.target/aarch64/sve/acle/asm/dup_s16.c |  72 +
 .../gcc.target/aarch64/sve/acle/asm/dup_s32.c |  60 
 .../gcc.target/aarch64/sve/acle/asm/dup_s64.c |  60 
 .../gcc.target/aarch64/sve/acle/asm/dup_u16.c |  72 +
 .../gcc.target/aarch64/sve/acle/asm/dup_u32.c |  60 
 .../gcc.target/aarch64/sve/acle/asm/dup_u64.c |  60 
 .../aarch64/sve/acle/asm/dupq_b16.c   |  86 +++---
 .../aarch64/sve/acle/asm/dupq_b32.c   |  48 ++--
 .../aarch64/sve/acle/asm/dupq_b64.c   |  16 +-
 .../gcc.target/aarch64/sve/acle/asm/dupq_b8.c | 136 -
 .../aarch64/sve/acle/asm/pfirst_b.c   |   5 +
 .../aarch64/sve/acle/asm/pnext_b16.c  |   5 +
 .../aarch64/sve/acle/asm/pnext_b32.c  |   5 +
 .../aarch64/sve/acle/asm/pnext_b64.c  |   5 +
 .../aarch64/sve/acle/asm/pnext_b8.c   |   5 +
 .../aarch64/sve/acle/general/whilele_10.c |   2 +-
 .../aarch64/sve/acle/general/whilele_5.c  |  10 +-
 .../aarch64/sve/acle/general/whilele_6.c  |   2 +-
 .../aarch64/sve/acle/general/whilele_7.c  |   6 +-
 .../aarch64/sve/acle/general/whilele_9.c  |   6 +-
 .../aarch64/sve/acle/general/whilelt_1.c  |  10 +-
 .../aarch64/sve/acle/general/whilelt_2.c  |   2 +-
 .../aarch64/sve/acle/general/whilelt_3.c  |   6 +-
 gcc/testsuite/gcc.target/aarch64/sve/adr_1.c  |  24 +-
 gcc/testsuite/gcc.target/aarch64/sve/adr_2.c  |  24 +-
 gcc/testsuite/gcc.target/aarch64/sve/adr_3.c  |  24 +-
 gcc/testsuite/gcc.target/aarch64/sve/adr_4.c  |   6 +-
 gcc/testsuite/gcc.target/aarch64/sve/adr_5.c  |  16 +-
 .../gcc.target/aarch64/sve/extract_1.c|   4 +-
 .../gcc.target/aarch64/sve/extract_2.c|   4 +-
 .../gcc.target/aarch64/sve/extract_3.c|   4 +-
 .../gcc.target/aarch64/sve/extract_4.c|   4 +-
 .../aarch64/sve/load_scalar_offset_1.c|   8 +-
 .../aarch64/sve/mask_gather_load_6.c  |   4 +-
 .../aarch64/sve/pcs/args_5_be_bf16.c  |  18 +-
 .../aarch64/sve/pcs/args_5_be_f16.c   |  18 +-
 .../aarch64/sve/pcs/args_5_be_f32.c   |  18 +-
 

[PATCH] ira: Don't create copies for earlyclobbered pairs

2023-05-05 Thread Richard Sandiford via Gcc-patches
This patch follows on from g:9f635bd13fe9e85872e441b6f3618947f989909a
("the previous patch").  To start by quoting that:

If an insn requires two operands to be tied, and the input operand dies
in the insn, IRA acts as though there were a copy from the input to the
output with the same execution frequency as the insn.  Allocating the
same register to the input and the output then saves the cost of a move.

If there is no such tie, but an input operand nevertheless dies
in the insn, IRA creates a similar move, but with an eighth of the
frequency.  This helps to ensure that chains of instructions reuse
registers in a natural way, rather than using arbitrarily different
registers for no reason.

This heuristic seems to work well in the vast majority of cases.
However, the problem fixed in the previous patch was that we
could create a copy for an operand pair even if, for all relevant
alternatives, the output and input register classes did not have
any registers in common.  It is then impossible for the output
operand to reuse the dying input register.

This left unfixed a further case where copies don't make sense:
there is no point trying to reuse the dying input register if,
for all relevant alternatives, the output is earlyclobbered and
the input doesn't match the output.  (Matched earlyclobbers are fine.)
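
For illustration, here is a minimal compilable sketch (an illustration,
not part of the patch) of the situation described above: the output is
earlyclobbered ("=&r") and has no matching constraint tying it to the
input, so even though IN dies in the asm, its register can never be
reused for OUT, and a copy between the two operands could never pay off:

    /* Illustrative only; the asm template is empty because only the
       constraints matter here.  */
    int
    earlyclobber_sketch (int in)
    {
      int out;
      asm ("" : "=&r" (out) : "r" (in));
      return out;
    }

(A matched earlyclobber, i.e. "=&r" paired with a "0" input constraint,
is the case the copies still make sense for, since there the two
operands are required to share a register anyway.)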

Handling that case fixes several existing XFAILs and helps with
a follow-on aarch64 patch.

Tested on aarch64-linux-gnu and x86_64-linux-gnu.  A SPEC2017 run
on aarch64 showed no differences outside the noise.  Also, I tried
compiling gcc.c-torture, gcc.dg, and g++.dg for at least one target
per cpu directory, using the options -Os -fno-schedule-insns{,2}.
The results below summarise the tests that showed a difference in LOC:

Target                Tests   Good    Bad   Delta    Best   Worst  Median
======                =====   ====    ===   =====    ====   =====  ======
amdgcn-amdhsa            14      7      7       3     -18      10      -1
arm-linux-gnueabihf      16     15      1     -22      -4       2      -1
csky-elf                  6      6      0     -21      -6      -2      -4
hppa64-hp-hpux11.23       5      5      0      -7      -2      -1      -1
ia64-linux-gnu           16     16      0     -70     -15      -1      -3
m32r-elf                 53      1     52      64      -2       8       1
mcore-elf                 2      2      0      -8      -6      -2      -6
microblaze-elf          285    283      2    -909     -68       4      -1
mmix                      7      7      0   -2101   -2091      -1      -1
msp430-elf                1      1      0      -4      -4      -4      -4
pru-elf                   8      6      2     -12      -6       2      -2
rx-elf                   22     18      4     -40      -5       6      -2
sparc-linux-gnu          15     14      1     -40      -8       1      -2
sparc-wrs-vxworks        15     14      1     -40      -8       1      -2
visium-elf                2      1      1       0      -2       2      -2
xstormy16-elf             1      1      0      -2      -2      -2      -2

with other targets showing no sensitivity to the patch.  The only
target that seems to be negatively affected is m32r-elf; otherwise
the patch seems like an extremely minor but still clear improvement.

OK to install?

Richard


gcc/
* ira-conflicts.cc (can_use_same_reg_p): Skip over non-matching
earlyclobbers.

gcc/testsuite/
* gcc.target/aarch64/sve/acle/asm/asr_wide_s16.c: Remove XFAILs.
* gcc.target/aarch64/sve/acle/asm/asr_wide_s32.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/asr_wide_s8.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/bic_s32.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/bic_s64.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/bic_u32.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/bic_u64.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/lsl_wide_s16.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/lsl_wide_s32.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/lsl_wide_s8.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/lsl_wide_u16.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/lsl_wide_u32.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/lsl_wide_u8.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/lsr_wide_u16.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/lsr_wide_u32.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/lsr_wide_u8.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/scale_f32.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/scale_f64.c: Likewise.
---
 gcc/ira-conflicts.cc | 3 +++
 gcc/testsuite/gcc.target/aarch64/sve/acle/asm/asr_wide_s16.c | 2 +-
 gcc/testsuite/gcc.target/aarch64/sve/acle/asm/asr_wide_s32.c | 2 +-
 gcc/testsuite/gcc.target/aarch64/sve/acle/asm/asr_wide_s8.c  | 2 +-
 gcc/testsuite/gcc.target/aarch64/sve/acle/asm/bic_s32.c  | 2 +-
 

Re: [PATCH 2/3] Refactor widen_plus as internal_fn

2023-05-03 Thread Richard Sandiford via Gcc-patches
Richard Biener  writes:
> On Fri, 28 Apr 2023, Andre Vieira (lists) wrote:
>
>> This patch replaces the existing tree_code widen_plus and widen_minus
>> patterns with internal_fn versions.
>> 
>> DEF_INTERNAL_OPTAB_HILO_FN is like DEF_INTERNAL_OPTAB_FN except it provides
>> convenience wrappers for defining conversions that require a hi/lo split, 
>> like
>> widening and narrowing operations.  Each definition for  will require 
>> an
>> optab named  and two other optabs that you specify for signed and
>> unsigned. The hi/lo pair is necessary because the widening operations take n
>> narrow elements as inputs and return n/2 wide elements as outputs. The 'lo'
>> operation operates on the first n/2 elements of input. The 'hi' operation
>> operates on the second n/2 elements of input. Defining an internal_fn along
>> with hi/lo variations allows a single internal function to be returned from a
>> vect_recog function that will later be expanded to hi/lo.
>> 
>> DEF_INTERNAL_OPTAB_HILO_FN is used in internal-fn.def to register a widening
>> internal_fn. It is defined differently in different places and 
>> internal-fn.def
>> is sourced from those places so the parameters given can be reused.
>>   internal-fn.c: defined to expand to hi/lo signed/unsigned optabs, later
>> defined to generate the  'expand_' functions for the hi/lo versions of the 
>> fn.
>>   internal-fn.def: defined to invoke DEF_INTERNAL_OPTAB_FN for the original
>> and hi/lo variants of the internal_fn
>> 
>>  For example:
>>  IFN_VEC_WIDEN_PLUS -> IFN_VEC_WIDEN_PLUS_HI, IFN_VEC_WIDEN_PLUS_LO
>> for aarch64: IFN_VEC_WIDEN_PLUS_HI   -> vec_widen_addl_hi_ ->
>> (u/s)addl2
>>IFN_VEC_WIDEN_PLUS_LO  -> vec_widen_addl_lo_
>> -> (u/s)addl
>> 
>> This gives the same functionality as the previous WIDEN_PLUS/WIDEN_MINUS tree
>> codes which are expanded into VEC_WIDEN_PLUS_LO, VEC_WIDEN_PLUS_HI.
>
> I'll note that it's interesting we have widen multiplication as
> the only existing example where we have both HI/LO and EVEN/ODD cases.
> I think we want to share as much of the infrastructure to eventually
> support targets doing even/odd (I guess all VLA vector targets will
> be even/odd?).

Can't speak for all, but SVE2 certainly is.

> DEF_INTERNAL_OPTAB_HILO_FN also looks to be implicitely directed to
> widening operations (otherwise no signed/unsigned variants would be
> necessary).  What I don't understand is why we need an optab
> without _hi/_lo but in that case no signed/unsigned variant?
>
> Looks like all plus, plus_lo and plus_hi are commutative but
> only plus is widening?!  So is the setup that the vectorizer
> doesn't know about the split and uses 'plus' but then the
> expander performs the split?  It does look a bit awkward here
> (the plain 'plus' is just used for the scalar case during
> pattern recog it seems).
>
> I'd rather have DEF_INTERNAL_OPTAB_HILO_FN split up, declaring
> the hi/lo pairs and the scalar variant separately using
> DEF_INTERNAL_FN without expander for that, and having
> DEF_INTERNAL_HILO_WIDEN_OPTAB_FN and DEF_INTERNAL_EVENODD_WIDEN_OPTAB_FN
> for the signed/unsigned pairs?  (if we need that helper at all)
>
> Targets shouldn't need to implement the plain optab (it shouldn't
> exist) and the vectorizer should query the hi/lo or even/odd
> optabs for support instead.

I dread these kinds of review because I think I'm almost certain to
flatly contradict something I said last time round, but +1 FWIW.
It seems OK to define an ifn to represent the combined effect, for the
scalar case, but that shouldn't leak into optabs unless we actually want
to use the ifn for "real" scalar ops (as opposed to a temporary
placeholder during pattern recognition).
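
As a purely illustrative example of the scalar code these ifns describe,
a widening addition such as

    /* Example loop only; not taken from the patch.  */
    void
    widen_add (short *__restrict out, const signed char *__restrict a,
               const signed char *__restrict b, int n)
    {
      for (int i = 0; i < n; i++)
        out[i] = (short) a[i] + (short) b[i];
    }

loads N narrow elements per vector, but each widening operation can only
produce N/2 wide results, hence a _LO operation on the first half of the
input and a _HI operation on the second half (or an even/odd split on
targets like SVE2).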

On the optabs/ifn bits:

> +static int
> +ifn_cmp (const void *a_, const void *b_)
> +{
> +  typedef std::pair<internal_fn, unsigned> ifn_pair;
> +  auto *a = (const std::pair<ifn_pair, optab> *)a_;
> +  auto *b = (const std::pair<ifn_pair, optab> *)b_;
> +  return (int) (a->first.first) - (b->first.first);
> +}
> +
> +/* Return the optab belonging to the given internal function NAME for the 
> given
> +   SIGN or unknown_optab.  */
> +
> +optab
> +lookup_hilo_ifn_optab (enum internal_fn fn, unsigned sign)

There is no NAME parameter.  It also isn't clear what SIGN means:
is 1 for unsigned or signed?  Would be better to use signop and
TYPE_SIGN IMO.
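
I.e. something along these lines (a sketch of the suggested interface,
not reviewed code; "ifn" and "vectype" stand for whatever the caller has
to hand):

    optab lookup_hilo_ifn_optab (internal_fn fn, signop sign);

    /* ...and at a call site:  */
    optab ot = lookup_hilo_ifn_optab (ifn, TYPE_SIGN (TREE_TYPE (vectype)));

so that the meaning of the sign argument is self-documenting.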

> +{
> +  typedef std::pair<internal_fn, unsigned> ifn_pair;
> +  typedef auto_vec<std::pair<ifn_pair, optab> > fn_to_optab_map_type;
> +  static fn_to_optab_map_type *fn_to_optab_map;
> +
> +  if (!fn_to_optab_map)
> +{
> +  unsigned num
> + = sizeof (internal_fn_hilo_keys_array) / sizeof (enum internal_fn);
> +  fn_to_optab_map = new fn_to_optab_map_type ();
> +  for (unsigned int i = 0; i < num - 1; ++i)
> + {
> +   enum internal_fn fn = internal_fn_hilo_keys_array[i];
> +   optab v1 = internal_fn_hilo_values_array[2*i];
> +   optab v2 = internal_fn_hilo_values_array[2*i + 1];
> +   ifn_pair key1 (fn, 0);
> +   fn_to_optab_map->safe_push ({key1, v1});
> +   ifn_pair 

Re: [PATCH v2] MIPS: add speculation_barrier support

2023-05-03 Thread Richard Sandiford via Gcc-patches
YunQiang Su  writes:
> speculation_barrier for MIPS needs sync+jr.hb (r2+),
> so we implement __speculation_barrier in libgcc, like arm32 does.

Looks reasonable, but do you have a source for the fallback
pre-r2 handling?  (Thanks for adding that btw, since I realise
it's not your focus here.)
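
For context, a hedged usage sketch (not part of the patch, and assuming
the generic lowering behaves as described): __builtin_speculation_safe_value
falls back to the target's speculation_barrier pattern, so code along
these lines is what ends up emitting the sync+jr.hb sequence, or the
libgcc call on pre-r2:

    /* Illustrative use only.  */
    int
    load_guarded (int *array, unsigned long idx, unsigned long bound)
    {
      int val = 0;
      if (idx < bound)
        /* Return array[idx], but prevent the loaded value from being
           forwarded along a mis-speculated path.  */
        val = __builtin_speculation_safe_value (array[idx]);
      return val;
    }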

Nit: the copyright for the new files should start with this year,
unless you're copying something significant from an existing file.

Thanks,
Richard

>
> gcc/ChangeLog:
>   * config/mips/mips-protos.h (mips_emit_speculation_barrier): New
> prototype.
>   * config/mips/mips.cc (speculation_barrier_libfunc): New static
> variable.
>   (mips_init_libfuncs): Initialize it.
>   (mips_emit_speculation_barrier): New function.
>   * config/mips/mips.md (speculation_barrier): Call
> mips_emit_speculation_barrier.
>
> libgcc/ChangeLog:
>   * config/mips/lib1funcs.S: New file.
>   define __speculation_barrier and include mips16.S.
>   * config/mips/t-mips: define LIB1ASMSRC as mips/lib1funcs.S.
>   define LIB1ASMFUNCS as _speculation_barrier.
>   set version info for __speculation_barrier.
>   * config/mips/libgcc-mips.ver: New file.
>   * config/mips/t-mips16: don't define LIB1ASMSRC as mips16.S is
>   included in lib1funcs.S now.
> ---
>  gcc/config/mips/mips-protos.h  |  2 +
>  gcc/config/mips/mips.cc| 13 +++
>  gcc/config/mips/mips.md| 12 ++
>  libgcc/config/mips/lib1funcs.S | 60 ++
>  libgcc/config/mips/libgcc-mips.ver | 21 +++
>  libgcc/config/mips/t-mips  |  7 
>  libgcc/config/mips/t-mips16|  3 +-
>  7 files changed, 116 insertions(+), 2 deletions(-)
>  create mode 100644 libgcc/config/mips/lib1funcs.S
>  create mode 100644 libgcc/config/mips/libgcc-mips.ver
>
> diff --git a/gcc/config/mips/mips-protos.h b/gcc/config/mips/mips-protos.h
> index 20483469105..da7902c235b 100644
> --- a/gcc/config/mips/mips-protos.h
> +++ b/gcc/config/mips/mips-protos.h
> @@ -388,4 +388,6 @@ extern void mips_register_frame_header_opt (void);
>  extern void mips_expand_vec_cond_expr (machine_mode, machine_mode, rtx *);
>  extern void mips_expand_vec_cmp_expr (rtx *);
>  
> +extern void mips_emit_speculation_barrier_function (void);
> +
>  #endif /* ! GCC_MIPS_PROTOS_H */
> diff --git a/gcc/config/mips/mips.cc b/gcc/config/mips/mips.cc
> index ca822758b41..139707fda34 100644
> --- a/gcc/config/mips/mips.cc
> +++ b/gcc/config/mips/mips.cc
> @@ -13611,6 +13611,9 @@ mips_autovectorize_vector_modes (vector_modes *modes, 
> bool)
>return 0;
>  }
>  
> +
> +static GTY(()) rtx speculation_barrier_libfunc;
> +
>  /* Implement TARGET_INIT_LIBFUNCS.  */
>  
>  static void
> @@ -13680,6 +13683,7 @@ mips_init_libfuncs (void)
>synchronize_libfunc = init_one_libfunc ("__sync_synchronize");
>init_sync_libfuncs (UNITS_PER_WORD);
>  }
> +  speculation_barrier_libfunc = init_one_libfunc ("__speculation_barrier");
>  }
>  
>  /* Build up a multi-insn sequence that loads label TARGET into $AT.  */
> @@ -19092,6 +19096,15 @@ mips_avoid_hazard (rtx_insn *after, rtx_insn *insn, 
> int *hilo_delay,
>}
>  }
>  
> +/* Emit a speculation barrier.
> +   JR.HB is needed, so we need to put
> +   speculation_barrier_libfunc in libgcc */
> +void
> +mips_emit_speculation_barrier_function ()
> +{
> +  emit_library_call (speculation_barrier_libfunc, LCT_NORMAL, VOIDmode);
> +}
> +
>  /* A SEQUENCE is breakable iff the branch inside it has a compact form
> and the target has compact branches.  */
>  
> diff --git a/gcc/config/mips/mips.md b/gcc/config/mips/mips.md
> index ac1d77afc7d..5d04ac566dd 100644
> --- a/gcc/config/mips/mips.md
> +++ b/gcc/config/mips/mips.md
> @@ -160,6 +160,8 @@
>;; The `.insn' pseudo-op.
>UNSPEC_INSN_PSEUDO
>UNSPEC_JRHB
> +
> +  VUNSPEC_SPECULATION_BARRIER
>  ])
>  
>  (define_constants
> @@ -7455,6 +7457,16 @@
>mips_expand_conditional_move (operands);
>DONE;
>  })
> +
> +(define_expand "speculation_barrier"
> +  [(unspec_volatile [(const_int 0)] VUNSPEC_SPECULATION_BARRIER)]
> +  ""
> +  "
> +  mips_emit_speculation_barrier_function ();
> +  DONE;
> +  "
> +)
> +
>  
>  ;;
>  ;;  
> diff --git a/libgcc/config/mips/lib1funcs.S b/libgcc/config/mips/lib1funcs.S
> new file mode 100644
> index 000..45d74e2e762
> --- /dev/null
> +++ b/libgcc/config/mips/lib1funcs.S
> @@ -0,0 +1,60 @@
> +/* Copyright (C) 1995-2023 Free Software Foundation, Inc.
> +
> +This file is free software; you can redistribute it and/or modify it
> +under the terms of the GNU General Public License as published by the
> +Free Software Foundation; either version 3, or (at your option) any
> +later version.
> +
> +This file is distributed in the hope that it will be useful, but
> +WITHOUT ANY WARRANTY; without even the implied warranty of
> +MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +General Public License 

[PATCH 2/2] aarch64: Fix ABI handling of aligned enums [PR109661]

2023-05-03 Thread Richard Sandiford via Gcc-patches
aarch64_function_arg_alignment has traditionally taken the alignment
of a scalar type T from TYPE_ALIGN (TYPE_MAIN_VARIANT (T)).  This is
supposed to discard any user alignment and give the alignment of the
underlying fundamental type.

PR109661 shows that this did the wrong thing for enums with
a defined underlying type, because:

(1) The enum itself could be aligned, using attributes.
(2) The enum would pick up any user alignment on the underlying type.

We get the right behaviour if we look at the TYPE_MAIN_VARIANT
of the underlying type instead.
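
As a hedged illustration (similar in spirit to the new tests, but not
copied from them), both enumerations below have unsigned char as their
underlying fundamental type, yet carry 16-byte user alignment, E1
directly via an attribute and E2 indirectly through an aligned typedef
used as the underlying type:

    /* Illustrative only.  */
    enum __attribute__ ((aligned (16))) E1 : unsigned char { X1 };

    typedef unsigned char aligned_uchar __attribute__ ((aligned (16)));
    enum E2 : aligned_uchar { X2 };

For argument passing, the alignment that should matter in both cases is
that of the underlying fundamental type (1 byte here), i.e. the
TYPE_ALIGN of the underlying type's TYPE_MAIN_VARIANT.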

As always, this affects register and stack arguments differently,
because:

(a) The code that handles register arguments only considers the
alignment of types that occupy two registers, whereas the
stack alignment is applied regardless of size.

(b) The code that handles register arguments tests the alignment
for equality with 16 bytes, so that (unexpected) greater alignments
are ignored.  The code that handles stack arguments instead caps the
alignment to 16 bytes.

There is now (since GCC 13) an assert to trap the difference between
(a) and (b), which is how the new incompatibility showed up.

Clang already handled the testcases correctly, so this patch aligns
the GCC behaviour with the Clang behaviour.

I'm planning to remove the asserts on the branches, since we don't
want to change the ABI there.

Tested on aarch64-linux-gnu & pushed.

Richard


gcc/
PR target/109661
* config/aarch64/aarch64.cc (aarch64_function_arg_alignment): Add
a new ABI break parameter for GCC 14.  Set it to the alignment
of enums that have an underlying type.  Take the true alignment
of such enums from the TYPE_ALIGN of the underlying type's
TYPE_MAIN_VARIANT.
(aarch64_function_arg_boundary): Update accordingly.
(aarch64_layout_arg, aarch64_gimplify_va_arg_expr): Likewise.
Warn about ABI differences.

gcc/testsuite/
* g++.target/aarch64/pr109661-1.C: New test.
* g++.target/aarch64/pr109661-2.C: Likewise.
* g++.target/aarch64/pr109661-3.C: Likewise.
* g++.target/aarch64/pr109661-4.C: Likewise.
* gcc.target/aarch64/pr109661-1.c: Likewise.
---
 gcc/config/aarch64/aarch64.cc |  43 ++-
 gcc/testsuite/g++.target/aarch64/pr109661-1.C | 253 ++
 gcc/testsuite/g++.target/aarch64/pr109661-2.C | 253 ++
 gcc/testsuite/g++.target/aarch64/pr109661-3.C | 253 ++
 gcc/testsuite/g++.target/aarch64/pr109661-4.C | 253 ++
 gcc/testsuite/gcc.target/aarch64/pr109661-1.c |  11 +
 6 files changed, 1061 insertions(+), 5 deletions(-)
 create mode 100644 gcc/testsuite/g++.target/aarch64/pr109661-1.C
 create mode 100644 gcc/testsuite/g++.target/aarch64/pr109661-2.C
 create mode 100644 gcc/testsuite/g++.target/aarch64/pr109661-3.C
 create mode 100644 gcc/testsuite/g++.target/aarch64/pr109661-4.C
 create mode 100644 gcc/testsuite/gcc.target/aarch64/pr109661-1.c

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 70916ad63d2..546cb121331 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -7467,17 +7467,21 @@ aarch64_vfp_is_call_candidate (cumulative_args_t 
pcum_v, machine_mode mode,
4.1).  ABI_BREAK_GCC_9 is set to the old alignment if the alignment
was incorrectly calculated in versions of GCC prior to GCC 9.
ABI_BREAK_GCC_13 is set to the old alignment if it was incorrectly
-   calculated in versions between GCC 9 and GCC 13.
+   calculated in versions between GCC 9 and GCC 13.  If the alignment
+   might have changed between GCC 13 and GCC 14, ABI_BREAK_GCC_14
+   is the old GCC 13 alignment, otherwise it is zero.
 
This is a helper function for local use only.  */
 
 static unsigned int
 aarch64_function_arg_alignment (machine_mode mode, const_tree type,
unsigned int *abi_break_gcc_9,
-   unsigned int *abi_break_gcc_13)
+   unsigned int *abi_break_gcc_13,
+   unsigned int *abi_break_gcc_14)
 {
   *abi_break_gcc_9 = 0;
   *abi_break_gcc_13 = 0;
+  *abi_break_gcc_14 = 0;
   if (!type)
 return GET_MODE_ALIGNMENT (mode);
 
@@ -7498,6 +7502,11 @@ aarch64_function_arg_alignment (machine_mode mode, 
const_tree type,
  gcc_assert (known_eq (POINTER_SIZE, GET_MODE_BITSIZE (mode)));
  return POINTER_SIZE;
}
+  if (TREE_CODE (type) == ENUMERAL_TYPE && TREE_TYPE (type))
+   {
+ *abi_break_gcc_14 = TYPE_ALIGN (type);
+ type = TYPE_MAIN_VARIANT (TREE_TYPE (type));
+   }
   gcc_assert (!TYPE_USER_ALIGN (type));
   return TYPE_ALIGN (type);
 }
@@ -7576,6 +7585,7 @@ aarch64_layout_arg (cumulative_args_t pcum_v, const 
function_arg_info &arg)
   HOST_WIDE_INT size;
   unsigned int abi_break_gcc_9;
   unsigned int abi_break_gcc_13;
+  unsigned int abi_break_gcc_14;
 
   

[PATCH 1/2] aarch64: Rename abi_break parameters [PR109661]

2023-05-03 Thread Richard Sandiford via Gcc-patches
aarch64_function_arg_alignment has two related abi_break
parameters: abi_break for a change in GCC 9, and abi_break_packed
for a related follow-on change in GCC 13.  In a sense, abi_break_packed
is a "subfix" of abi_break.

PR109661 now requires a third ABI break that is independent
of the other two.  Having abi_break for the GCC 9 break and
abi_break_ for the GCC 13 and GCC 14 breaks might
give the impression that they're all related, and that the GCC 14
fix (like the GCC 13 fix) is a "subfix" of the GCC 9 one.
It therefore seemed like a good idea to rename the existing
variables first.

It would be difficult to choose names that describe briefly and
precisely what went wrong in each case.  The next best thing
seemed to be to name them after the relevant GCC version.
(Of course, this might break down in future if we need two
independent fixes in the same version.  Let's hope not.)

I wondered about putting all the variables in a structure,
but one advantage of using independent variables is that it's
harder to forget to update a caller.  Maybe a fourth parameter
would be a tipping point.

Tested on aarch64-linux-gnu & pushed.

Richard


gcc/
PR target/109661
* config/aarch64/aarch64.cc (aarch64_function_arg_alignment): Rename
ABI break variables to abi_break_gcc_9 and abi_break_gcc_13.
(aarch64_layout_arg, aarch64_function_arg_boundary): Likewise.
(aarch64_gimplify_va_arg_expr): Likewise.
---
 gcc/config/aarch64/aarch64.cc | 70 ++-
 1 file changed, 36 insertions(+), 34 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 2b0de7ca038..70916ad63d2 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -7464,19 +7464,20 @@ aarch64_vfp_is_call_candidate (cumulative_args_t 
pcum_v, machine_mode mode,
 /* Given MODE and TYPE of a function argument, return the alignment in
bits.  The idea is to suppress any stronger alignment requested by
the user and opt for the natural alignment (specified in AAPCS64 \S
-   4.1).  ABI_BREAK is set to the old alignment if the alignment was
-   incorrectly calculated in versions of GCC prior to GCC-9.
-   ABI_BREAK_PACKED is set to the old alignment if it was incorrectly
-   calculated in versions between GCC-9 and GCC-13.  This is a helper
-   function for local use only.  */
+   4.1).  ABI_BREAK_GCC_9 is set to the old alignment if the alignment
+   was incorrectly calculated in versions of GCC prior to GCC 9.
+   ABI_BREAK_GCC_13 is set to the old alignment if it was incorrectly
+   calculated in versions between GCC 9 and GCC 13.
+
+   This is a helper function for local use only.  */
 
 static unsigned int
 aarch64_function_arg_alignment (machine_mode mode, const_tree type,
-   unsigned int *abi_break,
-   unsigned int *abi_break_packed)
+   unsigned int *abi_break_gcc_9,
+   unsigned int *abi_break_gcc_13)
 {
-  *abi_break = 0;
-  *abi_break_packed = 0;
+  *abi_break_gcc_9 = 0;
+  *abi_break_gcc_13 = 0;
   if (!type)
 return GET_MODE_ALIGNMENT (mode);
 
@@ -7547,11 +7548,11 @@ aarch64_function_arg_alignment (machine_mode mode, 
const_tree type,
  'packed' attribute into account.  */
   if (bitfield_alignment != bitfield_alignment_with_packed
   && bitfield_alignment_with_packed > alignment)
-*abi_break_packed = bitfield_alignment_with_packed;
+*abi_break_gcc_13 = bitfield_alignment_with_packed;
 
   if (bitfield_alignment > alignment)
 {
-  *abi_break = alignment;
+  *abi_break_gcc_9 = alignment;
   return bitfield_alignment;
 }
 
@@ -7573,8 +7574,8 @@ aarch64_layout_arg (cumulative_args_t pcum_v, const 
function_arg_info &arg)
   int ncrn, nvrn, nregs;
   bool allocate_ncrn, allocate_nvrn;
   HOST_WIDE_INT size;
-  unsigned int abi_break;
-  unsigned int abi_break_packed;
+  unsigned int abi_break_gcc_9;
+  unsigned int abi_break_gcc_13;
 
   /* We need to do this once per argument.  */
   if (pcum->aapcs_arg_processed)
@@ -7612,7 +7613,7 @@ aarch64_layout_arg (cumulative_args_t pcum_v, const 
function_arg_info &arg)
 
  Versions prior to GCC 9.1 ignored a bitfield's underlying type
  and so could calculate an alignment that was too small.  If this
- happened for TYPE then ABI_BREAK is this older, too-small alignment.
+ happened for TYPE then ABI_BREAK_GCC_9 is this older, too-small alignment.
 
  Although GCC 9.1 fixed that bug, it introduced a different one:
  it would consider the alignment of a bitfield's underlying type even
@@ -7620,7 +7621,7 @@ aarch64_layout_arg (cumulative_args_t pcum_v, const 
function_arg_info &arg)
  the alignment of the underlying type).  This was fixed in GCC 13.1.
 
  As a result of this bug, GCC 9 to GCC 12 could calculate an alignment
- that was too big.  If this happened for TYPE, ABI_BREAK_PACKED is
+ that was 

Re: [aarch64] Code-gen for vector initialization involving constants

2023-05-02 Thread Richard Sandiford via Gcc-patches
Prathamesh Kulkarni  writes:
> On Tue, 2 May 2023 at 17:32, Richard Sandiford
>  wrote:
>>
>> Prathamesh Kulkarni  writes:
>> > On Tue, 2 May 2023 at 14:56, Richard Sandiford
>> >  wrote:
>> >> > [aarch64] Improve code-gen for vector initialization with single 
>> >> > constant element.
>> >> >
>> >> > gcc/ChangeLog:
>> >> >   * config/aarch64/aarch64.cc (aarch64_expand_vector_init): Tweak 
>> >> > condition
>> >> >   if (n_var == n_elts && n_elts <= 16) to allow a single constant,
>> >> >   and if maxv == 1, use constant element for duplicating into 
>> >> > register.
>> >> >
>> >> > gcc/testsuite/ChangeLog:
>> >> >   * gcc.target/aarch64/vec-init-single-const.c: New test.
>> >> >
>> >> > diff --git a/gcc/config/aarch64/aarch64.cc 
>> >> > b/gcc/config/aarch64/aarch64.cc
>> >> > index 2b0de7ca038..f46750133a6 100644
>> >> > --- a/gcc/config/aarch64/aarch64.cc
>> >> > +++ b/gcc/config/aarch64/aarch64.cc
>> >> > @@ -22167,7 +22167,7 @@ aarch64_expand_vector_init (rtx target, rtx 
>> >> > vals)
>> >> >   and matches[X][1] with the count of duplicate elements (if X is 
>> >> > the
>> >> >   earliest element which has duplicates).  */
>> >> >
>> >> > -  if (n_var == n_elts && n_elts <= 16)
>> >> > +  if ((n_var >= n_elts - 1) && n_elts <= 16)
>> >> >  {
>> >> >int matches[16][2] = {0};
>> >> >for (int i = 0; i < n_elts; i++)
>> >> > @@ -7,6 +7,18 @@ aarch64_expand_vector_init (rtx target, rtx 
>> >> > vals)
>> >> >vector register.  For big-endian we want that position to 
>> >> > hold
>> >> >the last element of VALS.  */
>> >> > maxelement = BYTES_BIG_ENDIAN ? n_elts - 1 : 0;
>> >> > +
>> >> > +   /* If we have a single constant element, use that for 
>> >> > duplicating
>> >> > +  instead.  */
>> >> > +   if (n_var == n_elts - 1)
>> >> > + for (int i = 0; i < n_elts; i++)
>> >> > +   if (CONST_INT_P (XVECEXP (vals, 0, i))
>> >> > +   || CONST_DOUBLE_P (XVECEXP (vals, 0, i)))
>> >> > + {
>> >> > +   maxelement = i;
>> >> > +   break;
>> >> > + }
>> >> > +
>> >> > rtx x = force_reg (inner_mode, XVECEXP (vals, 0, maxelement));
>> >> > aarch64_emit_move (target, lowpart_subreg (mode, x, 
>> >> > inner_mode));
>> >>
>> >> We don't want to force the constant into a register though.
>> > OK right, sorry.
>> > With the attached patch, for the following test-case:
>> > int64x2_t f_s64(int64_t x)
>> > {
>> >   return (int64x2_t) { x, 1 };
>> > }
>> >
>> > it loads constant from memory (same code-gen as without patch).
>> > f_s64:
>> > adrp    x1, .LC0
>> > ldr q0, [x1, #:lo12:.LC0]
>> > ins v0.d[0], x0
>> > ret
>> >
>> > Does the patch look OK ?
>> >
>> > Thanks,
>> > Prathamesh
>> > [...]
>> > [aarch64] Improve code-gen for vector initialization with single constant 
>> > element.
>> >
>> > gcc/ChangeLog:
>> >   * config/aarch64/aarch64.cc (aarch64_expand_vector_init): Tweak 
>> > condition
>> >   if (n_var == n_elts && n_elts <= 16) to allow a single constant,
>> >   and if maxv == 1, use constant element for duplicating into register.
>> >
>> > gcc/testsuite/ChangeLog:
>> >   * gcc.target/aarch64/vec-init-single-const.c: New test.
>> >
>> > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
>> > index 2b0de7ca038..97309ddec4f 100644
>> > --- a/gcc/config/aarch64/aarch64.cc
>> > +++ b/gcc/config/aarch64/aarch64.cc
>> > @@ -22167,7 +22167,7 @@ aarch64_expand_vector_init (rtx target, rtx vals)
>> >   and matches[X][1] with the count of duplicate elements (if X is the
>> >   earliest element which has duplicates).  */
>> >
>> > -  if (n_var == n_elts && n_elts <= 16)
>> > +  if ((n_var >= n_elts - 1) && n_elts <= 16)
>>
>> No need for the extra brackets.
> Adjusted, thanks. Sorry if this sounds like a silly question, but why
> do we need the n_elts <= 16 check ?
> Won't n_elts be always <= 16 since max number of elements in a vector
> would be 16 for V16QI ?

Was wondering the same thing :)

Let's leave it though.

>> >  {
>> >int matches[16][2] = {0};
>> >for (int i = 0; i < n_elts; i++)
>> > @@ -7,8 +7,26 @@ aarch64_expand_vector_init (rtx target, rtx vals)
>> >vector register.  For big-endian we want that position to hold
>> >the last element of VALS.  */
>> > maxelement = BYTES_BIG_ENDIAN ? n_elts - 1 : 0;
>> > -   rtx x = force_reg (inner_mode, XVECEXP (vals, 0, maxelement));
>> > -   aarch64_emit_move (target, lowpart_subreg (mode, x, inner_mode));
>> > +
>> > +   /* If we have a single constant element, use that for duplicating
>> > +  instead.  */
>> > +   if (n_var == n_elts - 1)
>> > + for (int i = 0; i < n_elts; i++)
>> > +   if (CONST_INT_P (XVECEXP (vals, 0, i))
>> > +   || CONST_DOUBLE_P (XVECEXP (vals, 0, i)))
>> > +  

Re: [PATCH] target: [PR109657] (a ? -1 : 0) | b could be optimized better for aarch64

2023-05-02 Thread Richard Sandiford via Gcc-patches
Andrew Pinski via Gcc-patches  writes:
> There is no canonical form for this case defined. So the aarch64 backend needs
> a pattern to match both of these forms.
>
> The forms are:
> (set (reg/i:SI 0 x0)
> (if_then_else:SI (eq (reg:CC 66 cc)
> (const_int 0 [0]))
> (reg:SI 97)
> (const_int -1 [0x])))
> and
> (set (reg/i:SI 0 x0)
> (ior:SI (neg:SI (ne:SI (reg:CC 66 cc)
> (const_int 0 [0])))
> (reg:SI 102)))
>
> Currently the aarch64 backend matches the first form so this
> patch adds a insn_and_split to match the second form and
> convert it to the first form.
>
> OK? Bootstrapped and tested on aarch64-linux-gnu with no regressions
>
>   PR target/109657
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64.md (*cmov_insn_m1): New
>   insn_and_split pattern.
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/csinv-2.c: New test.
> ---
>  gcc/config/aarch64/aarch64.md  | 20 +
>  gcc/testsuite/gcc.target/aarch64/csinv-2.c | 26 ++
>  2 files changed, 46 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/csinv-2.c
>
> diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
> index e1a2b265b20..57fe5601350 100644
> --- a/gcc/config/aarch64/aarch64.md
> +++ b/gcc/config/aarch64/aarch64.md
> @@ -4194,6 +4194,26 @@ (define_insn "*cmovsi_insn_uxtw"
>[(set_attr "type" "csel, csel, csel, csel, csel, mov_imm, mov_imm")]
>  )
>  
> +;; There are two canonical forms for `cmp ? -1 : a`.
> +;; This is the second form and is here to help combine.
> +;; Support `-(cmp) | a` into `cmp ? -1 : a` to be canonical in the backend.
> +(define_insn_and_split "*cmov_insn_m1"
> +  [(set (match_operand:GPI 0 "register_operand" "=r")
> +(ior:GPI
> +  (neg:GPI
> +   (match_operator:GPI 1 "aarch64_comparison_operator"
> +[(match_operand 2 "cc_register" "") (const_int 0)]))
> +  (match_operand 3 "register_operand" "r")))]
> +  ""
> +  "#"
> +  "&& true"
> +  [(set (match_dup 0)
> + (if_then_else:GPI (match_dup 1)
> +  (const_int -1) (match_dup 3)))]

Sorry for the nit, but the formatting of the last two lines looks odd IMO.
How about:

(if_then_else:GPI (match_dup 1) (const_int -1) (match_dup 3))...

or:

(if_then_else:GPI (match_dup 1)
  (const_int -1)
  (match_dup 3))...

OK with that change, thanks.

Richard

> +  {}
> +  [(set_attr "type" "csel")]
> +)
> +
>  (define_insn "*cmovdi_insn_uxtw"
>[(set (match_operand:DI 0 "register_operand" "=r")
>   (if_then_else:DI
> diff --git a/gcc/testsuite/gcc.target/aarch64/csinv-2.c 
> b/gcc/testsuite/gcc.target/aarch64/csinv-2.c
> new file mode 100644
> index 000..89132acb713
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/csinv-2.c
> @@ -0,0 +1,26 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2" } */
> +/* PR target/109657: (a ? -1 : 0) | b could be better */
> +
> +/* Both functions should have the same assembly of:
> +   cmp w1, 0
> +   csinv   w0, w0, wzr, eq
> +
> +   We should not get:
> +   cmp w1, 0
> +   csetm   w1, ne
> +   orr w0, w1, w0
> + */
> +/* { dg-final { scan-assembler-times "csinv\tw\[0-9\]" 2 } } */
> +/* { dg-final { scan-assembler-not "csetm\tw\[0-9\]" } } */
> +unsigned b(unsigned a, unsigned b)
> +{
> +  if(b)
> +return -1;
> +  return a;
> +}
> +unsigned b1(unsigned a, unsigned b)
> +{
> +unsigned t = b ? -1 : 0;
> +return a | t;
> +}


Re: [aarch64] Code-gen for vector initialization involving constants

2023-05-02 Thread Richard Sandiford via Gcc-patches
Prathamesh Kulkarni  writes:
> On Tue, 2 May 2023 at 14:56, Richard Sandiford
>  wrote:
>> > [aarch64] Improve code-gen for vector initialization with single constant 
>> > element.
>> >
>> > gcc/ChangeLog:
>> >   * config/aarch64/aarch64.cc (aarch64_expand_vector_init): Tweak 
>> > condition
>> >   if (n_var == n_elts && n_elts <= 16) to allow a single constant,
>> >   and if maxv == 1, use constant element for duplicating into register.
>> >
>> > gcc/testsuite/ChangeLog:
>> >   * gcc.target/aarch64/vec-init-single-const.c: New test.
>> >
>> > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
>> > index 2b0de7ca038..f46750133a6 100644
>> > --- a/gcc/config/aarch64/aarch64.cc
>> > +++ b/gcc/config/aarch64/aarch64.cc
>> > @@ -22167,7 +22167,7 @@ aarch64_expand_vector_init (rtx target, rtx vals)
>> >   and matches[X][1] with the count of duplicate elements (if X is the
>> >   earliest element which has duplicates).  */
>> >
>> > -  if (n_var == n_elts && n_elts <= 16)
>> > +  if ((n_var >= n_elts - 1) && n_elts <= 16)
>> >  {
>> >int matches[16][2] = {0};
>> >for (int i = 0; i < n_elts; i++)
>> > @@ -7,6 +7,18 @@ aarch64_expand_vector_init (rtx target, rtx vals)
>> >vector register.  For big-endian we want that position to hold
>> >the last element of VALS.  */
>> > maxelement = BYTES_BIG_ENDIAN ? n_elts - 1 : 0;
>> > +
>> > +   /* If we have a single constant element, use that for duplicating
>> > +  instead.  */
>> > +   if (n_var == n_elts - 1)
>> > + for (int i = 0; i < n_elts; i++)
>> > +   if (CONST_INT_P (XVECEXP (vals, 0, i))
>> > +   || CONST_DOUBLE_P (XVECEXP (vals, 0, i)))
>> > + {
>> > +   maxelement = i;
>> > +   break;
>> > + }
>> > +
>> > rtx x = force_reg (inner_mode, XVECEXP (vals, 0, maxelement));
>> > aarch64_emit_move (target, lowpart_subreg (mode, x, inner_mode));
>>
>> We don't want to force the constant into a register though.
> OK right, sorry.
> With the attached patch, for the following test-case:
> int64x2_t f_s64(int64_t x)
> {
>   return (int64x2_t) { x, 1 };
> }
>
> it loads constant from memory (same code-gen as without patch).
> f_s64:
> adrp    x1, .LC0
> ldr q0, [x1, #:lo12:.LC0]
> ins v0.d[0], x0
> ret
>
> Does the patch look OK ?
>
> Thanks,
> Prathamesh
> [...]
> [aarch64] Improve code-gen for vector initialization with single constant 
> element.
>
> gcc/ChangeLog:
>   * config/aarch64/aarch64.cc (aarch64_expand_vector_init): Tweak condition
>   if (n_var == n_elts && n_elts <= 16) to allow a single constant,
>   and if maxv == 1, use constant element for duplicating into register.
>
> gcc/testsuite/ChangeLog:
>   * gcc.target/aarch64/vec-init-single-const.c: New test.
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 2b0de7ca038..97309ddec4f 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -22167,7 +22167,7 @@ aarch64_expand_vector_init (rtx target, rtx vals)
>   and matches[X][1] with the count of duplicate elements (if X is the
>   earliest element which has duplicates).  */
>  
> -  if (n_var == n_elts && n_elts <= 16)
> +  if ((n_var >= n_elts - 1) && n_elts <= 16)

No need for the extra brackets.

>  {
>int matches[16][2] = {0};
>for (int i = 0; i < n_elts; i++)
> @@ -7,8 +7,26 @@ aarch64_expand_vector_init (rtx target, rtx vals)
>vector register.  For big-endian we want that position to hold
>the last element of VALS.  */
> maxelement = BYTES_BIG_ENDIAN ? n_elts - 1 : 0;
> -   rtx x = force_reg (inner_mode, XVECEXP (vals, 0, maxelement));
> -   aarch64_emit_move (target, lowpart_subreg (mode, x, inner_mode));
> +
> +   /* If we have a single constant element, use that for duplicating
> +  instead.  */
> +   if (n_var == n_elts - 1)
> + for (int i = 0; i < n_elts; i++)
> +   if (CONST_INT_P (XVECEXP (vals, 0, i))
> +   || CONST_DOUBLE_P (XVECEXP (vals, 0, i)))
> + {
> +   maxelement = i;
> +   break;
> + }
> +
> +   rtx maxval = XVECEXP (vals, 0, maxelement);
> +   if (!(CONST_INT_P (maxval) || CONST_DOUBLE_P (maxval)))
> + {
> +   rtx x = force_reg (inner_mode, XVECEXP (vals, 0, maxelement));
> +   aarch64_emit_move (target, lowpart_subreg (mode, x, inner_mode));
> + }
> +   else
> + aarch64_emit_move (target, gen_vec_duplicate (mode, maxval));
>   }
>else
>   {

This seems a bit convoluted.  It might be easier to record whether
we see a CONST_INT_P or a CONST_DOUBLE_P during the previous loop,
and if so what the constant is.  Then handle that case first,
as a separate arm of the "if".

> 

Re: [PATCH v5 07/10] vect: Verify that GET_MODE_NUNITS is a multiple of 2.

2023-05-02 Thread Richard Sandiford via Gcc-patches
Michael Collison  writes:
> While working on autovectorizing for the RISCV port I encountered an issue
> where can_duplicate_and_interleave_p assumes that GET_MODE_NUNITS is
> evenly divisible by two. The RISC-V target has vector modes (e.g. VNx1DImode),
> where GET_MODE_NUNITS is equal to one.
>
> Tested on RISCV and x86_64-linux-gnu. Okay?
>
> 2023-03-09  Michael Collison  
>
>   * tree-vect-slp.cc (can_duplicate_and_interleave_p):
>   Check that GET_MODE_NUNITS is a multiple of 2.

OK, thanks.  Doesn't need to wait for any other of the other patches
in the series.

Richard

> ---
>  gcc/tree-vect-slp.cc | 7 +--
>  1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
> index d73deaecce0..a64fe454e19 100644
> --- a/gcc/tree-vect-slp.cc
> +++ b/gcc/tree-vect-slp.cc
> @@ -423,10 +423,13 @@ can_duplicate_and_interleave_p (vec_info *vinfo, 
> unsigned int count,
>   (GET_MODE_BITSIZE (int_mode), 1);
> tree vector_type
>   = get_vectype_for_scalar_type (vinfo, int_type, count);
> +   poly_int64 half_nelts;
> if (vector_type
> && VECTOR_MODE_P (TYPE_MODE (vector_type))
> && known_eq (GET_MODE_SIZE (TYPE_MODE (vector_type)),
> -GET_MODE_SIZE (base_vector_mode)))
> +GET_MODE_SIZE (base_vector_mode))
> +   && multiple_p (GET_MODE_NUNITS (TYPE_MODE (vector_type)),
> +  2, &half_nelts))
>   {
> /* Try fusing consecutive sequences of COUNT / NVECTORS elements
>together into elements of type INT_TYPE and using the result
> @@ -434,7 +437,7 @@ can_duplicate_and_interleave_p (vec_info *vinfo, unsigned 
> int count,
> poly_uint64 nelts = GET_MODE_NUNITS (TYPE_MODE (vector_type));
> vec_perm_builder sel1 (nelts, 2, 3);
> vec_perm_builder sel2 (nelts, 2, 3);
> -   poly_int64 half_nelts = exact_div (nelts, 2);
> +
> for (unsigned int i = 0; i < 3; ++i)
>   {
> sel1.quick_push (i);


Re: [aarch64] Code-gen for vector initialization involving constants

2023-05-02 Thread Richard Sandiford via Gcc-patches
Prathamesh Kulkarni  writes:
> On Tue, 25 Apr 2023 at 16:29, Richard Sandiford
>  wrote:
>>
>> Prathamesh Kulkarni  writes:
>> > Hi Richard,
>> > While digging thru aarch64_expand_vector_init, I noticed it gives
>> > priority to loading a constant first:
>> >  /* Initialise a vector which is part-variable.  We want to first try
>> >  to build those lanes which are constant in the most efficient way we
>> >  can.  */
>> >
>> > which results in suboptimal code-gen for following case:
>> > int16x8_t f_s16(int16_t x)
>> > {
>> >   return (int16x8_t) { x, x, x, x, x, x, x, 1 };
>> > }
>> >
>> > code-gen trunk:
>> > f_s16:
>> > movi    v0.8h, 0x1
>> > ins v0.h[0], w0
>> > ins v0.h[1], w0
>> > ins v0.h[2], w0
>> > ins v0.h[3], w0
>> > ins v0.h[4], w0
>> > ins v0.h[5], w0
>> > ins v0.h[6], w0
>> > ret
>> >
>> > The attached patch tweaks the following condition:
>> > if (n_var == n_elts && n_elts <= 16)
>> >   {
>> > ...
>> >   }
>> >
>> > to pass if maxv >= 80% of n_elts, with 80% being an
>> > arbitrary "high enough" threshold. The intent is to dup
>> > the most repeating variable if it it's repetition
>> > is "high enough" and insert constants which should be "better" than
>> > loading constant first and inserting variables like in the above case.
>>
>> I'm not too keen on the 80%.  Like you say, it seems a bit arbitrary.
>>
>> The case above can also be handled by relaxing n_var == n_elts to
>> n_var >= n_elts - 1, so that if there's just one constant element,
>> we look for duplicated variable elements.  If there are none
>> (maxv == 1), but there is a constant element, we can duplicate
>> the constant element into a register.
>>
>> The case when there's more than one constant element needs more thought
>> (and testcases :-)).  E.g. after a certain point, it would probably be
>> better to load the variable and constant parts separately and blend them
>> using TBL.  It also matters whether the constants are equal or not.
>>
>> There are also cases that could be handled using EXT.
>>
>> Plus, if we're inserting many variable elements that are already
>> in GPRs, we can probably do better by coalescing them into bigger
>> GPR values and inserting them as wider elements.
>>
>> Because of things like that, I think we should stick to the
>> single-constant case for now.
> Hi Richard,
> Thanks for the suggestions. The attached patch only handles the single
> constant case.
> Bootstrap+test in progress on aarch64-linux-gnu.
> Does it look OK ?
>
> Thanks,
> Prathamesh
>>
>> Thanks,
>> Richard
>
> [aarch64] Improve code-gen for vector initialization with single constant 
> element.
>
> gcc/ChangeLog:
>   * config/aarch64/aarch64.cc (aarch64_expand_vector_init): Tweak condition
>   if (n_var == n_elts && n_elts <= 16) to allow a single constant,
>   and if maxv == 1, use constant element for duplicating into register.
>
> gcc/testsuite/ChangeLog:
>   * gcc.target/aarch64/vec-init-single-const.c: New test.
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 2b0de7ca038..f46750133a6 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -22167,7 +22167,7 @@ aarch64_expand_vector_init (rtx target, rtx vals)
>   and matches[X][1] with the count of duplicate elements (if X is the
>   earliest element which has duplicates).  */
>  
> -  if (n_var == n_elts && n_elts <= 16)
> +  if ((n_var >= n_elts - 1) && n_elts <= 16)
>  {
>int matches[16][2] = {0};
>for (int i = 0; i < n_elts; i++)
> @@ -7,6 +7,18 @@ aarch64_expand_vector_init (rtx target, rtx vals)
>vector register.  For big-endian we want that position to hold
>the last element of VALS.  */
> maxelement = BYTES_BIG_ENDIAN ? n_elts - 1 : 0;
> +
> +   /* If we have a single constant element, use that for duplicating
> +  instead.  */
> +   if (n_var == n_elts - 1)
> + for (int i = 0; i < n_elts; i++)
> +   if (CONST_INT_P (XVECEXP (vals, 0, i))
> +   || CONST_DOUBLE_P (XVECEXP (vals, 0, i)))
> + {
> +   maxelement = i;
> +   break;
> + }
> +
> rtx x = force_reg (inner_mode, XVECEXP (vals, 0, maxelement));
> aarch64_emit_move (target, lowpart_subreg (mode, x, inner_mode));

We don't want to force the constant into a register though.

>   }
> diff --git a/gcc/testsuite/gcc.target/aarch64/vec-init-single-const.c 
> b/gcc/testsuite/gcc.target/aarch64/vec-init-single-const.c
> new file mode 100644
> index 000..517f47b13ec
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/vec-init-single-const.c
> @@ -0,0 +1,66 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2" } */
> +/* { dg-final { check-function-bodies "**" "" "" } } */
> +
> +#include 
> +
> +/*
> +** f_s8:
> +**   ...
> +**   dup 

Re: [PATCH] v2: Add targetm.libm_function_max_error

2023-04-28 Thread Richard Sandiford via Gcc-patches
Jakub Jelinek via Gcc-patches  writes:
> Hi!
>
> On Thu, Apr 27, 2023 at 10:34:59AM +, Richard Biener wrote:
>> OK. As said the patch itself looks good to me, let's go ahead.  We
>> have plenty of time to backtrack until GCC 14.
>
> Thanks.  Unfortunately when I started using it, I've discovered that the
> CASE_CFN_xxx_ALL macros don't include the CFN_xxx cases, just
> CFN_BUILT_IN_xxx* cases.
>
> So here is an updated version of the patch I'll bootstrap/regtest tonight
> which instead uses CASE_CFN_xxx: CASE_CFN_xxx_FN:

Shouldn't we change something in that case?  The point of these macros
is to wrap things up in a single easy-to-use name, so something feels wrong
if we're having to use a repeated pattern like this.
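
Concretely, the pattern every consumer of the hook has to write looks
something like this (a sketch only; the function, the selection of
functions and the ulp values are purely illustrative):

    /* Sketch of a target hook implementation; values invented.  */
    unsigned
    example_libm_function_max_error (unsigned cfn, machine_mode, bool)
    {
      switch (cfn)
        {
        CASE_CFN_SIN:
        CASE_CFN_SIN_FN:
          return 1;
        CASE_CFN_COS:
        CASE_CFN_COS_FN:
          return 1;
        default:
          return ~0U;
        }
    }

whereas a single macro that also covered the IFN-only CFN_xxx codes
would need just one label per function.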

Thanks,
Richard

> 2023-04-27  Jakub Jelinek  
>
>   * target.def (libm_function_max_error): New target hook.
>   * doc/tm.texi.in (TARGET_LIBM_FUNCTION_MAX_ERROR): Add.
>   * doc/tm.texi: Regenerated.
>   * targhooks.h (default_libm_function_max_error,
>   glibc_linux_libm_function_max_error): Declare.
>   * targhooks.cc: Include case-cfn-macros.h.
>   (default_libm_function_max_error,
>   glibc_linux_libm_function_max_error): New functions.
>   * config/linux.h (TARGET_LIBM_FUNCTION_MAX_ERROR): Redefine.
>   * config/linux-protos.h (linux_libm_function_max_error): Declare.
>   * config/linux.cc: Include target.h and targhooks.h.
>   (linux_libm_function_max_error): New function.
>   * config/arc/arc.cc: Include targhooks.h and case-cfn-macros.h.
>   (arc_libm_function_max_error): New function.
>   (TARGET_LIBM_FUNCTION_MAX_ERROR): Redefine.
>   * config/i386/i386.cc (ix86_libc_has_fast_function): Formatting fix.
>   (ix86_libm_function_max_error): New function.
>   (TARGET_LIBM_FUNCTION_MAX_ERROR): Redefine.
>   * config/rs6000/rs6000-protos.h
>   (rs6000_linux_libm_function_max_error): Declare.
>   * config/rs6000/rs6000-linux.cc: Include target.h, targhooks.h, tree.h
>   and case-cfn-macros.h.
>   (rs6000_linux_libm_function_max_error): New function.
>   * config/rs6000/linux.h (TARGET_LIBM_FUNCTION_MAX_ERROR): Redefine.
>   * config/rs6000/linux64.h (TARGET_LIBM_FUNCTION_MAX_ERROR): Redefine.
>   * config/or1k/or1k.cc: Include targhooks.h and case-cfn-macros.h.
>   (or1k_libm_function_max_error): New function.
>   (TARGET_LIBM_FUNCTION_MAX_ERROR): Redefine.
>
> --- gcc/target.def.jj 2023-04-27 10:17:32.598686398 +0200
> +++ gcc/target.def2023-04-27 10:26:58.361490211 +0200
> @@ -2670,6 +2670,23 @@ DEFHOOK
>   bool, (int fcode),
>   default_libc_has_fast_function)
>  
> +DEFHOOK
> +(libm_function_max_error,
> + "This hook determines expected maximum errors for math functions measured\n\
> +in ulps (units of the last place).  0 means 0.5ulps precision (correctly\n\
> +rounded).  ~0U means unknown errors.  The @code{combined_fn} @var{cfn}\n\
> +argument should identify just which math built-in function it is rather 
> than\n\
> +its variant, @var{mode} the variant in terms of floating-point machine 
> mode.\n\
> +The hook should also take into account @code{flag_rounding_math} whether 
> it\n\
> +is maximum error just in default rounding mode, or in all possible 
> rounding\n\
> +modes.  @var{boundary_p} is @code{true} for maximum errors on intrinsic 
> math\n\
> +boundaries of functions rather than errors inside of the usual result 
> ranges\n\
> +of the functions.  E.g.@ the sin/cos function finite result is in between\n\
> +-1.0 and 1.0 inclusive, with @var{boundary_p} true the function returns 
> how\n\
> +many ulps below or above those boundaries result could be.",
> + unsigned, (unsigned cfn, machine_mode mode, bool boundary_p),
> + default_libm_function_max_error)
> +
>  /* True if new jumps cannot be created, to replace existing ones or
> not, at the current point in the compilation.  */
>  DEFHOOK
> --- gcc/doc/tm.texi.in.jj 2023-04-27 10:17:32.596686427 +0200
> +++ gcc/doc/tm.texi.in2023-04-27 10:26:58.362490196 +0200
> @@ -4004,6 +4004,8 @@ macro, a reasonable default is used.
>  
>  @hook TARGET_LIBC_HAS_FAST_FUNCTION
>  
> +@hook TARGET_LIBM_FUNCTION_MAX_ERROR
> +
>  @defmac NEXT_OBJC_RUNTIME
>  Set this macro to 1 to use the "NeXT" Objective-C message sending conventions
>  by default.  This calling convention involves passing the object, the 
> selector
> --- gcc/doc/tm.texi.jj2023-04-27 10:17:32.593686470 +0200
> +++ gcc/doc/tm.texi   2023-04-27 10:26:58.364490167 +0200
> @@ -5760,6 +5760,21 @@ This hook determines whether a function
>  @code{(enum function_class)}@var{fcode} has a fast implementation.
>  @end deftypefn
>  
> +@deftypefn {Target Hook} unsigned TARGET_LIBM_FUNCTION_MAX_ERROR (unsigned 
> @var{cfn}, machine_mode @var{mode}, bool @var{boundary_p})
> +This hook determines expected maximum errors for math functions measured
> +in ulps (units of the last place).  0 means 0.5ulps precision (correctly
> +rounded).  ~0U means 

Re: [PATCH] VECT: Add decrement IV iteration loop control by variable amount support

2023-04-26 Thread Richard Sandiford via Gcc-patches
"juzhe.zh...@rivai.ai"  writes:
> Hi, Richard.
> Would you mind taking a look at the loop control part again:
>
> static gcond *
> vect_set_loop_condition_partial_vectors (class loop *loop,
> loop_vec_info loop_vinfo, tree niters,
> tree final_iv, bool niters_maybe_zero,
> gimple_stmt_iterator loop_cond_gsi)
> ...
> tree loop_len_x = NULL_TREE;
>   FOR_EACH_VEC_ELT (*controls, i, rgc)
> if (!rgc->controls.is_empty ())
>   {
> ...
>
> /* Set up all controls for this group.  */
> if (direct_internal_fn_supported_p (IFN_SELECT_VL, iv_type,
>OPTIMIZE_FOR_SPEED))
>  test_ctrl
>= vect_set_loop_controls_by_select_vl (loop, loop_vinfo,
>   _seq, _seq,
>   rgc, niters, _len_x);
> ...
>   }
>
> static tree
> vect_set_loop_controls_by_select_vl (class loop *loop, loop_vec_info 
> loop_vinfo,
> gimple_seq *preheader_seq,
> gimple_seq *header_seq,
> rgroup_controls *rgc, tree niters, tree *x)
> {
>   tree compare_type = LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo);
>   tree iv_type = LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo);
>   /* We are not allowing masked approach in SELECT_VL.  */
>   gcc_assert (!LOOP_VINFO_FULLY_MASKED_P (loop_vinfo));
>
>   tree ctrl_type = rgc->type;
>   unsigned int nitems_per_iter = rgc->max_nscalars_per_iter * rgc->factor;
>   poly_uint64 nitems_per_ctrl = TYPE_VECTOR_SUBPARTS (ctrl_type) * 
> rgc->factor;
>   poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
>
>   /* Calculate the maximum number of item values that the rgroup
>  handles in total, the number that it handles for each iteration
>  of the vector loop.  */
>   tree nitems_total = niters;
>   if (nitems_per_iter != 1)
> {
>   /* We checked before setting LOOP_VINFO_USING_PARTIAL_VECTORS_P that
> these multiplications don't overflow.  */
>   tree compare_factor = build_int_cst (compare_type, nitems_per_iter);
>   nitems_total = gimple_build (preheader_seq, MULT_EXPR, compare_type,
>   nitems_total, compare_factor);
> }
>
>   /* Convert the comparison value to the IV type (either a no-op or
>  a promotion).  */
>   nitems_total = gimple_convert (preheader_seq, iv_type, nitems_total);
>
>   /* Create an induction variable that counts the number of items
>  processed.  */
>   tree index_before_incr, index_after_incr;
>   gimple_stmt_iterator incr_gsi;
>   bool insert_after;
>   standard_iv_increment_position (loop, &incr_gsi, &insert_after);
>
>   /* Test the decremented IV, which will never underflow 0 since we have
>  IFN_SELECT_VL to guarantee that.  */
>   tree test_limit = nitems_total;
>
>   /* Provide a definition of each control in the group.  */
>   tree ctrl;
>   unsigned int i;
>   FOR_EACH_VEC_ELT_REVERSE (rgc->controls, i, ctrl)
> {
>   /* Previous controls will cover BIAS items.  This control covers the
> next batch.  */
>   poly_uint64 bias = nitems_per_ctrl * i;
>   tree bias_tree = build_int_cst (iv_type, bias);
>
>   /* Rather than have a new IV that starts at TEST_LIMIT and goes down to
> BIAS, prefer to use the same TEST_LIMIT - BIAS based IV for each
> control and adjust the bound down by BIAS.  */
>   tree this_test_limit = test_limit;
>   if (i != 0)
> {
>  this_test_limit = gimple_build (preheader_seq, MAX_EXPR, iv_type,
>  this_test_limit, bias_tree);
>  this_test_limit = gimple_build (preheader_seq, MINUS_EXPR, iv_type,
>  this_test_limit, bias_tree);
> }
>
>   /* Create decrement IV.  */
>   create_iv (this_test_limit, MINUS_EXPR, ctrl, NULL_TREE, loop, 
> &incr_gsi,
> insert_after, &index_before_incr, &index_after_incr);
>
>   tree res_len;
>   if (rgc->controls.length () != 1)
> {
>  if (nitems_per_iter == 1)
>{
>  /* Generate length = (X - VF*I/N) capped to the range [0, VF/N]. */
>  /* step = VF * I / N.  */
>  tree step
> = build_int_cst (iv_type,
> exact_div (vf * i, rgc->controls.length ()));
>  /* Make sure (X - VF*I/N) never underflow zero.  */
>  tree max = gimple_build (header_seq, MAX_EXPR, iv_type, *x, step);
>  res_len
> = gimple_build (header_seq, MIN_EXPR, iv_type,
> index_before_incr,
> build_int_cst (iv_type, vf * nitems_per_iter));
>}
>  else
>{
>  /* For SLP, we can't allow non-VF number of elements to be
> processed in non-final iteration. We force the number of
> elements to be processed in each non-final iteration is VF
> elements. If we allow non-VF elements processing in non-final
> iteration will make SLP too complicated and produce inferior
> codegen.
>
>   For example:
>
>If non-final iteration process VF elements.
>
>  ...
>  .LEN_STORE (vectp_f.8_51, 128B, _71, { 1, 2, 1, 2 }, 0);
>  .LEN_STORE (vectp_f.8_56, 128B, _72, { 1, 2, 1, 2 }, 0);
>  ...
>
>If non-final iteration process non-VF elements.
>
>  ...
>  .LEN_STORE (vectp_f.8_51, 128B, _71, { 1, 2, 1, 2 }, 0);
>  if (_71 % 2 == 0)
>   .LEN_STORE (vectp_f.8_56, 128B, _72, { 1, 2, 1, 2 }, 0);
>  else
>   .LEN_STORE (vectp_f.8_56, 128B, _72, { 2, 1, 2, 1 }, 

Re: [PATCH] VECT: Add decrement IV iteration loop control by variable amount support

2023-04-26 Thread Richard Sandiford via Gcc-patches
"juzhe.zh...@rivai.ai"  writes:
> Thank you so much for pointing out this issue.
>
> After reading your comments carefully, I need to revise 
> "vect_set_loop_controls_by_while_len"  in  loop control like this:
>
> vect_set_loop_controls_by_while_len
> ... 
> tree X = NULL_TREE;
> FOR_EACH_VEC_ELT (rgc->controls, i, ctrl)
> ...
> if (i == 0) {
>   X = gimple_build (WHILE_LEN);
>   gimple_build_assign (ctrl, X);
> } else {
>   // (X - VF*I/N) capped to the range [0, VF/N]
>   tree t = gimple_build (MINUS, X, build_int_cst (VF*I/N));
>   gimple_build_assign (ctrl, t);
> }
> }
> 
>
> Am I understand your idea correctly ?

I think it's more that rgc->controls.length () == 1 is a special case,
rather than i == 0 being a special case.

That is, rgc->controls.length () == 1 can use a single WHILE_LEN to
calculate the number of scalars that will be processed by the current
loop iteration.  Let's call it X.  Then all rgroups with
rgc->controls.length () > 1 will be based on X rather than using
WHILE_LEN.  (And they would do that even for the first control in the
group, i.e. for i == 0.)

I'm not saying it has to be this way.  It might be that a different
arrangement is better for the later RVV processing.  But there needs
to be something in the gimple-level description, and something in
the optab documentation, that guarantees that whatever code we
generate for these cases works correctly.

BTW, very minor thing (I should have raised it earlier), but maybe
something like SELECT_VL would be a better name than WHILE_LEN?
WHILE_ULT means "while (IV) is unsigned less than" and so describes
an operation in terms of its arguments.  But I think WHILE_LEN is
named more for its use case than for what the operation does.

Thanks,
Richard


>
> So the example you shows in ARM SVE gimple IR, is like this:
>
> _3 =   [(long int *)_2];
>   vect__4.6_15 = .MASK_LOAD (_3, 64B, loop_mask_21); (INT64)
>   _5 =   [(long int *)_2 + POLY_INT_CST [16B, 
> 16B]];
>   vect__4.7_8 = .MASK_LOAD (_5, 64B, loop_mask_20);(INT64)
>   _7 =   [(long int *)_2 + POLY_INT_CST [32B, 
> 32B]];
>   vect__4.8_28 = .MASK_LOAD (_7, 64B, loop_mask_19);(INT64)
>   _24 =   [(long int *)_2 + POLY_INT_CST [48B, 
> 48B]];
>   vect__4.9_30 = .MASK_LOAD (_24, 64B, loop_mask_16); (INT64)
> vect__7.11_31 = VEC_PACK_TRUNC_EXPR ;
>   vect__7.11_32 = VEC_PACK_TRUNC_EXPR ;
>   vect__7.10_33 = VEC_PACK_TRUNC_EXPR ;
> ...
> .MASK_STORE (_13, 16B, loop_mask_36, vect__7.10_33); (INT16)
>
> If it is changed into WHILE_LEN style,  it should be:
>   
>X = WHILE_LEN;
> _3 =   [(long int *)_2];
>   vect__4.6_15 = .LEN_LOAD (_3, 64B, X - VF*1/N); (INT64)
>   _5 =   [(long int *)_2 + (X - VF*1/N)*8 ];
>   vect__4.7_8 = .LEN_LOAD (_5, 64B, X - VF*2/N);(INT64)
>   _7 =   [(long int *)_2 + (X - VF*2/N)*8];
>   vect__4.8_28 = .LEN_LOAD (_7, 64B, X - VF*3/N);(INT64)
>   _24 =   [(long int *)_2 + (X - VF*3/N)*8];
>   vect__4.9_30 = .LEN_LOAD (_24, 64B, X - VF*4/N); (INT64)
> vect__7.11_31 = VEC_PACK_TRUNC_EXPR ;
>   vect__7.11_32 = VEC_PACK_TRUNC_EXPR ;
>   vect__7.10_33 = VEC_PACK_TRUNC_EXPR ;
> ...
> .LEN_STORE (_13, 16B, X, vect__7.10_33); (INT16)
>
> Is this correct ? 
>
> Thanks.
>
>
> juzhe.zh...@rivai.ai
>  
> From: Richard Sandiford
> Date: 2023-04-26 16:06
> To: juzhe.zhong\@rivai.ai
> CC: gcc-patches; rguenther
> Subject: Re: [PATCH] VECT: Add decrement IV iteration loop control by 
> variable amount support
> "juzhe.zh...@rivai.ai"  writes:
>> Thanks Richard so much.
>>
>>>> I don't think that's guaranteed by the proposed definition of WHILE_LEN.
>>>> The first int64_t WHILE_LEN could come up short, and return something
>>>> less than VF/2.
>>
>> I am so sorry that the comments of vect_set_loop_controls_by_while_len
>> is totally misleading and incorrect and I have sent V3 patch to fix that.
>> Actually, I don't use WHILE_LEN in multi-rgroups situation, instead, I use 
>> MIN_EXPR
>> to force VF elements for each non-final iteration to make sure result is 
>> correct.
>>
>> Yes, I agree with you that WHILE_LEN will produce issues for SLP situation 
>> that
>> having multi-rgroups since WHILE_LEN definition is allow target produces 
>> non-VF
>> outcome in non-final iteration. 
>  
> Yeah, I'd read that you weren't using WHILE_LEN for SLP.  I was talking
> specifically about non-SLP though (nitems_per_iter == 1).  Consider:
>  
> void f(short *x, long *y) {
>   for (int i = 0; i < 100; ++i)
> x[i] = y[i];
> }
>  
> compiled at -O3 -fno-vect-cost-model for SVE:
>  
> whilelo p4.d, wzr, w6
> whilelo p3.d, wzr, w5
> whilelo p2.h, wzr, w3
> whilelo p1.d, wzr, w3
> whilelo p0.d, wzr, w4
> .L2:
> ld1d    z2.d, p0/z, [x1, #1, mul vl]
> ld1d    z0.d, p1/z, [x1]
> ld1d    z1.d, p3/z, [x1, #2, mul vl]
> uzp1    z0.s, z0.s, z2.s
> ld1d    z2.d, p4/z, [x1, #3, mul vl]
> uzp1    z1.s, z1.s, z2.s
> uzp1    z0.h, z0.h, z1.h
> st1h    z0.h, p2, [x0, x2, lsl 1]
> add x2, x2, 

Re: [PATCH] VECT: Add decrement IV iteration loop control by variable amount support

2023-04-26 Thread Richard Sandiford via Gcc-patches
"juzhe.zh...@rivai.ai"  writes:
> Thanks Richard so much.
>
>>> I don't think that's guaranteed by the proposed definition of WHILE_LEN.
>>> The first int64_t WHILE_LEN could come up short, and return something
>>> less than VF/2.
>
> I am so sorry that the comments of vect_set_loop_controls_by_while_len
> is totally misleading and incorrect and I have sent V3 patch to fix that.
> Actually, I don't use WHILE_LEN in multi-rgroups situation, instead, I use 
> MIN_EXPR
> to force VF elements for each non-final iteration to make sure result is 
> correct.
>
> Yes, I agree with you that WHILE_LEN will produce issues for SLP situation 
> that
> having multi-rgroups since WHILE_LEN definition is allow target produces 
> non-VF
> outcome in non-final iteration. 

Yeah, I'd read that you weren't using WHILE_LEN for SLP.  I was talking
specifically about non-SLP though (nitems_per_iter == 1).  Consider:

void f(short *x, long *y) {
  for (int i = 0; i < 100; ++i)
x[i] = y[i];
}

compiled at -O3 -fno-vect-cost-model for SVE:

whilelo p4.d, wzr, w6
whilelo p3.d, wzr, w5
whilelo p2.h, wzr, w3
whilelo p1.d, wzr, w3
whilelo p0.d, wzr, w4
.L2:
ld1d    z2.d, p0/z, [x1, #1, mul vl]
ld1d    z0.d, p1/z, [x1]
ld1d    z1.d, p3/z, [x1, #2, mul vl]
uzp1    z0.s, z0.s, z2.s
ld1d    z2.d, p4/z, [x1, #3, mul vl]
uzp1    z1.s, z1.s, z2.s
uzp1    z0.h, z0.h, z1.h
st1h    z0.h, p2, [x0, x2, lsl 1]
add x2, x2, x8
whilelo p2.h, w2, w3
whilelo p4.d, w2, w6
whilelo p3.d, w2, w5
whilelo p0.d, w2, w4
add x1, x1, x7
whilelo p1.d, w2, w3
b.any   .L2

This is a non-SLP loop.  We have two rgroups: a single-mask/control
rgroup for the short vector, and a 4-mask/control rgroup for the long
vector.  And the loop converts the Nth long scalar (selected from 4
concatenated vectors) to the Nth short scalar (in a single vector).

It's therefore important that the 4-mask/control rgroup and the
single-mask/control rgroup treat the same lanes/scalar iterations
as active and the same lanes/scalar iterations as inactive.

But if I read the code correctly, the patch would generate 5 WHILE_LENs
for this case, since nitems_per_iter==1 for all 5 controls.  And I don't
think the documentation of WHILE_LEN guarantees that that will work
correctly, given that WHILE_LEN isn't a simple MIN operation.

It might be that it works correctly on RVV, given the later
backend-specific processing.  But I'm looking at this as a purely
gimple thing.  If something guarantees that the above works then
I think the WHILE_LEN documentation needs to be updated.

From the current documentation of WHILE_LEN, I think the safe
approach would be to use WHILE_LEN for a single-control rgroup
and then "expand" that to larger control rgroups using arithmetic.
Specifically, if the length computed by the single-control rgroup
is X, then control I in an N-control rgroup would need to be:

   (X - VF*I/N) capped to the range [0, VF/N]

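In gimple-ish pseudocode (purely illustrative; the variable names and
the exact way X is obtained are made up), that would be:

   X = WHILE_LEN (remaining_scalars, ...);   ;; single-control rgroup
   ;; control I of an N-control rgroup:
   tmp_I  = MAX (X, VF*I/N) - VF*I/N;        ;; avoid underflow
   ctrl_I = MIN (tmp_I, VF/N);               ;; cap to VF/N
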
SVE does something similar for:

void f(short *x, int *y) {
  for (int i = 0; i < 100; ++i)
x[i] = y[i];
}

Here we use a single WHILELO and then unpack it, rather than use
three independent WHILELOs:

whilelo p0.h, wzr, w3
.L2:
punpklo p2.h, p0.b
punpkhi p1.h, p0.b
ld1w    z0.s, p2/z, [x1, x2, lsl 2]
ld1w    z1.s, p1/z, [x5, x2, lsl 2]
uzp1    z0.h, z0.h, z1.h
st1h    z0.h, p0, [x0, x2, lsl 1]
add x2, x2, x4
whilelo p0.h, w2, w3
b.any   .L2

This is dreadful code (hence the -fno-vect-cost-model).  And I'm sure
it's equally bad for RVV.  But the point is that we can generate it,
and in more exotic cases it might even be worthwhile.

Thanks,
Richard


Re: [PATCH] VECT: Add decrement IV iteration loop control by variable amount support

2023-04-26 Thread Richard Sandiford via Gcc-patches
Richard Biener  writes:
> On Tue, 25 Apr 2023, Richard Sandiford wrote:
>> juzhe.zh...@rivai.ai writes:
>> > diff --git a/gcc/tree-ssa-loop-manip.cc b/gcc/tree-ssa-loop-manip.cc
>> > index a52277abdbf..54845a62298 100644
>> > --- a/gcc/tree-ssa-loop-manip.cc
>> > +++ b/gcc/tree-ssa-loop-manip.cc
>> > @@ -59,14 +59,14 @@ static bitmap_obstack loop_renamer_obstack;
>> >  void
>> >  create_iv (tree base, tree step, tree var, class loop *loop,
>> >   gimple_stmt_iterator *incr_pos, bool after,
>> > - tree *var_before, tree *var_after)
>> > + tree *var_before, tree *var_after, enum tree_code code)
>> 
>> The comment needs to be updated to describe the new interface.
>> 
>> This is personal preference, but: I think the interface would be
>> clearer if the code argument came between the base and step,
>> so that the order matches a SCEV.  The code could no longer be
>> a default argument, and so all callers would need to be updated,
>> but IMO that's OK.  Not sure what others think though.
>
> Just a quick comment - I think decrementing IVs are already supported,
> you just have to make 'step' negative (or large positive).  If you
> really think using MINUS_EXPR is better or even required for
> VLA step please add an assert that 'code' is either PLUS_EXPR or
> MINUS_EXPR.
>
> Note that for INTEGER_CST step we rewrite x - CST to x + -CST
> during folding.

Yeah.  I think the problem in this case is that the step is variable.
So if we only supported PLUS_EXPRs, we'd need a separate NEGATE_EXPR
stmt (which presumably would be folded in later).
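
I.e., as a rough sketch (made-up SSA names):

   neg_step_5 = -step_3;                 ;; separate NEGATE_EXPR stmt
   iv_next_6 = iv_4 + neg_step_5;        ;; PLUS_EXPR increment

rather than a single "iv_next_6 = iv_4 - step_3" MINUS_EXPR increment.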

Thanks,
Richard


Re: [PATCH] VECT: Add decrement IV iteration loop control by variable amount support

2023-04-25 Thread Richard Sandiford via Gcc-patches
juzhe.zh...@rivai.ai writes:
> diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
> index 6e81dc05e0e..5f44def90d3 100644
> --- a/gcc/internal-fn.cc
> +++ b/gcc/internal-fn.cc
> @@ -127,6 +127,7 @@ init_internal_fns ()
>  #define cond_binary_direct { 1, 1, true }
>  #define cond_ternary_direct { 1, 1, true }
>  #define while_direct { 0, 2, false }
> +#define while_len_direct { 0, 0, false }
>  #define fold_extract_direct { 2, 2, false }
>  #define fold_left_direct { 1, 1, false }
>  #define mask_fold_left_direct { 1, 1, false }
> @@ -3702,6 +3703,33 @@ expand_while_optab_fn (internal_fn, gcall *stmt, 
> convert_optab optab)
>  emit_move_insn (lhs_rtx, ops[0].value);
>  }
>  
> +/* Expand WHILE_LEN call STMT using optab OPTAB.  */
> +static void
> +expand_while_len_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
> +{
> +  expand_operand ops[3];
> +  tree rhs_type[2];
> +
> +  tree lhs = gimple_call_lhs (stmt);
> +  tree lhs_type = TREE_TYPE (lhs);
> +  rtx lhs_rtx = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
> +  create_output_operand ([0], lhs_rtx, TYPE_MODE (lhs_type));
> +
> +  for (unsigned int i = 0; i < gimple_call_num_args (stmt); ++i)
> +{
> +  tree rhs = gimple_call_arg (stmt, i);
> +  rhs_type[i] = TREE_TYPE (rhs);
> +  rtx rhs_rtx = expand_normal (rhs);
> +  create_input_operand ([i + 1], rhs_rtx, TYPE_MODE (rhs_type[i]));
> +}
> +
> +  insn_code icode = direct_optab_handler (optab, TYPE_MODE (rhs_type[0]));
> +
> +  expand_insn (icode, 3, ops);
> +  if (!rtx_equal_p (lhs_rtx, ops[0].value))
> +emit_move_insn (lhs_rtx, ops[0].value);
> +}

Is this ifn-specific handling needed?  From the description, it seems
like WHILE_LEN could be a normal binary internal function.
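
I.e., as an untested sketch, something along the lines of:

  DEF_INTERNAL_OPTAB_FN (WHILE_LEN, ECF_CONST | ECF_NOTHROW, while_len, binary)

using the generic binary direct-optab support, so that the dedicated
expand_while_len_optab_fn above wouldn't be needed.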

> +
>  /* Expand a call to a convert-like optab using the operands in STMT.
> FN has a single output operand and NARGS input operands.  */
>  
> @@ -3843,6 +3871,7 @@ multi_vector_optab_supported_p (convert_optab optab, 
> tree_pair types,
>  #define direct_scatter_store_optab_supported_p convert_optab_supported_p
>  #define direct_len_store_optab_supported_p direct_optab_supported_p
>  #define direct_while_optab_supported_p convert_optab_supported_p
> +#define direct_while_len_optab_supported_p direct_optab_supported_p
>  #define direct_fold_extract_optab_supported_p direct_optab_supported_p
>  #define direct_fold_left_optab_supported_p direct_optab_supported_p
>  #define direct_mask_fold_left_optab_supported_p direct_optab_supported_p
> diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
> index 7fe742c2ae7..3a933abff5d 100644
> --- a/gcc/internal-fn.def
> +++ b/gcc/internal-fn.def
> @@ -153,6 +153,7 @@ DEF_INTERNAL_OPTAB_FN (VEC_SET, 0, vec_set, vec_set)
>  DEF_INTERNAL_OPTAB_FN (LEN_STORE, 0, len_store, len_store)
>  
>  DEF_INTERNAL_OPTAB_FN (WHILE_ULT, ECF_CONST | ECF_NOTHROW, while_ult, while)
> +DEF_INTERNAL_OPTAB_FN (WHILE_LEN, ECF_CONST | ECF_NOTHROW, while_len, 
> while_len)
>  DEF_INTERNAL_OPTAB_FN (CHECK_RAW_PTRS, ECF_CONST | ECF_NOTHROW,
>  check_raw_ptrs, check_ptrs)
>  DEF_INTERNAL_OPTAB_FN (CHECK_WAR_PTRS, ECF_CONST | ECF_NOTHROW,
> diff --git a/gcc/optabs.def b/gcc/optabs.def
> index 695f5911b30..f5938bd2c24 100644
> --- a/gcc/optabs.def
> +++ b/gcc/optabs.def
> @@ -476,3 +476,4 @@ OPTAB_DC (vec_series_optab, "vec_series$a", VEC_SERIES)
>  OPTAB_D (vec_shl_insert_optab, "vec_shl_insert_$a")
>  OPTAB_D (len_load_optab, "len_load_$a")
>  OPTAB_D (len_store_optab, "len_store_$a")
> +OPTAB_D (while_len_optab, "while_len$a")
> diff --git a/gcc/tree-ssa-loop-manip.cc b/gcc/tree-ssa-loop-manip.cc
> index a52277abdbf..54845a62298 100644
> --- a/gcc/tree-ssa-loop-manip.cc
> +++ b/gcc/tree-ssa-loop-manip.cc
> @@ -59,14 +59,14 @@ static bitmap_obstack loop_renamer_obstack;
>  void
>  create_iv (tree base, tree step, tree var, class loop *loop,
>  gimple_stmt_iterator *incr_pos, bool after,
> -tree *var_before, tree *var_after)
> +tree *var_before, tree *var_after, enum tree_code code)

The comment needs to be updated to describe the new interface.

This is personal preference, but: I think the interface would be
clearer if the code argument came between the base and step,
so that the order matches a SCEV.  The code could no longer be
a default argument, and so all callers would need to be updated,
but IMO that's OK.  Not sure what others think though.

>  {
>gassign *stmt;
>gphi *phi;
>tree initial, step1;
>gimple_seq stmts;
>tree vb, va;
> -  enum tree_code incr_op = PLUS_EXPR;
> +  enum tree_code incr_op = code;
>edge pe = loop_preheader_edge (loop);
>  
>if (var != NULL_TREE)
> diff --git a/gcc/tree-ssa-loop-manip.h b/gcc/tree-ssa-loop-manip.h
> index d49273a3987..da755320a3a 100644
> --- a/gcc/tree-ssa-loop-manip.h
> +++ b/gcc/tree-ssa-loop-manip.h
> @@ -23,7 +23,7 @@ along with GCC; see the file COPYING3.  If not see
>  typedef void (*transform_callback)(class loop *, void *);
>  
>  extern 

Re: [aarch64] Code-gen for vector initialization involving constants

2023-04-25 Thread Richard Sandiford via Gcc-patches
Prathamesh Kulkarni  writes:
> Hi Richard,
> While digging thru aarch64_expand_vector_init, I noticed it gives
> priority to loading a constant first:
>  /* Initialise a vector which is part-variable.  We want to first try
>  to build those lanes which are constant in the most efficient way we
>  can.  */
>
> which results in suboptimal code-gen for following case:
> int16x8_t f_s16(int16_t x)
> {
>   return (int16x8_t) { x, x, x, x, x, x, x, 1 };
> }
>
> code-gen trunk:
> f_s16:
> movi    v0.8h, 0x1
> ins v0.h[0], w0
> ins v0.h[1], w0
> ins v0.h[2], w0
> ins v0.h[3], w0
> ins v0.h[4], w0
> ins v0.h[5], w0
> ins v0.h[6], w0
> ret
>
> The attached patch tweaks the following condition:
> if (n_var == n_elts && n_elts <= 16)
>   {
> ...
>   }
>
> to pass if maxv >= 80% of n_elts, with 80% being an
> arbitrary "high enough" threshold. The intent is to dup
> the most repeated variable if its repetition count
> is "high enough", and insert the constants, which should be "better" than
> loading the constant first and inserting variables as in the above case.

I'm not too keen on the 80%.  Like you say, it seems a bit arbitrary.

The case above can also be handled by relaxing n_var == n_elts to
n_var >= n_elts - 1, so that if there's just one constant element,
we look for duplicated variable elements.  If there are none
(maxv == 1), but there is a constant element, we can duplicate
the constant element into a register.
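
I.e., roughly (sketch only, in the style of the snippet above):

  if (n_var >= n_elts - 1 && n_elts <= 16)
    {
      ...
      /* If maxv == 1 but one element is constant, dup that constant
         element into a register first, then insert the variable lanes.  */
    }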

The case when there's more than one constant element needs more thought
(and testcases :-)).  E.g. after a certain point, it would probably be
better to load the variable and constant parts separately and blend them
using TBL.  It also matters whether the constants are equal or not.

There are also cases that could be handled using EXT.

Plus, if we're inserting many variable elements that are already
in GPRs, we can probably do better by coalescing them into bigger
GPR values and inserting them as wider elements.

Because of things like that, I think we should stick to the
single-constant case for now.

Thanks,
Richard


Re: [ping][vect-patterns] Refactor widen_plus/widen_minus as internal_fns

2023-04-24 Thread Richard Sandiford via Gcc-patches
Richard Biener  writes:
> On Thu, Apr 20, 2023 at 3:24 PM Andre Vieira (lists) via Gcc-patches
>  wrote:
>>
>> Rebased all three patches and made some small changes to the second one:
>> - removed sub and abd optabs from commutative_optab_p, I suspect this
>> was a copy paste mistake,
>> - removed what I believe to be a superfluous switch case in vectorizable
>> conversion, the one that was here:
>> +  if (code.is_fn_code ())
>> + {
>> +  internal_fn ifn = as_internal_fn (code.as_fn_code ());
>> +  int ecf_flags = internal_fn_flags (ifn);
>> +  gcc_assert (ecf_flags & ECF_MULTI);
>> +
>> +  switch (code.as_fn_code ())
>> +   {
>> +   case CFN_VEC_WIDEN_PLUS:
>> + break;
>> +   case CFN_VEC_WIDEN_MINUS:
>> + break;
>> +   case CFN_LAST:
>> +   default:
>> + return false;
>> +   }
>> +
>> +  internal_fn lo, hi;
>> +  lookup_multi_internal_fn (ifn, &lo, &hi);
>> +  *code1 = as_combined_fn (lo);
>> +  *code2 = as_combined_fn (hi);
>> +  optab1 = lookup_multi_ifn_optab (lo, !TYPE_UNSIGNED (vectype));
>> +  optab2 = lookup_multi_ifn_optab (hi, !TYPE_UNSIGNED (vectype));
>>   }
>>
>> I don't think we need to check that they are a specific fn code, as we look up
>> optabs and if they succeed then surely we can vectorize?
>>
>> OK for trunk?
>
> In the first patch I see some uses of safe_as_tree_code like
>
> +  if (ch.is_tree_code ())
> +return op1 == NULL_TREE ? gimple_build_assign (lhs,
> ch.safe_as_tree_code (),
> +  op0) :
> + gimple_build_assign (lhs, ch.safe_as_tree_code 
> (),
> +  op0, op1);
> +  else
> +  {
> +internal_fn fn = as_internal_fn (ch.safe_as_fn_code ());
> +gimple* stmt;
>
> where the context actually requires a valid tree code.  Please change those
> to force to tree code / ifn code.  Just use explicit casts here and the other
> places that are similar.  Before the as_internal_fn just put a
> gcc_assert (ch.is_internal_fn ()).

Also, doesn't the above ?: simplify to the "else" arm?  Null trailing
arguments would be ignored for unary operators.
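
I.e. just (sketch):

  if (ch.is_tree_code ())
    return gimple_build_assign (lhs, ch.safe_as_tree_code (), op0, op1);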

I wasn't sure what to make of the op0 handling:

> +/* Build a GIMPLE_ASSIGN or GIMPLE_CALL with the tree_code,
> +   or internal_fn contained in ch, respectively.  */
> +gimple *
> +vect_gimple_build (tree lhs, code_helper ch, tree op0, tree op1)
> +{
> +  if (op0 == NULL_TREE)
> +return NULL;

Can that happen, and if so, does returning null make sense?
Maybe an assert would be safer.
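
E.g. (sketch):

  gcc_checking_assert (op0 != NULL_TREE);

instead of the early return.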

Thanks,
Richard


Re: [PATCH] RFC: New compact syntax for insn and insn_split in Machine Descriptions

2023-04-24 Thread Richard Sandiford via Gcc-patches
Tamar Christina  writes:
>> -Original Message-
>> From: Richard Sandiford 
>> Sent: Friday, April 21, 2023 6:19 PM
>> To: Tamar Christina 
>> Cc: gcc-patches@gcc.gnu.org; nd ; Richard Earnshaw
>> 
>> Subject: Re: [PATCH] RFC: New compact syntax for insn and insn_split in
>> Machine Descriptions
>> 
>> Tamar Christina  writes:
>> > Hi All,
>> >
>> > This patch adds support for a compact syntax for specifying
>> > constraints in instruction patterns. Credit for the idea goes to Richard
>> Earnshaw.
>> >
>> > I am sending up this RFC to get feedback for its inclusion in GCC 14.
>> > With this new syntax we want a clean break from the current
>> > limitations to make something that is hopefully easier to use and maintain.
>> >
>> > The idea behind this compact syntax is that often times it's quite
>> > hard to correlate the entries in the constraints list, attributes and 
>> > instruction
>> lists.
>> >
>> > One has to count and this often is tedious.  Additionally when
>> > changing a single line in the insn multiple lines in a diff change,
>> > making it harder to see what's going on.
>> >
>> > This new syntax takes into account many of the common things that are
>> done in MD
>> > files.   It's also worth saying that this version is intended to deal with 
>> > the
>> > common case of a string based alternatives.   For C chunks we have some
>> ideas
>> > but those are not intended to be addressed here.
>> >
>> > It's easiest to explain with an example:
>> >
>> > normal syntax:
>> >
>> > (define_insn_and_split "*movsi_aarch64"
>> >   [(set (match_operand:SI 0 "nonimmediate_operand" "=r,k,r,r,r,r, r,w, m, 
>> > m,
>> r,  r,  r, w,r,w, w")
>> >(match_operand:SI 1 "aarch64_mov_operand"  "
>> r,r,k,M,n,Usv,m,m,rZ,w,Usw,Usa,Ush,rZ,w,w,Ds"))]
>> >   "(register_operand (operands[0], SImode)
>> > || aarch64_reg_or_zero (operands[1], SImode))"
>> >   "@
>> >mov\\t%w0, %w1
>> >mov\\t%w0, %w1
>> >mov\\t%w0, %w1
>> >mov\\t%w0, %1
>> >#
>> >* return aarch64_output_sve_cnt_immediate (\"cnt\", \"%x0\",
>> operands[1]);
>> >ldr\\t%w0, %1
>> >ldr\\t%s0, %1
>> >str\\t%w1, %0
>> >str\\t%s1, %0
>> >adrp\\t%x0, %A1\;ldr\\t%w0, [%x0, %L1]
>> >adr\\t%x0, %c1
>> >adrp\\t%x0, %A1
>> >fmov\\t%s0, %w1
>> >fmov\\t%w0, %s1
>> >fmov\\t%s0, %s1
>> >* return aarch64_output_scalar_simd_mov_immediate (operands[1],
>> SImode);"
>> >   "CONST_INT_P (operands[1]) && !aarch64_move_imm (INTVAL
>> (operands[1]), SImode)
>> > && REG_P (operands[0]) && GP_REGNUM_P (REGNO (operands[0]))"
>> >[(const_int 0)]
>> >"{
>> >aarch64_expand_mov_immediate (operands[0], operands[1]);
>> >DONE;
>> > }"
>> >   ;; The "mov_imm" type for CNT is just a placeholder.
>> >   [(set_attr "type"
>> "mov_reg,mov_reg,mov_reg,mov_imm,mov_imm,mov_imm,load_4,
>> >
>> load_4,store_4,store_4,load_4,adr,adr,f_mcr,f_mrc,fmov,neon_move")
>> >(set_attr "arch"   "*,*,*,*,*,sve,*,fp,*,fp,*,*,*,fp,fp,fp,simd")
>> >(set_attr "length" "4,4,4,4,*,  4,4, 4,4, 4,8,4,4, 4, 4, 4,   4")
>> > ]
>> > )
>> >
>> > New syntax:
>> >
>> > (define_insn_and_split "*movsi_aarch64"
>> >   [(set (match_operand:SI 0 "nonimmediate_operand")
>> >(match_operand:SI 1 "aarch64_mov_operand"))]
>> >   "(register_operand (operands[0], SImode)
>> > || aarch64_reg_or_zero (operands[1], SImode))"
>> >   "@@ (cons: 0 1; attrs: type arch length)
>> >[=r, r  ; mov_reg  , *   , 4] mov\t%w0, %w1
>> >[k , r  ; mov_reg  , *   , 4] ^
>> >[r , k  ; mov_reg  , *   , 4] ^
>> >[r , M  ; mov_imm  , *   , 4] mov\t%w0, %1
>> >[r , n  ; mov_imm  , *   , *] #
>> >[r , Usv; mov_imm  , sve , 4] << aarch64_output_sve_cnt_immediate 
>> > ('cnt',
>> '%x0', operands[1]);
>> >[r , m  ; load_4   , *   , 4] ldr\t%w0, %1
>> >[w , m  ; load_4   , fp  , 4] ldr\t%s0, %1
>> >[m , rZ ; store_4  , *   , 4] str\t%w1, %0
>> >[m , w  ; store_4  , fp  , 4] str\t%s1, %0
>> >[r , Usw; load_4   , *   , 8] adrp\t%x0, %A1;ldr\t%w0, [%x0, %L1]
>> >[r , Usa; adr  , *   , 4] adr\t%x0, %c1
>> >[r , Ush; adr  , *   , 4] adrp\t%x0, %A1
>> >[w , rZ ; f_mcr, fp  , 4] fmov\t%s0, %w1
>> >[r , w  ; f_mrc, fp  , 4] fmov\t%w0, %s1
>> >[w , w  ; fmov , fp  , 4] fmov\t%s0, %s1
>> >[w , Ds ; neon_move, simd, 4] <<
>> aarch64_output_scalar_simd_mov_immediate (operands[1], SImode);"
>> >   "CONST_INT_P (operands[1]) && !aarch64_move_imm (INTVAL
>> (operands[1]), SImode)
>> > && REG_P (operands[0]) && GP_REGNUM_P (REGNO (operands[0]))"
>> >   [(const_int 0)]
>> >   {
>> > aarch64_expand_mov_immediate (operands[0], operands[1]);
>> > DONE;
>> >   }
>> >   ;; The "mov_imm" type for CNT is just a placeholder.
>> > )
>> >
>> > The patch contains some more rewritten examples for both Arm and
>> > AArch64.  I have included them for examples in this RFC but the final
>> > version posted in GCC 14 will have these split out.
>> >
>> > The main syntax rules 

Re: [match.pd] [SVE] Add pattern to transform svrev(svrev(v)) --> v

2023-04-24 Thread Richard Sandiford via Gcc-patches
Prathamesh Kulkarni  writes:
> gcc/ChangeLog:
>   * tree-ssa-forwprop.cc (is_combined_permutation_identity): Try to
>   simplify two successive VEC_PERM_EXPRs with single operand and same
>   mask, where mask chooses elements in reverse order.
>
> gcc/testesuite/ChangeLog:
>   * gcc.target/aarch64/sve/acle/general/rev-1.c: New test.
>
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/general/rev-1.c 
> b/gcc/testsuite/gcc.target/aarch64/sve/acle/general/rev-1.c
> new file mode 100644
> index 000..e57ee67d716
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/general/rev-1.c
> @@ -0,0 +1,12 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -fdump-tree-optimized" } */
> +
> +#include <arm_sve.h>
> +
> +svint32_t f(svint32_t v)
> +{
> +  return svrev_s32 (svrev_s32 (v));
> +}
> +
> +/* { dg-final { scan-tree-dump "return v_1\\(D\\)" "optimized" } } */
> +/* { dg-final { scan-tree-dump-not "VEC_PERM_EXPR" "optimized" } } */
> diff --git a/gcc/tree-ssa-forwprop.cc b/gcc/tree-ssa-forwprop.cc
> index 9b567440ba4..61df7efe82c 100644
> --- a/gcc/tree-ssa-forwprop.cc
> +++ b/gcc/tree-ssa-forwprop.cc
> @@ -2541,6 +2541,27 @@ is_combined_permutation_identity (tree mask1, tree 
> mask2)
>  
>gcc_checking_assert (TREE_CODE (mask1) == VECTOR_CST
>  && TREE_CODE (mask2) == VECTOR_CST);
> +
> +  /* For VLA masks, check for the following pattern:
> + v1 = VEC_PERM_EXPR (v0, v0, mask1)
> + v2 = VEC_PERM_EXPR (v1, v1, mask2)

Maybe blank out the second operands using "...":

 v1 = VEC_PERM_EXPR (v0, ..., mask1)
 v2 = VEC_PERM_EXPR (v1, ..., mask2)

to make it clear that they don't matter.

OK with that change, thanks.

Richard

> + -->
> + v2 = v0
> + if mask1 == mask2 == {nelts - 1, nelts - 2, ...}.  */
> +
> +  if (operand_equal_p (mask1, mask2, 0)
> +  && !VECTOR_CST_NELTS (mask1).is_constant ())
> +{
> +  vec_perm_builder builder;
> +  if (tree_to_vec_perm_builder (&builder, mask1))
> + {
> +   poly_uint64 nelts = TYPE_VECTOR_SUBPARTS (TREE_TYPE (mask1));
> +   vec_perm_indices sel (builder, 1, nelts);
> +   if (sel.series_p (0, 1, nelts - 1, -1))
> + return 1;
> + }
> +}
> +
>mask = fold_ternary (VEC_PERM_EXPR, TREE_TYPE (mask1), mask1, mask1, 
> mask2);
>if (mask == NULL_TREE || TREE_CODE (mask) != VECTOR_CST)
>  return 0;


Re: [aarch64] Use dup and zip1 for interleaving elements in initializing vector

2023-04-24 Thread Richard Sandiford via Gcc-patches
Prathamesh Kulkarni  writes:
> [aarch64] Recursively initialize even and odd sub-parts and merge with zip1.
>
> gcc/ChangeLog:
>   * config/aarch64/aarch64.cc (aarch64_expand_vector_init_fallback): 
> Rename
>   aarch64_expand_vector_init to this, and remove  interleaving case.
>   Recursively call aarch64_expand_vector_init_fallback, instead of
>   aarch64_expand_vector_init.
>   (aarch64_unzip_vector_init): New function.
>   (aarch64_expand_vector_init): Likewise.
>
> gcc/testsuite/ChangeLog:
>   * gcc.target/aarch64/ldp_stp_16.c (cons2_8_float): Adjust for new
>   code-gen.
>   * gcc.target/aarch64/sve/acle/general/dupq_5.c: Likewise.
>   * gcc.target/aarch64/sve/acle/general/dupq_6.c: Likewise.
>   * gcc.target/aarch64/vec-init-18.c: Rename interleave-init-1.c to
>   this.
>   * gcc.target/aarch64/vec-init-19.c: New test.
>   * gcc.target/aarch64/vec-init-20.c: Likewise.
>   * gcc.target/aarch64/vec-init-21.c: Likewise.
>   * gcc.target/aarch64/vec-init-22-size.c: Likewise.
>   * gcc.target/aarch64/vec-init-22-speed.c: Likewise.
>   * gcc.target/aarch64/vec-init-22.h: New header.
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index d7e895f8d34..416e062829c 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -22026,11 +22026,12 @@ aarch64_simd_make_constant (rtx vals)
>  return NULL_RTX;
>  }
>  
> -/* Expand a vector initialisation sequence, such that TARGET is
> -   initialised to contain VALS.  */
> +/* A subroutine of aarch64_expand_vector_init, with the same interface.
> +   The caller has already tried a divide-and-conquer approach, so do
> +   not consider that case here.  */
>  
>  void
> -aarch64_expand_vector_init (rtx target, rtx vals)
> +aarch64_expand_vector_init_fallback (rtx target, rtx vals)
>  {
>machine_mode mode = GET_MODE (target);
>scalar_mode inner_mode = GET_MODE_INNER (mode);
> @@ -22090,38 +22091,6 @@ aarch64_expand_vector_init (rtx target, rtx vals)
>return;
>  }
>  
> -  /* Check for interleaving case.
> - For eg if initializer is (int16x8_t) {x, y, x, y, x, y, x, y}.
> - Generate following code:
> - dup v0.h, x
> - dup v1.h, y
> - zip1 v0.h, v0.h, v1.h
> - for "large enough" initializer.  */
> -
> -  if (n_elts >= 8)
> -{
> -  int i;
> -  for (i = 2; i < n_elts; i++)
> - if (!rtx_equal_p (XVECEXP (vals, 0, i), XVECEXP (vals, 0, i % 2)))
> -   break;
> -
> -  if (i == n_elts)
> - {
> -   machine_mode mode = GET_MODE (target);
> -   rtx dest[2];
> -
> -   for (int i = 0; i < 2; i++)
> - {
> -   rtx x = expand_vector_broadcast (mode, XVECEXP (vals, 0, i));
> -   dest[i] = force_reg (mode, x);
> - }
> -
> -   rtvec v = gen_rtvec (2, dest[0], dest[1]);
> -   emit_set_insn (target, gen_rtx_UNSPEC (mode, v, UNSPEC_ZIP1));
> -   return;
> - }
> -}
> -
>enum insn_code icode = optab_handler (vec_set_optab, mode);
>gcc_assert (icode != CODE_FOR_nothing);
>  
> @@ -22243,7 +22212,7 @@ aarch64_expand_vector_init (rtx target, rtx vals)
>   }
> XVECEXP (copy, 0, i) = subst;
>   }
> -  aarch64_expand_vector_init (target, copy);
> +  aarch64_expand_vector_init_fallback (target, copy);
>  }
>  
>/* Insert the variable lanes directly.  */
> @@ -22257,6 +6,81 @@ aarch64_expand_vector_init (rtx target, rtx vals)
>  }
>  }
>  
> +/* Return even or odd half of VALS depending on EVEN_P.  */
> +
> +static rtx
> +aarch64_unzip_vector_init (machine_mode mode, rtx vals, bool even_p)
> +{
> +  int n = XVECLEN (vals, 0);
> +  machine_mode new_mode
> += aarch64_simd_container_mode (GET_MODE_INNER (mode),
> +GET_MODE_BITSIZE (mode).to_constant () / 2);
> +  rtvec vec = rtvec_alloc (n / 2);
> +  for (int i = 0; i < n/2; i++)

Formatting nit: n / 2

> +RTVEC_ELT (vec, i) = (even_p) ? XVECEXP (vals, 0, 2 * i)
> +   : XVECEXP (vals, 0, 2 * i + 1);
> +  return gen_rtx_PARALLEL (new_mode, vec);
> +}
> +
> +/* Expand a vector initialisation sequence, such that TARGET is

initialization

> +   initialized to contain VALS.  */
> +
> +void
> +aarch64_expand_vector_init (rtx target, rtx vals)
> +{
> +  /* Try decomposing the initializer into even and odd halves and
> + then ZIP them together.  Use the resulting sequence if it is
> + strictly cheaper than loading VALS directly.
> +
> + Prefer the fallback sequence in the event of a tie, since it
> + will tend to use fewer registers.  */
> +
> +  machine_mode mode = GET_MODE (target);
> +  int n_elts = XVECLEN (vals, 0);
> +
> +  if (n_elts < 4
> +  || maybe_ne (GET_MODE_BITSIZE (mode), 128))
> +{
> +  aarch64_expand_vector_init_fallback (target, vals);
> +  return;
> +}
> +
> +  start_sequence ();
> +  rtx halves[2];
> +  unsigned 

Re: [PATCH] aarch64: PR target/109406 Add support for SVE2 unpredicated MUL

2023-04-24 Thread Richard Sandiford via Gcc-patches
Kyrylo Tkachov  writes:
> Hi all,
>
> SVE2 supports an unpredicated vector integer MUL form that we can emit from 
> our SVE expanders
> without using up a predicate register.  This patch does so.
> As the SVE MUL expansion currently is templated away through a code iterator 
> I did not split it
> off just for this case but instead special-cased it in the define_expand. It 
> seemed somewhat less
> invasive than the alternatives but I could split it off more explicitly if 
> others want to.
> The div-by-bitmask_1.c testcase is adjusted to expect this new MUL form.
>
> Bootstrapped and tested on aarch64-none-linux-gnu.
>
> Ok for trunk?
> Thanks,
> Kyrill
>
> gcc/ChangeLog:
>
>   PR target/109406
>   * config/aarch64/aarch64-sve.md (<optab><mode>3): Handle TARGET_SVE2 MUL
>   case.
>   * config/aarch64/aarch64-sve2.md (*aarch64_mul_unpredicated_<mode>): New
>   pattern.
>
> gcc/testsuite/ChangeLog:
>
>   PR target/109406
>   * gcc.target/aarch64/sve2/div-by-bitmask_1.c: Adjust for unpredicated 
> SVE2
>   MUL.
>   * gcc.target/aarch64/sve2/unpred_mul_1.c: New test.

LGTM.

Thanks,
Richard

> diff --git a/gcc/config/aarch64/aarch64-sve.md 
> b/gcc/config/aarch64/aarch64-sve.md
> index 
> b11b55f7ac718db199920b61bf3e4b4881c69660..4b4c02c90fec6ce1ff15a8b2a5df348224a307b7
>  100644
> --- a/gcc/config/aarch64/aarch64-sve.md
> +++ b/gcc/config/aarch64/aarch64-sve.md
> @@ -3657,6 +3657,15 @@ (define_expand "<optab><mode>3"
> UNSPEC_PRED_X))]
>"TARGET_SVE"
>{
> +/* SVE2 supports the MUL (vectors, unpredicated) form.  Emit the simple
> +   pattern for it here rather than splitting off the MULT expander
> +   separately.  */
> +if (TARGET_SVE2 && <CODE> == MULT)
> +  {
> + emit_move_insn (operands[0], gen_rtx_MULT (<MODE>mode,
> +operands[1], operands[2]));
> + DONE;
> +  }
>  operands[3] = aarch64_ptrue_reg (<VPRED>mode);
>}
>  )
> diff --git a/gcc/config/aarch64/aarch64-sve2.md 
> b/gcc/config/aarch64/aarch64-sve2.md
> index 
> 2346f9f835d26f5b87afd47cdc9e44f9f47604ed..da8a424dd57fc5482cb20ba417d4141148ac61b6
>  100644
> --- a/gcc/config/aarch64/aarch64-sve2.md
> +++ b/gcc/config/aarch64/aarch64-sve2.md
> @@ -189,7 +189,7 @@ (define_insn 
> "@aarch64_scatter_stnt_"
>  ;; -
>  ;;  [INT] Multiplication
>  ;; -
> -;; Includes the lane forms of:
> +;; Includes the lane and unpredicated forms of:
>  ;; - MUL
>  ;; -
>  
> @@ -205,6 +205,21 @@ (define_insn "@aarch64_mul_lane_"
>"mul\t%0., %1., %2.[%3]"
>  )
>  
> +;; The 2nd and 3rd alternatives are valid for just TARGET_SVE as well but
> +;; we include them here to allow matching simpler, unpredicated RTL.
> +(define_insn "*aarch64_mul_unpredicated_<mode>"
> +  [(set (match_operand:SVE_I 0 "register_operand" "=w,w,?&w")
> + (mult:SVE_I
> +   (match_operand:SVE_I 1 "register_operand" "w,0,w")
> +   (match_operand:SVE_I 2 "aarch64_sve_vsm_operand" "w,vsm,vsm")))]
> +  "TARGET_SVE2"
> +  "@
> +   mul\t%0.<Vetype>, %1.<Vetype>, %2.<Vetype>
> +   mul\t%0.<Vetype>, %0.<Vetype>, #%2
> +   movprfx\t%0, %1\;mul\t%0.<Vetype>, %0.<Vetype>, #%2"
> +  [(set_attr "movprfx" "*,*,yes")]
> +)
> +
>  ;; -
>  ;;  [INT] Scaled high-part multiplication
>  ;; -
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve2/div-by-bitmask_1.c 
> b/gcc/testsuite/gcc.target/aarch64/sve2/div-by-bitmask_1.c
> index 
> e6f5098c30f4e2eb8ed1af153c0bb0d204cda6d9..1e546a93906962ba2469ddb3bf2ee9c0166dbae1
>  100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve2/div-by-bitmask_1.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve2/div-by-bitmask_1.c
> @@ -7,7 +7,7 @@
>  /*
>  ** draw_bitmap1:
>  ** ...
> -**   mul z[0-9]+.h, p[0-9]+/m, z[0-9]+.h, z[0-9]+.h
> +**   mul z[0-9]+.h, z[0-9]+.h, z[0-9]+.h
>  **   addhnb  z[0-9]+.b, z[0-9]+.h, z[0-9]+.h
>  **   addhnb  z[0-9]+.b, z[0-9]+.h, z[0-9]+.h
>  ** ...
> @@ -27,7 +27,7 @@ void draw_bitmap2(uint8_t* restrict pixel, uint8_t level, 
> int n)
>  /*
>  ** draw_bitmap3:
>  ** ...
> -**   mul z[0-9]+.s, p[0-9]+/m, z[0-9]+.s, z[0-9]+.s
> +**   mul z[0-9]+.s, z[0-9]+.s, z[0-9]+.s
>  **   addhnb  z[0-9]+.h, z[0-9]+.s, z[0-9]+.s
>  **   addhnb  z[0-9]+.h, z[0-9]+.s, z[0-9]+.s
>  ** ...
> @@ -41,7 +41,7 @@ void draw_bitmap3(uint16_t* restrict pixel, uint16_t level, 
> int n)
>  /*
>  ** draw_bitmap4:
>  ** ...
> -**   mul z[0-9]+.d, p[0-9]+/m, z[0-9]+.d, z[0-9]+.d
> +**   mul z[0-9]+.d, z[0-9]+.d, z[0-9]+.d
>  **   addhnb  z[0-9]+.s, z[0-9]+.d, z[0-9]+.d
>  **   addhnb  z[0-9]+.s, z[0-9]+.d, z[0-9]+.d
>  ** ...
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve2/unpred_mul_1.c 
> b/gcc/testsuite/gcc.target/aarch64/sve2/unpred_mul_1.c
> new file mode 

Re: [PATCH] RFC: New compact syntax for insn and insn_split in Machine Descriptions

2023-04-24 Thread Richard Sandiford via Gcc-patches
Richard Sandiford  writes:
> Tamar Christina  writes:
>> Hi All,
>>
>> This patch adds support for a compact syntax for specifying constraints in
>> instruction patterns. Credit for the idea goes to Richard Earnshaw.
>>
>> I am sending up this RFC to get feedback for its inclusion in GCC 14.
>> With this new syntax we want a clean break from the current limitations to 
>> make
>> something that is hopefully easier to use and maintain.
>>
>> The idea behind this compact syntax is that often times it's quite hard to
>> correlate the entries in the constraints list, attributes and instruction 
>> lists.
>>
>> One has to count and this often is tedious.  Additionally when changing a 
>> single
>> line in the insn multiple lines in a diff change, making it harder to see 
>> what's
>> going on.
>>
>> This new syntax takes into account many of the common things that are done 
>> in MD
>> files.   It's also worth saying that this version is intended to deal with 
>> the
>> common case of a string based alternatives.   For C chunks we have some ideas
>> but those are not intended to be addressed here.
>>
>> It's easiest to explain with an example:
>>
>> normal syntax:
>>
>> (define_insn_and_split "*movsi_aarch64"
>>   [(set (match_operand:SI 0 "nonimmediate_operand" "=r,k,r,r,r,r, r,w, m, m, 
>>  r,  r,  r, w,r,w, w")
>>  (match_operand:SI 1 "aarch64_mov_operand"  " 
>> r,r,k,M,n,Usv,m,m,rZ,w,Usw,Usa,Ush,rZ,w,w,Ds"))]
>>   "(register_operand (operands[0], SImode)
>> || aarch64_reg_or_zero (operands[1], SImode))"
>>   "@
>>mov\\t%w0, %w1
>>mov\\t%w0, %w1
>>mov\\t%w0, %w1
>>mov\\t%w0, %1
>>#
>>* return aarch64_output_sve_cnt_immediate (\"cnt\", \"%x0\", operands[1]);
>>ldr\\t%w0, %1
>>ldr\\t%s0, %1
>>str\\t%w1, %0
>>str\\t%s1, %0
>>adrp\\t%x0, %A1\;ldr\\t%w0, [%x0, %L1]
>>adr\\t%x0, %c1
>>adrp\\t%x0, %A1
>>fmov\\t%s0, %w1
>>fmov\\t%w0, %s1
>>fmov\\t%s0, %s1
>>* return aarch64_output_scalar_simd_mov_immediate (operands[1], SImode);"
>>   "CONST_INT_P (operands[1]) && !aarch64_move_imm (INTVAL (operands[1]), 
>> SImode)
>> && REG_P (operands[0]) && GP_REGNUM_P (REGNO (operands[0]))"
>>[(const_int 0)]
>>"{
>>aarch64_expand_mov_immediate (operands[0], operands[1]);
>>DONE;
>> }"
>>   ;; The "mov_imm" type for CNT is just a placeholder.
>>   [(set_attr "type" "mov_reg,mov_reg,mov_reg,mov_imm,mov_imm,mov_imm,load_4,
>>  
>> load_4,store_4,store_4,load_4,adr,adr,f_mcr,f_mrc,fmov,neon_move")
>>(set_attr "arch"   "*,*,*,*,*,sve,*,fp,*,fp,*,*,*,fp,fp,fp,simd")
>>(set_attr "length" "4,4,4,4,*,  4,4, 4,4, 4,8,4,4, 4, 4, 4,   4")
>> ]
>> )
>>
>> New syntax:
>>
>> (define_insn_and_split "*movsi_aarch64"
>>   [(set (match_operand:SI 0 "nonimmediate_operand")
>>  (match_operand:SI 1 "aarch64_mov_operand"))]
>>   "(register_operand (operands[0], SImode)
>> || aarch64_reg_or_zero (operands[1], SImode))"
>>   "@@ (cons: 0 1; attrs: type arch length)
>>[=r, r  ; mov_reg  , *   , 4] mov\t%w0, %w1
>>[k , r  ; mov_reg  , *   , 4] ^
>>[r , k  ; mov_reg  , *   , 4] ^
>>[r , M  ; mov_imm  , *   , 4] mov\t%w0, %1
>>[r , n  ; mov_imm  , *   , *] #
>>[r , Usv; mov_imm  , sve , 4] << aarch64_output_sve_cnt_immediate ('cnt', 
>> '%x0', operands[1]);
>>[r , m  ; load_4   , *   , 4] ldr\t%w0, %1
>>[w , m  ; load_4   , fp  , 4] ldr\t%s0, %1
>>[m , rZ ; store_4  , *   , 4] str\t%w1, %0
>>[m , w  ; store_4  , fp  , 4] str\t%s1, %0
>>[r , Usw; load_4   , *   , 8] adrp\t%x0, %A1;ldr\t%w0, [%x0, %L1]
>>[r , Usa; adr  , *   , 4] adr\t%x0, %c1
>>[r , Ush; adr  , *   , 4] adrp\t%x0, %A1
>>[w , rZ ; f_mcr, fp  , 4] fmov\t%s0, %w1
>>[r , w  ; f_mrc, fp  , 4] fmov\t%w0, %s1
>>[w , w  ; fmov , fp  , 4] fmov\t%s0, %s1
>>[w , Ds ; neon_move, simd, 4] << aarch64_output_scalar_simd_mov_immediate 
>> (operands[1], SImode);"
>>   "CONST_INT_P (operands[1]) && !aarch64_move_imm (INTVAL (operands[1]), 
>> SImode)
>> && REG_P (operands[0]) && GP_REGNUM_P (REGNO (operands[0]))"
>>   [(const_int 0)]
>>   {
>> aarch64_expand_mov_immediate (operands[0], operands[1]);
>> DONE;
>>   }
>>   ;; The "mov_imm" type for CNT is just a placeholder.
>> )
>>
>> The patch contains some more rewritten examples for both Arm and AArch64.  I
>> have included them for examples in this RFC but the final version posted in
>> GCC 14 will have these split out.
>>
>> The main syntax rules are as follows (See docs for full rules):
>>   - Template must start with "@@" to use the new syntax.
>>   - "@@" is followed by a layout in parentheses which is "cons:" followed by
>> a list of match_operand/match_scratch IDs, then a semicolon, then the
>> same for attributes ("attrs:"). Both sections are optional (so you can
>> use only cons, or only attrs, or both), and cons must come before attrs
>> if present.
>>   - Each alternative begins with any 

Re: [PATCH] RFC: New compact syntax for insn and insn_split in Machine Descriptions

2023-04-21 Thread Richard Sandiford via Gcc-patches
Tamar Christina  writes:
> Hi All,
>
> This patch adds support for a compact syntax for specifying constraints in
> instruction patterns. Credit for the idea goes to Richard Earnshaw.
>
> I am sending up this RFC to get feedback for its inclusion in GCC 14.
> With this new syntax we want a clean break from the current limitations to 
> make
> something that is hopefully easier to use and maintain.
>
> The idea behind this compact syntax is that often times it's quite hard to
> correlate the entries in the constraints list, attributes and instruction 
> lists.
>
> One has to count and this often is tedious.  Additionally when changing a 
> single
> line in the insn multiple lines in a diff change, making it harder to see 
> what's
> going on.
>
> This new syntax takes into account many of the common things that are done in 
> MD
> files.   It's also worth saying that this version is intended to deal with the
> common case of a string based alternatives.   For C chunks we have some ideas
> but those are not intended to be addressed here.
>
> It's easiest to explain with an example:
>
> normal syntax:
>
> (define_insn_and_split "*movsi_aarch64"
>   [(set (match_operand:SI 0 "nonimmediate_operand" "=r,k,r,r,r,r, r,w, m, m,  
> r,  r,  r, w,r,w, w")
>   (match_operand:SI 1 "aarch64_mov_operand"  " 
> r,r,k,M,n,Usv,m,m,rZ,w,Usw,Usa,Ush,rZ,w,w,Ds"))]
>   "(register_operand (operands[0], SImode)
> || aarch64_reg_or_zero (operands[1], SImode))"
>   "@
>mov\\t%w0, %w1
>mov\\t%w0, %w1
>mov\\t%w0, %w1
>mov\\t%w0, %1
>#
>* return aarch64_output_sve_cnt_immediate (\"cnt\", \"%x0\", operands[1]);
>ldr\\t%w0, %1
>ldr\\t%s0, %1
>str\\t%w1, %0
>str\\t%s1, %0
>adrp\\t%x0, %A1\;ldr\\t%w0, [%x0, %L1]
>adr\\t%x0, %c1
>adrp\\t%x0, %A1
>fmov\\t%s0, %w1
>fmov\\t%w0, %s1
>fmov\\t%s0, %s1
>* return aarch64_output_scalar_simd_mov_immediate (operands[1], SImode);"
>   "CONST_INT_P (operands[1]) && !aarch64_move_imm (INTVAL (operands[1]), 
> SImode)
> && REG_P (operands[0]) && GP_REGNUM_P (REGNO (operands[0]))"
>[(const_int 0)]
>"{
>aarch64_expand_mov_immediate (operands[0], operands[1]);
>DONE;
> }"
>   ;; The "mov_imm" type for CNT is just a placeholder.
>   [(set_attr "type" "mov_reg,mov_reg,mov_reg,mov_imm,mov_imm,mov_imm,load_4,
>   
> load_4,store_4,store_4,load_4,adr,adr,f_mcr,f_mrc,fmov,neon_move")
>(set_attr "arch"   "*,*,*,*,*,sve,*,fp,*,fp,*,*,*,fp,fp,fp,simd")
>(set_attr "length" "4,4,4,4,*,  4,4, 4,4, 4,8,4,4, 4, 4, 4,   4")
> ]
> )
>
> New syntax:
>
> (define_insn_and_split "*movsi_aarch64"
>   [(set (match_operand:SI 0 "nonimmediate_operand")
>   (match_operand:SI 1 "aarch64_mov_operand"))]
>   "(register_operand (operands[0], SImode)
> || aarch64_reg_or_zero (operands[1], SImode))"
>   "@@ (cons: 0 1; attrs: type arch length)
>[=r, r  ; mov_reg  , *   , 4] mov\t%w0, %w1
>[k , r  ; mov_reg  , *   , 4] ^
>[r , k  ; mov_reg  , *   , 4] ^
>[r , M  ; mov_imm  , *   , 4] mov\t%w0, %1
>[r , n  ; mov_imm  , *   , *] #
>[r , Usv; mov_imm  , sve , 4] << aarch64_output_sve_cnt_immediate ('cnt', 
> '%x0', operands[1]);
>[r , m  ; load_4   , *   , 4] ldr\t%w0, %1
>[w , m  ; load_4   , fp  , 4] ldr\t%s0, %1
>[m , rZ ; store_4  , *   , 4] str\t%w1, %0
>[m , w  ; store_4  , fp  , 4] str\t%s1, %0
>[r , Usw; load_4   , *   , 8] adrp\t%x0, %A1;ldr\t%w0, [%x0, %L1]
>[r , Usa; adr  , *   , 4] adr\t%x0, %c1
>[r , Ush; adr  , *   , 4] adrp\t%x0, %A1
>[w , rZ ; f_mcr, fp  , 4] fmov\t%s0, %w1
>[r , w  ; f_mrc, fp  , 4] fmov\t%w0, %s1
>[w , w  ; fmov , fp  , 4] fmov\t%s0, %s1
>[w , Ds ; neon_move, simd, 4] << aarch64_output_scalar_simd_mov_immediate 
> (operands[1], SImode);"
>   "CONST_INT_P (operands[1]) && !aarch64_move_imm (INTVAL (operands[1]), 
> SImode)
> && REG_P (operands[0]) && GP_REGNUM_P (REGNO (operands[0]))"
>   [(const_int 0)]
>   {
> aarch64_expand_mov_immediate (operands[0], operands[1]);
> DONE;
>   }
>   ;; The "mov_imm" type for CNT is just a placeholder.
> )
>
> The patch contains some more rewritten examples for both Arm and AArch64.  I
> have included them for examples in this RFC but the final version posted in
> GCC 14 will have these split out.
>
> The main syntax rules are as follows (See docs for full rules):
>   - Template must start with "@@" to use the new syntax.
>   - "@@" is followed by a layout in parentheses which is "cons:" followed by
> a list of match_operand/match_scratch IDs, then a semicolon, then the
> same for attributes ("attrs:"). Both sections are optional (so you can
> use only cons, or only attrs, or both), and cons must come before attrs
> if present.
>   - Each alternative begins with any amount of whitespace.
>   - Following the whitespace is a comma-separated list of constraints and/or
> attributes within brackets [], with sections 

Re: [match.pd] [SVE] Add pattern to transform svrev(svrev(v)) --> v

2023-04-21 Thread Richard Sandiford via Gcc-patches
Prathamesh Kulkarni  writes:
> On Wed, 19 Apr 2023 at 16:17, Richard Biener  
> wrote:
>>
>> On Wed, Apr 19, 2023 at 11:21 AM Prathamesh Kulkarni
>>  wrote:
>> >
>> > On Tue, 11 Apr 2023 at 19:36, Prathamesh Kulkarni
>> >  wrote:
>> > >
>> > > On Tue, 11 Apr 2023 at 14:17, Richard Biener 
>> > >  wrote:
>> > > >
>> > > > On Wed, Apr 5, 2023 at 10:39 AM Prathamesh Kulkarni via Gcc-patches
>> > > >  wrote:
>> > > > >
>> > > > > Hi,
>> > > > > For the following test:
>> > > > >
>> > > > > svint32_t f(svint32_t v)
>> > > > > {
>> > > > >   return svrev_s32 (svrev_s32 (v));
>> > > > > }
>> > > > >
>> > > > > We generate 2 rev instructions instead of nop:
>> > > > > f:
>> > > > > rev z0.s, z0.s
>> > > > > rev z0.s, z0.s
>> > > > > ret
>> > > > >
>> > > > > The attached patch tries to fix that by trying to recognize the 
>> > > > > following
>> > > > > pattern in match.pd:
>> > > > > v1 = VEC_PERM_EXPR (v0, v0, mask)
>> > > > > v2 = VEC_PERM_EXPR (v1, v1, mask)
>> > > > > -->
>> > > > > v2 = v0
>> > > > > if mask is { nelts - 1, nelts - 2, nelts - 3, ... }
>> > > > >
>> > > > > Code-gen with patch:
>> > > > > f:
>> > > > > ret
>> > > > >
>> > > > > Bootstrap+test passes on aarch64-linux-gnu, and SVE bootstrap in 
>> > > > > progress.
>> > > > > Does it look OK for stage-1 ?
>> > > >
>> > > > I didn't look at the patch but 
>> > > > tree-ssa-forwprop.cc:simplify_permutation should
>> > > > handle two consecutive permutes with the 
>> > > > is_combined_permutation_identity
>> > > > which might need tweaking for VLA vectors
>> > > Hi Richard,
>> > > Thanks for the suggestions. The attached patch modifies
>> > > is_combined_permutation_identity
>> > > to recognize the above pattern.
>> > > Does it look OK ?
>> > > Bootstrap+test in progress on aarch64-linux-gnu and x86_64-linux-gnu.
>> > Hi,
>> > ping https://gcc.gnu.org/pipermail/gcc-patches/2023-April/615502.html
>>
>> Can you instead of def_stmt pass in a bool whether rhs1 is equal to rhs2
>> and amend the function comment accordingly, say,
>>
>>   tem = VEC_PERM ;
>>   res = VEC_PERM ;
>>
>> SAME_P specifies whether op0 and op1 compare equal.  */
>>
>> +  if (def_stmt)
>> +gcc_checking_assert (is_gimple_assign (def_stmt)
>> +&& gimple_assign_rhs_code (def_stmt) == 
>> VEC_PERM_EXPR);
>> this is then unnecessary
>>
>>mask = fold_ternary (VEC_PERM_EXPR, TREE_TYPE (mask1), mask1, mask1, 
>> mask2);
>> +
>> +  /* For VLA masks, check for the following pattern:
>> + v1 = VEC_PERM_EXPR (v0, v0, mask)
>> + v2 = VEC_PERM_EXPR (v1, v1, mask)
>> + -->
>> + v2 = v0
>>
>> you are not using 'mask' so please defer fold_ternary until after your
>> special-case.
>>
>> +  if (operand_equal_p (mask1, mask2, 0)
>> +  && !VECTOR_CST_NELTS (mask1).is_constant ()
>> +  && def_stmt
>> +  && operand_equal_p (gimple_assign_rhs1 (def_stmt),
>> + gimple_assign_rhs2 (def_stmt), 0))
>> +{
>> +  vec_perm_builder builder;
>> +   if (tree_to_vec_perm_builder (&builder, mask1))
>> +   {
>> + poly_uint64 nelts = TYPE_VECTOR_SUBPARTS (TREE_TYPE (mask1));
>> + vec_perm_indices sel (builder, 1, nelts);
>> + if (sel.series_p (0, 1, nelts - 1, -1))
>> +   return 1;
>> +   }
>> +  return 0;
>>
>> I'm defering to Richard whether this is the correct way to check for a vector
>> reversing mask (I wonder how constructing such mask is even possible)
> Hi Richard,
> Thanks for the suggestions, I have updated the patch accordingly.
>
> The following hunk from svrev_impl::fold() constructs mask in reverse:
> /* Permute as { nelts - 1, nelts - 2, nelts - 3, ... }.  */
> poly_int64 nelts = TYPE_VECTOR_SUBPARTS (TREE_TYPE (f.lhs));
> vec_perm_builder builder (nelts, 1, 3);
> for (int i = 0; i < 3; ++i)
>   builder.quick_push (nelts - i - 1);
> return fold_permute (f, builder);
>
> To see if mask chooses elements in reverse, I borrowed it from function 
> comment
> for series_p in vec-perm-indices.cc:
> /* Return true if index OUT_BASE + I * OUT_STEP selects input
>element IN_BASE + I * IN_STEP.  For example, the call to test
>whether a permute reverses a vector of N elements would be:
>
>  series_p (0, 1, N - 1, -1)
>
>which would return true for { N - 1, N - 2, N - 3, ... }.  */
>
> Thanks,
> Prathamesh
>>
>> Richard.
>>
>> > Thanks,
>> > Prathamesh
>> > >
>> > > Thanks,
>> > > Prathamesh
>> > > >
>> > > > Richard.
>> > > >
>> > > > >
>> > > > > Thanks,
>> > > > > Prathamesh
>
> gcc/ChangeLog:
>   * tree-ssa-forwprop.cc (is_combined_permutation_identity):
>   New parameter same_p.
>   Try to simplify two successive VEC_PERM_EXPRs with single operand
>   and same mask, where mask chooses elements in reverse order.
>
> gcc/testesuite/ChangeLog:
>   * gcc.target/aarch64/sve/acle/general/rev-1.c: New test.
>
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/general/rev-1.c 
> 

Re: [PATCH v2] Leveraging the use of STP instruction for vec_duplicate

2023-04-21 Thread Richard Sandiford via Gcc-patches
"Victor L. Do Nascimento"  writes:
> The backend pattern for storing a pair of identical values in 32 and
> 64-bit modes with the machine instruction STP was missing, and
> multiple instructions were needed to reproduce this behavior as a
> result of a failed RTL pattern match in the combine pass.
>
> For the test case:
>
> typedef long long v2di __attribute__((vector_size (16)));
> typedef int v2si __attribute__((vector_size (8)));
>
> void
> foo (v2di *x, long long a)
> {
>   v2di tmp = {a, a};
>   *x = tmp;
> }
>
> void
> foo2 (v2si *x, int a)
> {
>   v2si tmp = {a, a};
>   *x = tmp;
> }
>
> at -O2 on aarch64 gives:
>
> foo
>   stp x1, x1, [x0]
>   ret
> foo2:
>   stp w1, w1, [x0]
>   ret
>
> instead of:
>
> foo:
>   dup v0.2d, x1
>   str q0, [x0]
>   ret
> foo2:
>   dup v0.2s, w1
>   str d0, [x0]
>   ret
>
> Bootstrapped and regtested on aarch64-none-linux-gnu.  Ok to install?
>
> gcc/
>   * config/aarch64/aarch64-simd.md (aarch64_simd_stp<mode>): New.
>   * config/aarch64/constraints.md: Make "Umn" relaxed memory
>   constraint.
>   * config/aarch64/iterators.md (ldpstp_vel_sz): New.
>
> gcc/testsuite/
>   * gcc.target/aarch64/stp_vec_dup_32_64-1.c:

Nit: missing text after ":"

OK to install with that fixed, thanks.  Please follow
https://gcc.gnu.org/gitwrite.html to get write access.

Richard

> ---
>  gcc/config/aarch64/aarch64-simd.md| 10 
>  gcc/config/aarch64/constraints.md |  2 +-
>  gcc/config/aarch64/iterators.md   |  3 +
>  .../gcc.target/aarch64/stp_vec_dup_32_64-1.c  | 57 +++
>  4 files changed, 71 insertions(+), 1 deletion(-)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/stp_vec_dup_32_64-1.c
>
> diff --git a/gcc/config/aarch64/aarch64-simd.md 
> b/gcc/config/aarch64/aarch64-simd.md
> index de2b7383749..8b5e67bd100 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -229,6 +229,16 @@
>[(set_attr "type" "neon_stp")]
>  )
>  
> +(define_insn "aarch64_simd_stp<mode>"
> +  [(set (match_operand:VP_2E 0 "aarch64_mem_pair_lanes_operand" "=Umn,Umn")
> + (vec_duplicate:VP_2E (match_operand:<VEL> 1 "register_operand" "w,r")))]
> +  "TARGET_SIMD"
> +  "@
> +   stp\\t%1, %1, %y0
> +   stp\\t%1, %1, %y0"
> +  [(set_attr "type" "neon_stp, store_<ldpstp_vel_sz>")]
> +)
> +
>  (define_insn "load_pair"
>[(set (match_operand:VQ 0 "register_operand" "=w")
>   (match_operand:VQ 1 "aarch64_mem_pair_operand" "Ump"))
> diff --git a/gcc/config/aarch64/constraints.md 
> b/gcc/config/aarch64/constraints.md
> index 5b20abc27e5..6df1dbec2a8 100644
> --- a/gcc/config/aarch64/constraints.md
> +++ b/gcc/config/aarch64/constraints.md
> @@ -287,7 +287,7 @@
>  ;; Used for storing or loading pairs in an AdvSIMD register using an STP/LDP
>  ;; as a vector-concat.  The address mode uses the same constraints as if it
>  ;; were for a single value.
> -(define_memory_constraint "Umn"
> +(define_relaxed_memory_constraint "Umn"
>"@internal
>A memory address suitable for a load/store pair operation."
>(and (match_code "mem")
> diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md
> index 6cbc97cc82c..980dacb8025 100644
> --- a/gcc/config/aarch64/iterators.md
> +++ b/gcc/config/aarch64/iterators.md
> @@ -1017,6 +1017,9 @@
>  ;; Likewise for load/store pair.
>  (define_mode_attr ldpstp_sz [(SI "8") (DI "16")])
>  
> +;; Size of element access for STP/LDP-generated vectors.
> +(define_mode_attr ldpstp_vel_sz [(V2SI "8") (V2SF "8") (V2DI "16") (V2DF 
> "16")])
> +
>  ;; For inequal width int to float conversion
>  (define_mode_attr w1 [(HF "w") (SF "w") (DF "x")])
>  (define_mode_attr w2 [(HF "x") (SF "x") (DF "w")])
> diff --git a/gcc/testsuite/gcc.target/aarch64/stp_vec_dup_32_64-1.c 
> b/gcc/testsuite/gcc.target/aarch64/stp_vec_dup_32_64-1.c
> new file mode 100644
> index 000..fc2c1ea39e0
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/stp_vec_dup_32_64-1.c
> @@ -0,0 +1,57 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2" } */
> +
> +typedef long long v2di __attribute__((vector_size (16)));
> +typedef int v2si __attribute__((vector_size (8)));
> +
> +#define TESTV2DI(lab, idx)   \
> +  void   \
> +  stpv2di_##lab (v2di *x, long long a)   \
> +  {  \
> +v2di tmp = {a, a};   \
> +x[idx] = tmp;\
> +  }
> +
> +
> +#define TESTV2SI(lab, idx)   \
> +  void   \
> +  stpv2si_##lab (v2si *x, int a) \
> +  {  \
> +v2si tmp = {a, a};   \
> +x[idx] = tmp;\
> +  }  \
> +
> +/* Core test, no imm assembler offset:  */
> +
> +TESTV2SI(0, 0)
> +TESTV2DI(0, 0)
> +/* { dg-final { 

Re: [aarch64] Use force_reg instead of copy_to_mode_reg

2023-04-21 Thread Richard Sandiford via Gcc-patches
Prathamesh Kulkarni  writes:
> Hi Richard,
> Based on your suggestions in the other thread, the patch uses force_reg
> to avoid creating pseudo if value is already in a register.
> Bootstrap+test passes on aarch64-linux-gnu.
> OK to commit ?
>
> Thanks,
> Prathamesh
>
> [aarch64] Use force_reg instead of copy_to_mode_reg.
>
> Use force_reg instead of copy_to_mode_reg in aarch64_simd_dup_constant
> and aarch64_expand_vector_init to avoid creating a pseudo if the original value
> is already in a register.
>
> gcc/ChangeLog:
>   * config/aarch64/aarch64.cc (aarch64_simd_dup_constant): Use
>   force_reg instead of copy_to_mode_reg.
>   (aarch64_expand_vector_init): Likewise.

OK, thanks.

Richard

> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 0d7470c05a1..321580d7f6a 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -21968,7 +21968,7 @@ aarch64_simd_dup_constant (rtx vals)
>/* We can load this constant by using DUP and a constant in a
>   single ARM register.  This will be cheaper than a vector
>   load.  */
> -  x = copy_to_mode_reg (inner_mode, x);
> +  x = force_reg (inner_mode, x);
>return gen_vec_duplicate (mode, x);
>  }
>  
> @@ -22082,7 +22082,7 @@ aarch64_expand_vector_init (rtx target, rtx vals)
>/* Splat a single non-constant element if we can.  */
>if (all_same)
>  {
> -  rtx x = copy_to_mode_reg (inner_mode, v0);
> +  rtx x = force_reg (inner_mode, v0);
>aarch64_emit_move (target, gen_vec_duplicate (mode, x));
>return;
>  }
> @@ -22190,12 +22190,12 @@ aarch64_expand_vector_init (rtx target, rtx vals)
>vector register.  For big-endian we want that position to hold
>the last element of VALS.  */
> maxelement = BYTES_BIG_ENDIAN ? n_elts - 1 : 0;
> -   rtx x = copy_to_mode_reg (inner_mode, XVECEXP (vals, 0, maxelement));
> +   rtx x = force_reg (inner_mode, XVECEXP (vals, 0, maxelement));
> aarch64_emit_move (target, lowpart_subreg (mode, x, inner_mode));
>   }
>else
>   {
> -   rtx x = copy_to_mode_reg (inner_mode, XVECEXP (vals, 0, maxelement));
> +   rtx x = force_reg (inner_mode, XVECEXP (vals, 0, maxelement));
> aarch64_emit_move (target, gen_vec_duplicate (mode, x));
>   }
>  
> @@ -22205,7 +22205,7 @@ aarch64_expand_vector_init (rtx target, rtx vals)
> rtx x = XVECEXP (vals, 0, i);
> if (matches[i][0] == maxelement)
>   continue;
> -   x = copy_to_mode_reg (inner_mode, x);
> +   x = force_reg (inner_mode, x);
> emit_insn (GEN_FCN (icode) (target, x, GEN_INT (i)));
>   }
>return;
> @@ -22249,7 +22249,7 @@ aarch64_expand_vector_init (rtx target, rtx vals)
>rtx x = XVECEXP (vals, 0, i);
>if (CONST_INT_P (x) || CONST_DOUBLE_P (x))
>   continue;
> -  x = copy_to_mode_reg (inner_mode, x);
> +  x = force_reg (inner_mode, x);
>emit_insn (GEN_FCN (icode) (target, x, GEN_INT (i)));
>  }
>  }


Re: [RFC 0/X] Implement GCC support for AArch64 libmvec

2023-04-21 Thread Richard Sandiford via Gcc-patches
"Andre Vieira (lists)"  writes:
> On 20/04/2023 17:13, Richard Sandiford wrote:
>> "Andre Vieira (lists)"  writes:
>>> On 20/04/2023 15:51, Richard Sandiford wrote:
 "Andre Vieira (lists)"  writes:
> Hi all,
>
> This is a series of patches/RFCs to implement support in GCC to be able
> to target AArch64's libmvec functions that will be/are being added to 
> glibc.
> We have chosen to use the omp pragma '#pragma omp declare variant ...'
> with a simd construct as the way for glibc to inform GCC what functions
> are available.
>
> For example, if we would like to supply a vector version of the scalar
> 'cosf' we would have an include file with something like:
> typedef __attribute__((__neon_vector_type__(4))) float __f32x4_t;
> typedef __attribute__((__neon_vector_type__(2))) float __f32x2_t;
> typedef __SVFloat32_t __sv_f32_t;
> typedef __SVBool_t __sv_bool_t;
> __f32x4_t _ZGVnN4v_cosf (__f32x4_t);
> __f32x2_t _ZGVnN2v_cosf (__f32x2_t);
> __sv_f32_t _ZGVsMxv_cosf (__sv_f32_t, __sv_bool_t);
> #pragma omp declare variant(_ZGVnN4v_cosf) \
>match(construct = {simd(notinbranch, simdlen(4))}, device =
> {isa("simd")})
> #pragma omp declare variant(_ZGVnN2v_cosf) \
>match(construct = {simd(notinbranch, simdlen(2))}, device =
> {isa("simd")})
> #pragma omp declare variant(_ZGVsMxv_cosf) \
>match(construct = {simd(inbranch)}, device = {isa("sve")})
> extern float cosf (float);
>
> The BETA ABI can be found in the vfabia64 subdir of
> https://github.com/ARM-software/abi-aa/
> This currently disagrees with how this patch series implements 'omp
> declare simd' for SVE and I also do not see a need for the 'omp declare
> variant' scalable extension constructs. I will make changes to the ABI
> once we've finalized the co-design of the ABI and this implementation.

 I don't see a good reason for dropping the extension("scalable").
 The problem is that since the base spec requires a simdlen clause,
 GCC should in general raise an error if simdlen is omitted.
>>> Where can you find this in the specs? I tried to find it but couldn't.
>>>
>>> Leaving out simdlen in an 'omp declare simd' I assume is OK; our vector
>>> ABI defines behaviour for this. But I couldn't find what it meant for an
>>> omp declare variant, which obviously can't be the same as for declare simd, as
>>> that is defined to mean 'define a set of clones' and only one clone can
>>> be associated with a declare variant.
>> 
>> I was going from https://www.openmp.org/spec-html/5.0/openmpsu25.html ,
>> which says:
>> 
>>The simd trait can be further defined with properties that match the
>>clauses accepted by the declare simd directive with the same name and
>>semantics. The simd trait must define at least the simdlen property and
>>one of the inbranch or notinbranch properties.
>> 
>> (probably best to read it in the original -- it's almost incomprehensible
>> without markup)
>> 
> I'm guessing the keyword here is 'trait', which is different
> from an omp declare simd directive, which is why it's not required to
> have a simdlen clause in an omp declare simd (see Jakub's comment).

Sure.  The thread above is about whether we need extension("scalable")
or should drop it.  And extension("scalable") is only used in omp
declare variant.  This was in response to "I also do not see a need
for the 'omp declare variant' scalable extension constructs".

Not having a simdlen on an omp declare simd is of course OK (and the
VFABI defines behaviour for that case).
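
For concreteness, the distinction looks roughly like this (a hedged
illustration with made-up function names, not glibc's actual headers):

typedef __attribute__((__neon_vector_type__(4))) float __f32x4_t;

/* 'omp declare simd' may omit simdlen; the vector-function ABI then
   determines which clones are implied.  */
#pragma omp declare simd notinbranch
extern float my_scalar_fn (float x);

/* For 'omp declare variant', the OpenMP 5.0 simd trait is expected to
   name a simdlen, hence the question of how to express "scalable".  */
__f32x4_t my_vector_fn (__f32x4_t);
#pragma omp declare variant(my_vector_fn) \
  match(construct = {simd(notinbranch, simdlen(4))}, device = {isa("simd")})
extern float my_other_scalar_fn (float x);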

Richard


Re: [PATCH] MAINTAINERS: add Vineet Gupta to write after approval

2023-04-21 Thread Richard Sandiford via Gcc-patches
Palmer Dabbelt  writes:
> On Thu, 20 Apr 2023 09:55:23 PDT (-0700), Vineet Gupta wrote:
>> ChangeLog:
>>
>>  * MAINTAINERS (Write After Approval): Add myself.
>>
>> (Ref: <680c7bbe-5d6e-07cd-8468-247afc65e...@gmail.com>)
>>
>> Signed-off-by: Vineet Gupta 
>> ---
>>  MAINTAINERS | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index cebf45d49e56..5f25617212a5 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -434,6 +434,7 @@ Haochen Gui  
>> 
>>  Jiufu Guo   
>>  Xuepeng Guo 
>>  Wei Guozhi  
>> +Vineet Gupta
>>  Naveen H.S  
>>  Mostafa Hagog   
>>  Andrew Haley
>
> Acked-by: Palmer Dabbelt 
>
> Though not sure if I can do that, maybe we need a global reviewer?

No approval is needed when adding oneself to write-after-approval.
The fact that one's able to make the change is proof enough.

Richard


Re: [aarch64] Use dup and zip1 for interleaving elements in initializing vector

2023-04-21 Thread Richard Sandiford via Gcc-patches
Prathamesh Kulkarni  writes:
> Hi,
> I tested the interleave+zip1 for vector init patch and it segfaulted
> during bootstrap while trying to build
> libgfortran/generated/matmul_i2.c.
> Rebuilding with --enable-checking=rtl showed out of bounds access in
> aarch64_unzip_vector_init in following hunk:
>
> +  rtvec vec = rtvec_alloc (n / 2);
> +  for (int i = 0; i < n; i++)
> +RTVEC_ELT (vec, i) = (even_p) ? XVECEXP (vals, 0, 2 * i)
> + : XVECEXP (vals, 0, 2 * i + 1);
>
> which is incorrect since it allocates n/2 elements but iterates and stores up to n.
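
(For concreteness, the bounds fix being described mirrors the hunk above but
only walks n/2 elements; a sketch of the idea, not the attached patch itself:)

  rtvec vec = rtvec_alloc (n / 2);
  for (int i = 0; i < n / 2; i++)
    RTVEC_ELT (vec, i) = even_p ? XVECEXP (vals, 0, 2 * i)
                                : XVECEXP (vals, 0, 2 * i + 1);
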
> The attached patch fixes the issue, which passed bootstrap, however
> resulted in following fallout during testsuite run:
>
> 1] sve/acle/general/dupq_[1-4].c tests fail.
> For the following test:
> int32x4_t f(int32_t x)
> {
>   return (int32x4_t) { x, 1, 2, 3 };
> }
>
> Code-gen without patch:
> f:
> adrpx1, .LC0
> ldr q0, [x1, #:lo12:.LC0]
> ins v0.s[0], w0
> ret
>
> Code-gen with patch:
> f:
> moviv0.2s, 0x2
> adrpx1, .LC0
> ldr d1, [x1, #:lo12:.LC0]
> ins v0.s[0], w0
> zip1v0.4s, v0.4s, v1.4s
> ret
>
> It shows fallback_seq_cost = 20, seq_total_cost = 16,
> where seq_total_cost is the cost of the interleave+zip1 sequence
> and fallback_seq_cost is the cost of the fallback sequence.
> Although it shows a lower cost, I am not sure whether the interleave+zip1
> sequence is really better in this case?

Debugging the patch, it looks like this is because the fallback sequence
contains a redundant pseudo-to-pseudo move, which is costed as 1
instruction (4 units).  The RTL equivalent of the:

 moviv0.2s, 0x2
 ins v0.s[0], w0

has a similar redundant move, but the cost of that move is subsumed by
the cost of the other arm (the load from LC0), which is costed as 3
instructions (12 units).  So we have 12 + 4 for the parallel version
(correct) but 12 + 4 + 4 for the serial version (one instruction too
many).

The reason we have redundant moves is that the expansion code uses
copy_to_mode_reg to force a value into a register.  This creates a
new pseudo even if the original value was already a register.
Using force_reg removes the moves and makes the test pass.

So I think the first step is to use force_reg instead of
copy_to_mode_reg in aarch64_simd_dup_constant and
aarch64_expand_vector_init (as a preparatory patch).
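
For reference, the behavioural difference between the two helpers is roughly
as follows (a simplified sketch, not the real GCC implementations):

  /* Always allocates a fresh pseudo and always emits a move.  */
  rtx
  copy_to_mode_reg_sketch (machine_mode mode, rtx x)
  {
    rtx temp = gen_reg_rtx (mode);
    emit_move_insn (temp, x);
    return temp;
  }

  /* Returns X unchanged when it is already a register, so no extra move.  */
  rtx
  force_reg_sketch (machine_mode mode, rtx x)
  {
    if (REG_P (x))
      return x;
    rtx temp = gen_reg_rtx (mode);
    emit_move_insn (temp, x);
    return temp;
  }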

> 2] sve/acle/general/dupq_[5-6].c tests fail:
> int32x4_t f(int32_t x0, int32_t x1, int32_t x2, int32_t x3)
> {
>   return (int32x4_t) { x0, x1, x2, x3 };
> }
>
> code-gen without patch:
> f:
> fmovs0, w0
> ins v0.s[1], w1
> ins v0.s[2], w2
> ins v0.s[3], w3
> ret
>
> code-gen with patch:
> f:
> fmovs0, w0
> fmovs1, w1
> ins v0.s[1], w2
> ins v1.s[1], w3
> zip1v0.4s, v0.4s, v1.4s
> ret
>
> It shows fallback_seq_cost = 28, seq_total_cost = 16

The zip version still wins after the fix above, but by a lesser amount.
It seems like a borderline case.

>
> 3] aarch64/ldp_stp_16.c's cons2_8_float test fails.
> Test case:
> void cons2_8_float(float *x, float val0, float val1)
> {
> #pragma GCC unroll(8)
>   for (int i = 0; i < 8 * 2; i += 2) {
> x[i + 0] = val0;
> x[i + 1] = val1;
>   }
> }
>
> which is lowered to:
> void cons2_8_float (float * x, float val0, float val1)
> {
>   vector(4) float _86;
>
>   <bb 2> [local count: 119292720]:
>   _86 = {val0_11(D), val1_13(D), val0_11(D), val1_13(D)};
>   MEM <vector(4) float> [(float *)x_10(D)] = _86;
>   MEM <vector(4) float> [(float *)x_10(D) + 16B] = _86;
>   MEM <vector(4) float> [(float *)x_10(D) + 32B] = _86;
>   MEM <vector(4) float> [(float *)x_10(D) + 48B] = _86;
>   return;
> }
>
> code-gen without patch:
> cons2_8_float:
> dup v0.4s, v0.s[0]
> ins v0.s[1], v1.s[0]
> ins v0.s[3], v1.s[0]
> stp q0, q0, [x0]
> stp q0, q0, [x0, 32]
> ret
>
> code-gen with patch:
> cons2_8_float:
> dup v1.2s, v1.s[0]
> dup v0.2s, v0.s[0]
> zip1v0.4s, v0.4s, v1.4s
> stp q0, q0, [x0]
> stp q0, q0, [x0, 32]
> ret
>
> It shows fallback_seq_cost = 28, seq_total_cost = 16
>
> I think the test fails because it doesn't match:
> **  dup v([0-9]+)\.4s, .*
>
> Would it be OK to amend the test, assuming the code-gen with the patch is better?

Yeah, the new code seems like an improvement.

> 4] aarch64/pr109072_1.c s32x4_3 test fails:
> For the following test:
> int32x4_t s32x4_3 (int32_t x, int32_t y)
> {
>   int32_t arr[] = { x, y, y, y };
>   return vld1q_s32 (arr);
> }
>
> code-gen without patch:
> s32x4_3:
> dup v0.4s, w1
> ins v0.s[0], w0
> ret
>
> code-gen with patch:
> s32x4_3:
> fmovs1, w1
> fmovs0, w0
> ins v0.s[1], v1.s[0]
> dup v1.2s, v1.s[0]
> zip1v0.4s, v0.4s, v1.4s
> ret
>
> It shows 

Re: [RFC 0/X] Implement GCC support for AArch64 libmvec

2023-04-20 Thread Richard Sandiford via Gcc-patches
"Andre Vieira (lists)"  writes:
> On 20/04/2023 15:51, Richard Sandiford wrote:
>> "Andre Vieira (lists)"  writes:
>>> Hi all,
>>>
>>> This is a series of patches/RFCs to implement support in GCC to be able
>>> to target AArch64's libmvec functions that will be/are being added to glibc.
>>> We have chosen to use the omp pragma '#pragma omp declare variant ...'
>>> with a simd construct as the way for glibc to inform GCC what functions
>>> are available.
>>>
>>> For example, if we would like to supply a vector version of the scalar
>>> 'cosf' we would have an include file with something like:
>>> typedef __attribute__((__neon_vector_type__(4))) float __f32x4_t;
>>> typedef __attribute__((__neon_vector_type__(2))) float __f32x2_t;
>>> typedef __SVFloat32_t __sv_f32_t;
>>> typedef __SVBool_t __sv_bool_t;
>>> __f32x4_t _ZGVnN4v_cosf (__f32x4_t);
>>> __f32x2_t _ZGVnN2v_cosf (__f32x2_t);
>>> __sv_f32_t _ZGVsMxv_cosf (__sv_f32_t, __sv_bool_t);
>>> #pragma omp declare variant(_ZGVnN4v_cosf) \
>>>   match(construct = {simd(notinbranch, simdlen(4))}, device =
>>> {isa("simd")})
>>> #pragma omp declare variant(_ZGVnN2v_cosf) \
>>>   match(construct = {simd(notinbranch, simdlen(2))}, device =
>>> {isa("simd")})
>>> #pragma omp declare variant(_ZGVsMxv_cosf) \
>>>   match(construct = {simd(inbranch)}, device = {isa("sve")})
>>> extern float cosf (float);
>>>
>>> The BETA ABI can be found in the vfabia64 subdir of
>>> https://github.com/ARM-software/abi-aa/
>>> This currently disagrees with how this patch series implements 'omp
>>> declare simd' for SVE and I also do not see a need for the 'omp declare
>>> variant' scalable extension constructs. I will make changes to the ABI
>>> once we've finalized the co-design of the ABI and this implementation.
>> 
>> I don't see a good reason for dropping the extension("scalable").
>> The problem is that since the base spec requires a simdlen clause,
>> GCC should in general raise an error if simdlen is omitted.
> Where can you find this in the specs? I tried to find it but couldn't.
>
> Leaving out simdlen in an 'omp declare simd' I assume is OK; our vector
> ABI defines behaviour for this. But I couldn't find what it meant for an
> omp declare variant, which obviously can't be the same as for declare simd, as
> that is defined to mean 'define a set of clones' and only one clone can
> be associated with a declare variant.

I was going from https://www.openmp.org/spec-html/5.0/openmpsu25.html ,
which says:

  The simd trait can be further defined with properties that match the
  clauses accepted by the declare simd directive with the same name and
  semantics. The simd trait must define at least the simdlen property and
  one of the inbranch or notinbranch properties.

(probably best to read it in the original -- it's almost incomprehensible
without markup)

Richard


Re: [PATCH 1/X] omp: Replace simd_clone_subparts with TYPE_VECTOR_SUBPARTS

2023-04-20 Thread Richard Sandiford via Gcc-patches
"Andre Vieira (lists)"  writes:
> Hi,
>
> This patch replaces the uses of simd_clone_subparts with 
> TYPE_VECTOR_SUBPARTS and removes the definition of the first.
>
> gcc/ChangeLog:
>
>  * omp-simd-clone.cc (simd_clone_subparts): Remove.
>  (simd_clone_init_simd_arrays): Replace simd_clone_subparts with 
> TYPE_VECTOR_SUBPARTS.
>  (ipa_simd_modify_function_body): Likewise.
>  * tree-vect-stmts.cc (simd_clone_subparts): Remove.
>  (vectorizable_simd_clone_call): Replace simd_clone_subparts 
> with TYPE_VECTOR_SUBPARTS.
>
> diff --git a/gcc/omp-simd-clone.cc b/gcc/omp-simd-clone.cc
> index 0949b8ba288dfc7e7692403bfc600983faddf5dd..48b480e7556d9ad8e5502e10e513ec36b17b9cbb 100644
> --- a/gcc/omp-simd-clone.cc
> +++ b/gcc/omp-simd-clone.cc
> @@ -255,16 +255,6 @@ ok_for_auto_simd_clone (struct cgraph_node *node)
>return true;
>  }
>  
> -
> -/* Return the number of elements in vector type VECTYPE, which is associated
> -   with a SIMD clone.  At present these always have a constant length.  */
> -
> -static unsigned HOST_WIDE_INT
> -simd_clone_subparts (tree vectype)
> -{
> -  return TYPE_VECTOR_SUBPARTS (vectype).to_constant ();
> -}
> -
>  /* Allocate a fresh `simd_clone' and return it.  NARGS is the number
> of arguments to reserve space for.  */
>  
> @@ -1027,7 +1017,7 @@ simd_clone_init_simd_arrays (struct cgraph_node *node,
>   }
> continue;
>   }
> -  if (known_eq (simd_clone_subparts (TREE_TYPE (arg)),
> +  if (known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg)),
>   node->simdclone->simdlen))
>   {
> tree ptype = build_pointer_type (TREE_TYPE (TREE_TYPE (array)));
> @@ -1039,7 +1029,7 @@ simd_clone_init_simd_arrays (struct cgraph_node *node,
>   }
>else
>   {
> -   unsigned int simdlen = simd_clone_subparts (TREE_TYPE (arg));
> +   poly_uint64 simdlen = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg));
> unsigned int times = vector_unroll_factor (node->simdclone->simdlen,
>simdlen);
> tree ptype = build_pointer_type (TREE_TYPE (TREE_TYPE (array)));
> @@ -1225,9 +1215,9 @@ ipa_simd_modify_function_body (struct cgraph_node *node,
> iter, NULL_TREE, NULL_TREE);
>adjustments->register_replacement (&(*adjustments->m_adj_params)[j], 
> r);
>  
> -  if (multiple_p (node->simdclone->simdlen, simd_clone_subparts 
> (vectype)))
> +  if (multiple_p (node->simdclone->simdlen, TYPE_VECTOR_SUBPARTS 
> (vectype)))
>   j += vector_unroll_factor (node->simdclone->simdlen,
> -simd_clone_subparts (vectype)) - 1;
> +TYPE_VECTOR_SUBPARTS (vectype)) - 1;
>  }
>adjustments->sort_replacements ();
>  
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index df6239a1c61c7213ad3c1468723bc1adf70bc02c..c85b6babc4bc5bc3111ef326dcc8f32bb25333f6 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -3964,16 +3964,6 @@ vect_simd_lane_linear (tree op, class loop *loop,
>  }
>  }
>  
> -/* Return the number of elements in vector type VECTYPE, which is associated
> -   with a SIMD clone.  At present these vectors always have a constant
> -   length.  */
> -
> -static unsigned HOST_WIDE_INT
> -simd_clone_subparts (tree vectype)
> -{
> -  return TYPE_VECTOR_SUBPARTS (vectype).to_constant ();
> -}
> -
>  /* Function vectorizable_simd_clone_call.
>  
> Check if STMT_INFO performs a function call that can be vectorized
> @@ -4251,7 +4241,7 @@ vectorizable_simd_clone_call (vec_info *vinfo, 
> stmt_vec_info stmt_info,
> slp_node);
>   if (arginfo[i].vectype == NULL
>   || !constant_multiple_p (bestn->simdclone->simdlen,
> -  simd_clone_subparts (arginfo[i].vectype)))
> +  TYPE_VECTOR_SUBPARTS (arginfo[i].vectype)))
> return false;
>}
>  
> @@ -4349,15 +4339,19 @@ vectorizable_simd_clone_call (vec_info *vinfo, 
> stmt_vec_info stmt_info,
>   case SIMD_CLONE_ARG_TYPE_VECTOR:
> atype = bestn->simdclone->args[i].vector_type;
> o = vector_unroll_factor (nunits,
> - simd_clone_subparts (atype));
> + TYPE_VECTOR_SUBPARTS (atype));
> for (m = j * o; m < (j + 1) * o; m++)
>   {
> -   if (simd_clone_subparts (atype)
> -   < simd_clone_subparts (arginfo[i].vectype))
> +   poly_uint64 atype_subparts = TYPE_VECTOR_SUBPARTS (atype);
> +   poly_uint64 arginfo_subparts
> + = TYPE_VECTOR_SUBPARTS (arginfo[i].vectype);
> +   if (known_lt (atype_subparts, arginfo_subparts))
>   {
> poly_uint64 prec = GET_MODE_BITSIZE (TYPE_MODE (atype));

Re: [RFC 0/X] Implement GCC support for AArch64 libmvec

2023-04-20 Thread Richard Sandiford via Gcc-patches
"Andre Vieira (lists)"  writes:
> Hi all,
>
> This is a series of patches/RFCs to implement support in GCC to be able 
> to target AArch64's libmvec functions that will be/are being added to glibc.
> We have chosen to use the omp pragma '#pragma omp declare variant ...' 
> with a simd construct as the way for glibc to inform GCC what functions 
> are available.
>
> For example, if we would like to supply a vector version of the scalar 
> 'cosf' we would have an include file with something like:
> typedef __attribute__((__neon_vector_type__(4))) float __f32x4_t;
> typedef __attribute__((__neon_vector_type__(2))) float __f32x2_t;
> typedef __SVFloat32_t __sv_f32_t;
> typedef __SVBool_t __sv_bool_t;
> __f32x4_t _ZGVnN4v_cosf (__f32x4_t);
> __f32x2_t _ZGVnN2v_cosf (__f32x2_t);
> __sv_f32_t _ZGVsMxv_cosf (__sv_f32_t, __sv_bool_t);
> #pragma omp declare variant(_ZGVnN4v_cosf) \
>  match(construct = {simd(notinbranch, simdlen(4))}, device = 
> {isa("simd")})
> #pragma omp declare variant(_ZGVnN2v_cosf) \
>  match(construct = {simd(notinbranch, simdlen(2))}, device = 
> {isa("simd")})
> #pragma omp declare variant(_ZGVsMxv_cosf) \
>  match(construct = {simd(inbranch)}, device = {isa("sve")})
> extern float cosf (float);
>
> The BETA ABI can be found in the vfabia64 subdir of 
> https://github.com/ARM-software/abi-aa/
> This currently disagrees with how this patch series implements 'omp 
> declare simd' for SVE and I also do not see a need for the 'omp declare 
> variant' scalable extension constructs. I will make changes to the ABI 
> once we've finalized the co-design of the ABI and this implementation.

I don't see a good reason for dropping the extension("scalable").
The problem is that since the base spec requires a simdlen clause,
GCC should in general raise an error if simdlen is omitted.
Relaxing that for an explicit extension seems better than doing it
only based on the ISA (which should in general be a free-form string).
Having "scalable" in the definition also helps to make the intent clearer.

Any change to the declare simd behaviour should probably be agreed
with the LLVM folks first.  Like you say, we already know that GCC
can do your version, since it already does the equivalent thing for x86.

I'm not sure, but I'm guessing the declare simd VFABI was written
that way because, at the time (several years ago), there were
concerns about switching SVE on and off on a function-by-function
basis in LLVM.

But I'm not sure it makes sense to ignore -msve-vector-bits= when
compiling the SVE version (which is what patch 4 seems to do).
If someone compiles with -march=armv8.4-a, we'll use all Armv8.4-A
features in the Advanced SIMD routines.  Why should we ignore
SVE-related target information for the SVE routines?

Of course, the fact that we take command-line options into account
means that omp simd/variant clauses on linkonce/comdat group functions
are an ODR violation waiting to happen.  But the same is true for the
original scalar functions that the clauses are attached to.

Thanks,
Richard

> The patch series has three main steps:
> 1) Add SVE support for 'omp declare simd', see PR 96342
> 2) Enable GCC to use omp declare variants with simd constructs as simd 
> clones during auto-vectorization.
> 3) Add SLP support for vectorizable_simd_clone_call (This sounded like a 
> nice thing to add as we want to move away from non-slp vectorization).
>
> Below you can see the list of current Patches/RFCs, the difference being
> how confident I am in the proposed changes. For the RFCs I am hoping
> to get early comments on the approach, rather than more in-depth
> code reviews.
>
> I appreciate we are still in Stage 4, so I can completely understand if 
> you don't have time to review this now, but I thought it can't hurt to 
> post these early.
>
> Andre Vieira:
> [PATCH] omp: Replace simd_clone_subparts with TYPE_VECTOR_SUBPARTS
> [PATCH] parloops: Copy target and optimizations when creating a function 
> clone
> [PATCH] parloops: Allow poly nit and bound
> [RFC] omp, aarch64: Add SVE support for 'omp declare simd' [PR 96342]
> [RFC] omp: Create simd clones from 'omp declare variant's
> [RFC] omp: Allow creation of simd clones from omp declare variant with 
> -fopenmp-simd flag
>
> Work in progress:
> [RFC] vect: Enable SLP codegen for vectorizable_simd_clone_call


Re: [PATCH v4 07/10] vect: Verify that GET_MODE_NUNITS is a multiple of 2.

2023-04-20 Thread Richard Sandiford via Gcc-patches
 writes:
> Yes, like Kito said.
> We won't enable VNx1DImode in auto-vectorization, so there is no need to fix
> it here.
> We dynamically adjust the minimum vector length for different '-march' options
> according to the RVV ISA specification.
> So we strongly suggest dropping this fix.

I think the patch should go in regardless.  If we have a port with
a VNx1 mode then the exact_div is at best dubious and at worst wrong.
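
To make that concrete, here is a hedged sketch (the helper name is made up
and the exact call site may differ).  For a VNx1 mode, GET_MODE_NUNITS is a
poly_int such as 1 + 1x, which is not a compile-time multiple of 2, so
exact_div (GET_MODE_NUNITS (mode), 2) would trip its checking assert (or
quietly compute the wrong value without checking).  A guarded form would be:

  static bool
  can_halve_nunits_p (machine_mode mode, poly_uint64 *half_nunits)
  {
    /* multiple_p only succeeds when the element count is known to be
       divisible by 2, so VNx1 modes are rejected instead of asserting.  */
    poly_uint64 nunits = GET_MODE_NUNITS (mode);
    return multiple_p (nunits, 2, half_nunits);
  }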

Thanks,
Richard


Re: [PATCH] VECT: Add WHILE_LEN pattern for decrement IV support for auto-vectorization

2023-04-20 Thread Richard Sandiford via Gcc-patches
Richard Biener  writes:
> On Thu, 20 Apr 2023, Richard Sandiford wrote:
>
>> "juzhe.zh...@rivai.ai"  writes:
>> > OK. Thanks Richard.
>> > So let me conclude:
>> > 1. The community agrees that I should support a variable IV in the middle-end.
>> > 2. We can keep the WHILE_LEN pattern when "not only final iteration is 
>> > partial".
>> > And I should describe it more clearly in the doc.
>> >
>> > I should do these 2 things in the later update patch.
>> 
>> Sounds good to me, but Richi is the maintainer.
>
> But I happily defer to you for designing VL stuff.  I suppose it's
> time to make you maintainer as well ... (you are global reviewer).

Heh, wasn't trying to bag an extra maintainership :-)  I just got a
bit lost in the thread and wasn't sure whether I was contradicting
something you'd said (in which case I'd defer to that).

Richard




Re: [PATCH] VECT: Add WHILE_LEN pattern for decrement IV support for auto-vectorization

2023-04-20 Thread Richard Sandiford via Gcc-patches
"juzhe.zh...@rivai.ai"  writes:
> OK. Thanks Richard.
> So let me conclude:
> 1. The community agrees that I should support a variable IV in the middle-end.
> 2. We can keep the WHILE_LEN pattern when "not only final iteration is partial".
> And I should describe it more clearly in the doc.
>
> I should do these 2 things in the later update patch.

Sounds good to me, but Richi is the maintainer.

Thanks,
Richard

