Re: [PATCH] [PR96339] AArch64: Optimise svlast[ab]

2023-06-12 Thread Tejas Belagod via Gcc-patches


From: Richard Sandiford 
Date: Friday, May 19, 2023 at 3:20 PM
To: Tejas Belagod 
Cc: gcc-patches@gcc.gnu.org 
Subject: Re: [PATCH] [PR96339] AArch64: Optimise svlast[ab]
Tejas Belagod  writes:
> Am I correct to understand that we still need to check for the case where
> there are repeating non-zero elements when NELTS_PER_PATTERN == 2?
> E.g. { 0, 0, 1, 1, 1, 1 }, which should be encoded as { 0, 0, 1, 1 } with
> NPATTERNS = 2?

Yeah, that's right.  The current handling for NPATTERNS==2 looked
good to me.  It was the other two cases that I was worried about.

Thanks,
Richard

Thanks for all the reviews. I’ve posted a new version of the patch here - 
https://gcc.gnu.org/pipermail/gcc-patches/2023-June/621310.html

Thanks,
Tejas.



Re: [PATCH] [PR96339] AArch64: Optimise svlast[ab]

2023-05-19 Thread Richard Sandiford via Gcc-patches
Tejas Belagod  writes:
> Am I correct to understand that we still need to check for the case where
> there are repeating non-zero elements when NELTS_PER_PATTERN == 2?
> E.g. { 0, 0, 1, 1, 1, 1 }, which should be encoded as { 0, 0, 1, 1 } with
> NPATTERNS = 2?

Yeah, that's right.  The current handling for NPATTERNS==2 looked
good to me.  It was the other two cases that I was worried about.

Thanks,
Richard
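
For reference, GCC's VECTOR_CST encoding stores NPATTERNS interleaved patterns
of NELTS_PER_PATTERN encoded elements each; element I of the full vector comes
from pattern I % NPATTERNS at position I / NPATTERNS. The sketch below (plain
C++ with ints standing in for trees; an illustration, not GCC code) models the
expansion for NELTS_PER_PATTERN <= 2, where the final encoded value of each
pattern repeats to fill the vector. NELTS_PER_PATTERN == 3, which encodes an
arithmetic series per pattern, is not modelled here.

    #include <cstdio>
    #include <vector>

    /* Expand an encoded constant: element I belongs to pattern
       I % NPATTERNS, at position I / NPATTERNS within that pattern.
       Positions beyond the encoding repeat each pattern's final value.  */
    static std::vector<int>
    expand (const std::vector<int> &encoded, int npatterns,
            int nelts_per_pattern, int nelts)
    {
      std::vector<int> out (nelts);
      for (int i = 0; i < nelts; ++i)
        {
          int pattern = i % npatterns;
          int pos = i / npatterns;
          if (pos >= nelts_per_pattern)
            pos = nelts_per_pattern - 1;
          out[i] = encoded[pos * npatterns + pattern];
        }
      return out;
    }

    int
    main ()
    {
      /* { 0, 0, 1, 1 } with NPATTERNS == 2, NELTS_PER_PATTERN == 2:
         prints 0 0 1 1 1 1 1 1, i.e. the case discussed above.  */
      for (int v : expand ({ 0, 0, 1, 1 }, 2, 2, 8))
        printf ("%d ", v);
      printf ("\n");

      /* { 0, 1, 0, 0 } with NPATTERNS == 4, NELTS_PER_PATTERN == 1:
         the whole sequence repeats, giving 0 1 0 0 0 1 0 0 (the case
         Richard describes elsewhere in the thread).  */
      for (int v : expand ({ 0, 1, 0, 0 }, 4, 1, 8))
        printf ("%d ", v);
      printf ("\n");
      return 0;
    }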


Re: [PATCH] [PR96339] AArch64: Optimise svlast[ab]

2023-05-19 Thread Tejas Belagod via Gcc-patches



From: Richard Sandiford 
Date: Tuesday, May 16, 2023 at 5:36 PM
To: Tejas Belagod 
Cc: gcc-patches@gcc.gnu.org 
Subject: Re: [PATCH] [PR96339] AArch64: Optimise svlast[ab]
Tejas Belagod  writes:
>>> +   {
>>> + b = build3 (BIT_FIELD_REF, TREE_TYPE (f.lhs), val,
>>> + bitsize_int (step * BITS_PER_UNIT),
>>> + bitsize_int ((16 - step) * BITS_PER_UNIT));
>>> +
>>> + return gimple_build_assign (f.lhs, b);
>>> +   }
>>> +
>>> + /* If VECTOR_CST_NELTS_PER_PATTERN (pred) == 2 and every multiple of
>>> +'step_1' in
>>> +[VECTOR_CST_NPATTERNS .. VECTOR_CST_ENCODED_NELTS - 1]
>>> +is zero, then we can treat the vector as VECTOR_CST_NPATTERNS
>>> +elements followed by all inactive elements.  */
>>> + if (!const_vl && VECTOR_CST_NELTS_PER_PATTERN (pred) == 2)
>>
>> Following on from the above, maybe use:
>>
>>   !VECTOR_CST_NELTS (pred).is_constant ()
>>
>> instead of !const_vl here.
>>
>> I have a horrible suspicion that I'm contradicting our earlier discussion
>> here, sorry, but: I think we have to return null if NELTS_PER_PATTERN != 2.
>>
>>
>>
>> IIUC, the NPATTERNS .. ENCODED_ELTS represent the repeated part of the
>> encoded constant. This means the repetition occurs if NELTS_PER_PATTERN == 2,
>> IOW the base1 repeats in the encoding. This loop is checking this condition
>> and looks for a 1 in the repeated part of the NELTS_PER_PATTERN == 2
>> encoding in a VL vector. Please correct me if I’m misunderstanding here.
>
> NELTS_PER_PATTERN == 1 is also a repeating pattern: it means that the
> entire sequence is repeated to fill a vector.  So if an NELTS_PER_PATTERN
> == 1 constant has elements {0, 1, 0, 0}, the vector is:
>
>{0, 1, 0, 0, 0, 1, 0, 0, ...}
>
>
> Wouldn’t the vect_all_same(pred, step) cover this case for a given value of
> step?
>
>
> and the optimisation can't handle that.  NELTS_PER_PATTERN == 3 isn't
> likely to occur for predicates, but in principle it has the same problem.
>
>
>
> OK, I had misunderstood the encoding to always make base1 the repeating value
> by adjusting the NPATTERNS accordingly – I didn’t know you could also have the
> base2 value and beyond encoding the repeat value. In this case, could I just
> remove the NELTS_PER_PATTERN == 2 condition, so that the enclosed loop would
> check for a repeating ‘1’ in the repeated part of the encoded pattern?

But for NELTS_PER_PATTERN==1, the whole encoded sequence repeats.
So you would have to start the check at element 0 rather than
NPATTERNS.  And then (for NELTS_PER_PATTERN==1) the loop would reject
any constant that has a nonzero element.  But all valid zero-vector
cases have been handled by this point, so the effect wouldn't be useful.

It should never be the case that all elements from NPATTERNS
onwards are zero for NELTS_PER_PATTERN==3; that case should be
canonicalised to NELTS_PER_PATTERN==2 instead.

So in practice it's simpler and more obviously correct to punt
when NELTS_PER_PATTERN != 2.

Thanks for the clarification.
I understand all the points about punting when NELTS_PER_PATTERN != 2, except
one.

Am I correct to understand that we still need to check for the case where
there are repeating non-zero elements when NELTS_PER_PATTERN == 2?
E.g. { 0, 0, 1, 1, 1, 1 }, which should be encoded as { 0, 0, 1, 1 } with
NPATTERNS = 2?

Thanks,
Tejas.


Thanks,
Richard


Re: [PATCH] [PR96339] AArch64: Optimise svlast[ab]

2023-05-16 Thread Richard Sandiford via Gcc-patches
Tejas Belagod  writes:
>>> +   {
>>> + b = build3 (BIT_FIELD_REF, TREE_TYPE (f.lhs), val,
>>> + bitsize_int (step * BITS_PER_UNIT),
>>> + bitsize_int ((16 - step) * BITS_PER_UNIT));
>>> +
>>> + return gimple_build_assign (f.lhs, b);
>>> +   }
>>> +
>>> + /* If VECTOR_CST_NELTS_PER_PATTERN (pred) == 2 and every multiple of
>>> +'step_1' in
>>> +[VECTOR_CST_NPATTERNS .. VECTOR_CST_ENCODED_NELTS - 1]
>>> +is zero, then we can treat the vector as VECTOR_CST_NPATTERNS
>>> +elements followed by all inactive elements.  */
>>> + if (!const_vl && VECTOR_CST_NELTS_PER_PATTERN (pred) == 2)
>>
>> Following on from the above, maybe use:
>>
>>   !VECTOR_CST_NELTS (pred).is_constant ()
>>
>> instead of !const_vl here.
>>
>> I have a horrible suspicion that I'm contradicting our earlier discussion
>> here, sorry, but: I think we have to return null if NELTS_PER_PATTERN != 2.
>>
>> 
>>
>> IIUC, the NPATTERNS .. ENCODED_ELTS represent the repeated part of the
>> encoded constant. This means the repetition occurs if NELTS_PER_PATTERN == 2,
>> IOW the base1 repeats in the encoding. This loop is checking this condition
>> and looks for a 1 in the repeated part of the NELTS_PER_PATTERN == 2
>> encoding in a VL vector. Please correct me if I’m misunderstanding here.
>
> NELTS_PER_PATTERN == 1 is also a repeating pattern: it means that the
> entire sequence is repeated to fill a vector.  So if an NELTS_PER_PATTERN
> == 1 constant has elements {0, 1, 0, 0}, the vector is:
>
>{0, 1, 0, 0, 0, 1, 0, 0, ...}
>
>
> Wouldn’t the vect_all_same(pred, step) cover this case for a given value of
> step?
>
>
> and the optimisation can't handle that.  NELTS_PER_PATTERN == 3 isn't
> likely to occur for predicates, but in principle it has the same problem.
>
>  
>
> OK, I had misunderstood the encoding to always make base1 the repeating value
> by adjusting the NPATTERNS accordingly – I didn’t know you could also have the
> base2 value and beyond encoding the repeat value. In this case, could I just
> remove the NELTS_PER_PATTERN == 2 condition, so that the enclosed loop would
> check for a repeating ‘1’ in the repeated part of the encoded pattern?

But for NELTS_PER_PATTERN==1, the whole encoded sequence repeats.
So you would have to start the check at element 0 rather than
NPATTERNS.  And then (for NELTS_PER_PATTERN==1) the loop would reject
any constant that has a nonzero element.  But all valid zero-vector
cases have been handled by this point, so the effect wouldn't be useful.

It should never be the case that all elements from NPATTERNS
onwards are zero for NELTS_PER_PATTERN==3; that case should be
canonicalised to NELTS_PER_PATTERN==2 instead.

So in practice it's simpler and more obviously correct to punt
when NELTS_PER_PATTERN != 2.

Thanks,
Richard
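
In code, the check the thread converges on might look like the following
self-contained sketch (plain ints stand in for GCC's encoded tree elements;
step_1 plays the role of gcd (element step, NPATTERNS) from the patch; this is
a model, not the patch itself):

    #include <cassert>
    #include <vector>

    /* Return true if an encoded predicate can be treated as NPATTERNS
       significant elements followed by all-inactive elements.  Punt unless
       NELTS_PER_PATTERN == 2: for == 1 the whole sequence repeats, and a
       == 3 constant with an all-zero tail should already have been
       canonicalised to == 2.  */
    static bool
    tail_all_inactive (const std::vector<int> &encoded, unsigned npatterns,
                       int nelts_per_pattern, unsigned step_1)
    {
      if (nelts_per_pattern != 2)
        return false;
      /* Scan every multiple of STEP_1 in [NPATTERNS .. ENCODED_NELTS - 1]
         for an active (nonzero) element.  STEP_1 divides NPATTERNS, so
         starting at NPATTERNS visits exactly those multiples.  */
      for (unsigned i = npatterns; i < encoded.size (); i += step_1)
        if (encoded[i] != 0)
          return false;
      return true;
    }

    int
    main ()
    {
      /* { 0, 0, 1, 1 }: NPATTERNS == 2, byte elements so step_1 == 1.
         Encoded elements 2 and 3 are active, so the tail is not all
         inactive.  */
      assert (!tail_all_inactive ({ 0, 0, 1, 1 }, 2, 2, 1));
      /* { 0, 1, 0, 0 } with NPATTERNS == 2, NELTS_PER_PATTERN == 2:
         the repeated part is all zero, so only element 1 is active.  */
      assert (tail_all_inactive ({ 0, 1, 0, 0 }, 2, 2, 1));
      return 0;
    }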


Re: [PATCH] [PR96339] AArch64: Optimise svlast[ab]

2023-05-16 Thread Tejas Belagod via Gcc-patches



From: Richard Sandiford 
Date: Tuesday, May 16, 2023 at 2:15 PM
To: Tejas Belagod 
Cc: gcc-patches@gcc.gnu.org 
Subject: Re: [PATCH] [PR96339] AArch64: Optimise svlast[ab]
Tejas Belagod  writes:
>> +  {
>> +int i;
>> +int nelts = vector_cst_encoded_nelts (v);
>> +int first_el = 0;
>> +
>> +for (i = first_el; i < nelts; i += step)
>> +  if (VECTOR_CST_ENCODED_ELT (v, i) != VECTOR_CST_ENCODED_ELT (v,
>> first_el))
>
> I think this should use !operand_equal_p (..., ..., 0).
>
>
> Oops! I wonder why I thought VECTOR_CST_ENCODED_ELT returned a constant! 
> Thanks
> for spotting that.

It does only return a constant.  But there can be multiple trees with
the same constant value, through things like TREE_OVERFLOW (not sure
where things stand on expunging that from gimple) and the fact that
gimple does not maintain a distinction between different types that
have the same mode and signedness.  (E.g. on ILP32 hosts, gimple does
not maintain a distinction between int and long, even though int 0 and
long 0 are different trees.)

> Also, should the flags here be OEP_ONLY_CONST ?

Nah, just 0 should be fine.

>> + return false;
>> +
>> +return true;
>> +  }
>> +
>> +  /* Fold a svlast{a/b} call with constant predicate to a BIT_FIELD_REF.
>> + BIT_FIELD_REF lowers to a NEON element extract, so we have to make sure
>> + the index of the element being accessed is in the range of a NEON
>> + vector width.  */
>
> s/NEON/Advanced SIMD/.  Same in later comments
>
>> +  gimple *fold (gimple_folder & f) const override
>> +  {
>> +tree pred = gimple_call_arg (f.call, 0);
>> +tree val = gimple_call_arg (f.call, 1);
>> +
>> +if (TREE_CODE (pred) == VECTOR_CST)
>> +  {
>> + HOST_WIDE_INT pos;
>> + unsigned int const_vg;
>> + int i = 0;
>> + int step = f.type_suffix (0).element_bytes;
>> + int step_1 = gcd (step, VECTOR_CST_NPATTERNS (pred));
>> + int npats = VECTOR_CST_NPATTERNS (pred);
>> + unsigned HOST_WIDE_INT nelts = vector_cst_encoded_nelts (pred);
>> + tree b = NULL_TREE;
>> + bool const_vl = aarch64_sve_vg.is_constant (&const_vg);
>
> I think this might be left over from previous versions, but:
> const_vg isn't used and const_vl is only used once, so I think it
> would be better to remove them.
>
>> +
>> + /* We can optimize 2 cases common to variable and fixed-length cases
>> +without a linear search of the predicate vector:
>> +1.  LASTA if predicate is all true, return element 0.
>> +2.  LASTA if predicate all false, return element 0.  */
>> + if (is_lasta () && vect_all_same (pred, step_1))
>> +   {
>> + b = build3 (BIT_FIELD_REF, TREE_TYPE (f.lhs), val,
>> + bitsize_int (step * BITS_PER_UNIT), bitsize_int (0));
>> + return gimple_build_assign (f.lhs, b);
>> +   }
>> +
>> + /* Handle the all-false case for LASTB where SVE VL == 128b -
>> +return the highest numbered element.  */
>> + if (is_lastb () && known_eq (BYTES_PER_SVE_VECTOR, 16)
>> + && vect_all_same (pred, step_1)
>> + && integer_zerop (VECTOR_CST_ENCODED_ELT (pred, 0)))
>
> Formatting nit: one condition per line once one line isn't enough.
>
>> +   {
>> + b = build3 (BIT_FIELD_REF, TREE_TYPE (f.lhs), val,
>> + bitsize_int (step * BITS_PER_UNIT),
>> + bitsize_int ((16 - step) * BITS_PER_UNIT));
>> +
>> + return gimple_build_assign (f.lhs, b);
>> +   }
>> +
>> + /* If VECTOR_CST_NELTS_PER_PATTERN (pred) == 2 and every multiple of
>> +'step_1' in
>> +[VECTOR_CST_NPATTERNS .. VECTOR_CST_ENCODED_NELTS - 1]
>> +is zero, then we can treat the vector as VECTOR_CST_NPATTERNS
>> +elements followed by all inactive elements.  */
>> + if (!const_vl && VECTOR_CST_NELTS_PER_PATTERN (pred) == 2)
>
> Following on from the above, maybe use:
>
>   !VECTOR_CST_NELTS (pred).is_constant ()
>
> instead of !const_vl here.
>
> I have a horrible suspicion that I'm contradicting our earlier discussion
> here, sorry, but: I think we have to return null if NELTS_PER_PATTERN != 2.
>
>
>
> IIUC, the NPATTERNS .. ENCODED_ELTS represent the repeated part of the encoded
> constant. This means the repetition occurs if NELTS_PER_PATTERN == 2, IOW the
> base1 repeats in the encoding. This loop is checking this condition and looks
> for a 1 in the repeated part of the NELTS_PER_PATTERN == 2 encoding in a VL
> vector. Please correct me if I’m misunderstanding here.

Re: [PATCH] [PR96339] AArch64: Optimise svlast[ab]

2023-05-16 Thread Richard Sandiford via Gcc-patches
Tejas Belagod  writes:
>> +  {
>> +int i;
>> +int nelts = vector_cst_encoded_nelts (v);
>> +int first_el = 0;
>> +
>> +for (i = first_el; i < nelts; i += step)
>> +  if (VECTOR_CST_ENCODED_ELT (v, i) != VECTOR_CST_ENCODED_ELT (v,
>> first_el))
>
> I think this should use !operand_equal_p (..., ..., 0).
>
>
> Oops! I wonder why I thought VECTOR_CST_ENCODED_ELT returned a constant! 
> Thanks
> for spotting that.

It does only return a constant.  But there can be multiple trees with
the same constant value, through things like TREE_OVERFLOW (not sure
where things stand on expunging that from gimple) and the fact that
gimple does not maintain a distinction between different types that
have the same mode and signedness.  (E.g. on ILP32 hosts, gimple does
not maintain a distinction between int and long, even though int 0 and
long 0 are different trees.)

> Also, should the flags here be OEP_ONLY_CONST ?

Nah, just 0 should be fine.

>> + return false;
>> +
>> +return true;
>> +  }
>> +
>> +  /* Fold a svlast{a/b} call with constant predicate to a BIT_FIELD_REF.
>> + BIT_FIELD_REF lowers to a NEON element extract, so we have to make sure
>> + the index of the element being accessed is in the range of a NEON
>> + vector width.  */
>
> s/NEON/Advanced SIMD/.  Same in later comments
>
>> +  gimple *fold (gimple_folder & f) const override
>> +  {
>> +tree pred = gimple_call_arg (f.call, 0);
>> +tree val = gimple_call_arg (f.call, 1);
>> +
>> +if (TREE_CODE (pred) == VECTOR_CST)
>> +  {
>> + HOST_WIDE_INT pos;
>> + unsigned int const_vg;
>> + int i = 0;
>> + int step = f.type_suffix (0).element_bytes;
>> + int step_1 = gcd (step, VECTOR_CST_NPATTERNS (pred));
>> + int npats = VECTOR_CST_NPATTERNS (pred);
>> + unsigned HOST_WIDE_INT nelts = vector_cst_encoded_nelts (pred);
>> + tree b = NULL_TREE;
>> + bool const_vl = aarch64_sve_vg.is_constant (&const_vg);
>
> I think this might be left over from previous versions, but:
> const_vg isn't used and const_vl is only used once, so I think it
> would be better to remove them.
>
>> +
>> + /* We can optimize 2 cases common to variable and fixed-length cases
>> +without a linear search of the predicate vector:
>> +1.  LASTA if predicate is all true, return element 0.
>> +2.  LASTA if predicate all false, return element 0.  */
>> + if (is_lasta () && vect_all_same (pred, step_1))
>> +   {
>> + b = build3 (BIT_FIELD_REF, TREE_TYPE (f.lhs), val,
>> + bitsize_int (step * BITS_PER_UNIT), bitsize_int (0));
>> + return gimple_build_assign (f.lhs, b);
>> +   }
>> +
>> + /* Handle the all-false case for LASTB where SVE VL == 128b -
>> +return the highest numbered element.  */
>> + if (is_lastb () && known_eq (BYTES_PER_SVE_VECTOR, 16)
>> + && vect_all_same (pred, step_1)
>> + && integer_zerop (VECTOR_CST_ENCODED_ELT (pred, 0)))
>
> Formatting nit: one condition per line once one line isn't enough.
>
>> +   {
>> + b = build3 (BIT_FIELD_REF, TREE_TYPE (f.lhs), val,
>> + bitsize_int (step * BITS_PER_UNIT),
>> + bitsize_int ((16 - step) * BITS_PER_UNIT));
>> +
>> + return gimple_build_assign (f.lhs, b);
>> +   }
>> +
>> + /* If VECTOR_CST_NELTS_PER_PATTERN (pred) == 2 and every multiple of
>> +'step_1' in
>> +[VECTOR_CST_NPATTERNS .. VECTOR_CST_ENCODED_NELTS - 1]
>> +is zero, then we can treat the vector as VECTOR_CST_NPATTERNS
>> +elements followed by all inactive elements.  */
>> + if (!const_vl && VECTOR_CST_NELTS_PER_PATTERN (pred) == 2)
>
> Following on from the above, maybe use:
>
>   !VECTOR_CST_NELTS (pred).is_constant ()
>
> instead of !const_vl here.
>
> I have a horrible suspicion that I'm contradicting our earlier discussion
> here, sorry, but: I think we have to return null if NELTS_PER_PATTERN != 2.
>
>  
>
> IIUC, the NPATTERNS .. ENCODED_ELTS represent the repeated part of the encoded
> constant. This means the repetition occurs if NELTS_PER_PATTERN == 2, IOW the
> base1 repeats in the encoding. This loop is checking this condition and looks
> for a 1 in the repeated part of the NELTS_PER_PATTERN == 2 encoding in a VL
> vector.
> Please correct me if I’m misunderstanding here.

NELTS_PER_PATTERN == 1 is also a repeating pattern: it means that the
entire sequence is repeated to fill a vector.  So if an NELTS_PER_PATTERN
== 1 constant has elements {0, 1, 0, 0}, the vector is:

   {0, 1, 0, 0, 0, 1, 0, 0, ...}

and the optimisation can't handle that.  NELTS_PER_PATTERN == 3 isn't
likely to occur for predicates, but in principle it has the same problem.

Thanks,
Richard
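
On the operand_equal_p point above: comparing the trees returned by
VECTOR_CST_ENCODED_ELT with != tests pointer identity, not value equality.
A standalone illustration of that distinction (a struct stands in for
distinct trees carrying the same constant value; this is not GCC code):

    #include <cassert>

    /* Two distinct objects with the same constant value, like the
       separate trees for int 0 and long 0 mentioned above.  */
    struct cst { long value; };

    int
    main ()
    {
      cst int_zero = { 0 };
      cst long_zero = { 0 };

      /* Address comparison (what tree != tree does) says they differ...  */
      assert (&int_zero != &long_zero);

      /* ...while value comparison (the job operand_equal_p does for
         trees) says they are equal.  */
      assert (int_zero.value == long_zero.value);
      return 0;
    }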


Re: [PATCH] [PR96339] AArch64: Optimise svlast[ab]

2023-05-16 Thread Tejas Belagod via Gcc-patches
Thanks for your comments, Richard.

From: Richard Sandiford 
Date: Friday, May 12, 2023 at 1:02 AM
To: Tejas Belagod 
Cc: gcc-patches@gcc.gnu.org , Tejas Belagod 

Subject: Re: [PATCH] [PR96339] AArch64: Optimise svlast[ab]
Tejas Belagod  writes:
> From: Tejas Belagod 
>
>   This PR optimizes an SVE intrinsics sequence such as
> svlasta (svptrue_pat_b8 (SV_VL1), x)
>   where a scalar is selected based on a constant predicate and a variable
>   vector. This sequence is optimized to return the corresponding element of
>   a NEON vector. For example,
> svlasta (svptrue_pat_b8 (SV_VL1), x)
>   returns
> umov w0, v0.b[1]
>   Likewise,
> svlastb (svptrue_pat_b8 (SV_VL1), x)
>   returns
> umov w0, v0.b[0]
>   This optimization only works provided the constant predicate maps to a range
>   that is within the bounds of a 128-bit NEON register.
>
> gcc/ChangeLog:
>
>PR target/96339
>* config/aarch64/aarch64-sve-builtins-base.cc (svlast_impl::fold): 
> Fold sve
>calls that have a constant input predicate vector.
>(svlast_impl::is_lasta): Query to check if intrinsic is svlasta.
>(svlast_impl::is_lastb): Query to check if intrinsic is svlastb.
>(svlast_impl::vect_all_same): Check if all vector elements are equal.
>
> gcc/testsuite/ChangeLog:
>
>PR target/96339
>* gcc.target/aarch64/sve/acle/general-c/svlast.c: New.
>* gcc.target/aarch64/sve/acle/general-c/svlast128_run.c: New.
>* gcc.target/aarch64/sve/acle/general-c/svlast256_run.c: New.
>* gcc.target/aarch64/sve/pcs/return_4.c (caller_bf16): Fix asm
>to expect optimized code for function body.
>* gcc.target/aarch64/sve/pcs/return_4_128.c (caller_bf16): Likewise.
>* gcc.target/aarch64/sve/pcs/return_4_256.c (caller_bf16): Likewise.
>* gcc.target/aarch64/sve/pcs/return_4_512.c (caller_bf16): Likewise.
>* gcc.target/aarch64/sve/pcs/return_4_1024.c (caller_bf16): Likewise.
>* gcc.target/aarch64/sve/pcs/return_4_2048.c (caller_bf16): Likewise.
>* gcc.target/aarch64/sve/pcs/return_5.c (caller_bf16): Likewise.
>* gcc.target/aarch64/sve/pcs/return_5_128.c (caller_bf16): Likewise.
>* gcc.target/aarch64/sve/pcs/return_5_256.c (caller_bf16): Likewise.
>* gcc.target/aarch64/sve/pcs/return_5_512.c (caller_bf16): Likewise.
>* gcc.target/aarch64/sve/pcs/return_5_1024.c (caller_bf16): Likewise.
>* gcc.target/aarch64/sve/pcs/return_5_2048.c (caller_bf16): Likewise.
> ---
>  .../aarch64/aarch64-sve-builtins-base.cc  | 124 +++
>  .../aarch64/sve/acle/general-c/svlast.c   |  63 
>  .../sve/acle/general-c/svlast128_run.c| 313 +
>  .../sve/acle/general-c/svlast256_run.c| 314 ++
>  .../gcc.target/aarch64/sve/pcs/return_4.c |   2 -
>  .../aarch64/sve/pcs/return_4_1024.c   |   2 -
>  .../gcc.target/aarch64/sve/pcs/return_4_128.c |   2 -
>  .../aarch64/sve/pcs/return_4_2048.c   |   2 -
>  .../gcc.target/aarch64/sve/pcs/return_4_256.c |   2 -
>  .../gcc.target/aarch64/sve/pcs/return_4_512.c |   2 -
>  .../gcc.target/aarch64/sve/pcs/return_5.c |   2 -
>  .../aarch64/sve/pcs/return_5_1024.c   |   2 -
>  .../gcc.target/aarch64/sve/pcs/return_5_128.c |   2 -
>  .../aarch64/sve/pcs/return_5_2048.c   |   2 -
>  .../gcc.target/aarch64/sve/pcs/return_5_256.c |   2 -
>  .../gcc.target/aarch64/sve/pcs/return_5_512.c |   2 -
>  16 files changed, 814 insertions(+), 24 deletions(-)
>  create mode 100644 
> gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/svlast.c
>  create mode 100644 
> gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/svlast128_run.c
>  create mode 100644 
> gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/svlast256_run.c
>
> diff --git a/gcc/config/aarch64/aarch64-sve-builtins-base.cc 
> b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
> index cd9cace3c9b..db2b4dcaac9 100644
> --- a/gcc/config/aarch64/aarch64-sve-builtins-base.cc
> +++ b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
> @@ -1056,6 +1056,130 @@ class svlast_impl : public quiet
>  public:
>CONSTEXPR svlast_impl (int unspec) : m_unspec (unspec) {}
>
> +  bool is_lasta () const { return m_unspec == UNSPEC_LASTA; }
> +  bool is_lastb () const { return m_unspec == UNSPEC_LASTB; }
> +
> +  bool vect_all_same (tree v , int step) const

Nit: stray space after "v".

> +  {
> +int i;
> +int nelts = vector_cst_encoded_nelts (v);
> +int first_el = 0;
> +
> +for (i = first_el; i < nelts; i += step)
> +  if (VECTOR_CST_ENCODED_ELT (v, i) != VECTOR_CST_ENCODED_ELT (v, 
> first_el))

I think this should use !operand_equal_p (..., ..., 0).

Re: [PATCH] [PR96339] AArch64: Optimise svlast[ab]

2023-05-11 Thread Richard Sandiford via Gcc-patches
Tejas Belagod  writes:
> From: Tejas Belagod 
>
>   This PR optimizes an SVE intrinsics sequence such as
> svlasta (svptrue_pat_b8 (SV_VL1), x)
>   where a scalar is selected based on a constant predicate and a variable
>   vector. This sequence is optimized to return the corresponding element of
>   a NEON vector. For example,
> svlasta (svptrue_pat_b8 (SV_VL1), x)
>   returns
> umov w0, v0.b[1]
>   Likewise,
> svlastb (svptrue_pat_b8 (SV_VL1), x)
>   returns
> umov w0, v0.b[0]
>   This optimization only works provided the constant predicate maps to a range
>   that is within the bounds of a 128-bit NEON register.
>
> gcc/ChangeLog:
>
>   PR target/96339
>   * config/aarch64/aarch64-sve-builtins-base.cc (svlast_impl::fold): Fold 
> sve
>   calls that have a constant input predicate vector.
>   (svlast_impl::is_lasta): Query to check if intrinsic is svlasta.
>   (svlast_impl::is_lastb): Query to check if intrinsic is svlastb.
>   (svlast_impl::vect_all_same): Check if all vector elements are equal.
>
> gcc/testsuite/ChangeLog:
>
>   PR target/96339
>   * gcc.target/aarch64/sve/acle/general-c/svlast.c: New.
>   * gcc.target/aarch64/sve/acle/general-c/svlast128_run.c: New.
>   * gcc.target/aarch64/sve/acle/general-c/svlast256_run.c: New.
>   * gcc.target/aarch64/sve/pcs/return_4.c (caller_bf16): Fix asm
>   to expect optimized code for function body.
>   * gcc.target/aarch64/sve/pcs/return_4_128.c (caller_bf16): Likewise.
>   * gcc.target/aarch64/sve/pcs/return_4_256.c (caller_bf16): Likewise.
>   * gcc.target/aarch64/sve/pcs/return_4_512.c (caller_bf16): Likewise.
>   * gcc.target/aarch64/sve/pcs/return_4_1024.c (caller_bf16): Likewise.
>   * gcc.target/aarch64/sve/pcs/return_4_2048.c (caller_bf16): Likewise.
>   * gcc.target/aarch64/sve/pcs/return_5.c (caller_bf16): Likewise.
>   * gcc.target/aarch64/sve/pcs/return_5_128.c (caller_bf16): Likewise.
>   * gcc.target/aarch64/sve/pcs/return_5_256.c (caller_bf16): Likewise.
>   * gcc.target/aarch64/sve/pcs/return_5_512.c (caller_bf16): Likewise.
>   * gcc.target/aarch64/sve/pcs/return_5_1024.c (caller_bf16): Likewise.
>   * gcc.target/aarch64/sve/pcs/return_5_2048.c (caller_bf16): Likewise.
> ---
>  .../aarch64/aarch64-sve-builtins-base.cc  | 124 +++
>  .../aarch64/sve/acle/general-c/svlast.c   |  63 
>  .../sve/acle/general-c/svlast128_run.c| 313 +
>  .../sve/acle/general-c/svlast256_run.c| 314 ++
>  .../gcc.target/aarch64/sve/pcs/return_4.c |   2 -
>  .../aarch64/sve/pcs/return_4_1024.c   |   2 -
>  .../gcc.target/aarch64/sve/pcs/return_4_128.c |   2 -
>  .../aarch64/sve/pcs/return_4_2048.c   |   2 -
>  .../gcc.target/aarch64/sve/pcs/return_4_256.c |   2 -
>  .../gcc.target/aarch64/sve/pcs/return_4_512.c |   2 -
>  .../gcc.target/aarch64/sve/pcs/return_5.c |   2 -
>  .../aarch64/sve/pcs/return_5_1024.c   |   2 -
>  .../gcc.target/aarch64/sve/pcs/return_5_128.c |   2 -
>  .../aarch64/sve/pcs/return_5_2048.c   |   2 -
>  .../gcc.target/aarch64/sve/pcs/return_5_256.c |   2 -
>  .../gcc.target/aarch64/sve/pcs/return_5_512.c |   2 -
>  16 files changed, 814 insertions(+), 24 deletions(-)
>  create mode 100644 
> gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/svlast.c
>  create mode 100644 
> gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/svlast128_run.c
>  create mode 100644 
> gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/svlast256_run.c
>
> diff --git a/gcc/config/aarch64/aarch64-sve-builtins-base.cc 
> b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
> index cd9cace3c9b..db2b4dcaac9 100644
> --- a/gcc/config/aarch64/aarch64-sve-builtins-base.cc
> +++ b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
> @@ -1056,6 +1056,130 @@ class svlast_impl : public quiet
>  public:
>CONSTEXPR svlast_impl (int unspec) : m_unspec (unspec) {}
>  
> +  bool is_lasta () const { return m_unspec == UNSPEC_LASTA; }
> +  bool is_lastb () const { return m_unspec == UNSPEC_LASTB; }
> +
> +  bool vect_all_same (tree v , int step) const

Nit: stray space after "v".

> +  {
> +int i;
> +int nelts = vector_cst_encoded_nelts (v);
> +int first_el = 0;
> +
> +for (i = first_el; i < nelts; i += step)
> +  if (VECTOR_CST_ENCODED_ELT (v, i) != VECTOR_CST_ENCODED_ELT (v, 
> first_el))

I think this should use !operand_equal_p (..., ..., 0).

> + return false;
> +
> +return true;
> +  }
> +
> +  /* Fold a svlast{a/b} call with constant predicate to a BIT_FIELD_REF.
> + BIT_FIELD_REF lowers to a NEON element extract, so we have to make sure
> + the index of the element being accessed is in the range of a NEON vector
> + width.  */

s/NEON/Advanced SIMD/.  Same in later comments

> +  gimple *fold (gimple_folder & f) const override
> +  {
> +tree pred = gimple_call_arg (f.call, 0);
> +tree val = gimple_call_arg (f.call, 1);
> 

Re: [PATCH] [PR96339] AArch64: Optimise svlast[ab]

2023-05-03 Thread Tejas Belagod via Gcc-patches
[Ping]

From: Tejas Belagod 
Date: Thursday, March 16, 2023 at 5:09 PM
To: gcc-patches@gcc.gnu.org 
Cc: Tejas Belagod , Richard Sandiford 

Subject: [PATCH] [PR96339] AArch64: Optimise svlast[ab]
From: Tejas Belagod 

  This PR optimizes an SVE intrinsics sequence such as
svlasta (svptrue_pat_b8 (SV_VL1), x)
  where a scalar is selected based on a constant predicate and a variable
  vector. This sequence is optimized to return the corresponding element of
  a NEON vector. For example,
svlasta (svptrue_pat_b8 (SV_VL1), x)
  returns
umov w0, v0.b[1]
  Likewise,
svlastb (svptrue_pat_b8 (SV_VL1), x)
  returns
umov w0, v0.b[0]
  This optimization only works provided the constant predicate maps to a range
  that is within the bounds of a 128-bit NEON register.

gcc/ChangeLog:

PR target/96339
* config/aarch64/aarch64-sve-builtins-base.cc (svlast_impl::fold): Fold 
sve
calls that have a constant input predicate vector.
(svlast_impl::is_lasta): Query to check if intrinsic is svlasta.
(svlast_impl::is_lastb): Query to check if intrinsic is svlastb.
(svlast_impl::vect_all_same): Check if all vector elements are equal.

gcc/testsuite/ChangeLog:

PR target/96339
* gcc.target/aarch64/sve/acle/general-c/svlast.c: New.
* gcc.target/aarch64/sve/acle/general-c/svlast128_run.c: New.
* gcc.target/aarch64/sve/acle/general-c/svlast256_run.c: New.
* gcc.target/aarch64/sve/pcs/return_4.c (caller_bf16): Fix asm
to expect optimized code for function body.
* gcc.target/aarch64/sve/pcs/return_4_128.c (caller_bf16): Likewise.
* gcc.target/aarch64/sve/pcs/return_4_256.c (caller_bf16): Likewise.
* gcc.target/aarch64/sve/pcs/return_4_512.c (caller_bf16): Likewise.
* gcc.target/aarch64/sve/pcs/return_4_1024.c (caller_bf16): Likewise.
* gcc.target/aarch64/sve/pcs/return_4_2048.c (caller_bf16): Likewise.
* gcc.target/aarch64/sve/pcs/return_5.c (caller_bf16): Likewise.
* gcc.target/aarch64/sve/pcs/return_5_128.c (caller_bf16): Likewise.
* gcc.target/aarch64/sve/pcs/return_5_256.c (caller_bf16): Likewise.
* gcc.target/aarch64/sve/pcs/return_5_512.c (caller_bf16): Likewise.
* gcc.target/aarch64/sve/pcs/return_5_1024.c (caller_bf16): Likewise.
* gcc.target/aarch64/sve/pcs/return_5_2048.c (caller_bf16): Likewise.
---
 .../aarch64/aarch64-sve-builtins-base.cc  | 124 +++
 .../aarch64/sve/acle/general-c/svlast.c   |  63 
 .../sve/acle/general-c/svlast128_run.c| 313 +
 .../sve/acle/general-c/svlast256_run.c| 314 ++
 .../gcc.target/aarch64/sve/pcs/return_4.c |   2 -
 .../aarch64/sve/pcs/return_4_1024.c   |   2 -
 .../gcc.target/aarch64/sve/pcs/return_4_128.c |   2 -
 .../aarch64/sve/pcs/return_4_2048.c   |   2 -
 .../gcc.target/aarch64/sve/pcs/return_4_256.c |   2 -
 .../gcc.target/aarch64/sve/pcs/return_4_512.c |   2 -
 .../gcc.target/aarch64/sve/pcs/return_5.c |   2 -
 .../aarch64/sve/pcs/return_5_1024.c   |   2 -
 .../gcc.target/aarch64/sve/pcs/return_5_128.c |   2 -
 .../aarch64/sve/pcs/return_5_2048.c   |   2 -
 .../gcc.target/aarch64/sve/pcs/return_5_256.c |   2 -
 .../gcc.target/aarch64/sve/pcs/return_5_512.c |   2 -
 16 files changed, 814 insertions(+), 24 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/svlast.c
 create mode 100644 
gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/svlast128_run.c
 create mode 100644 
gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/svlast256_run.c

diff --git a/gcc/config/aarch64/aarch64-sve-builtins-base.cc 
b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
index cd9cace3c9b..db2b4dcaac9 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins-base.cc
+++ b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
@@ -1056,6 +1056,130 @@ class svlast_impl : public quiet
 public:
   CONSTEXPR svlast_impl (int unspec) : m_unspec (unspec) {}

+  bool is_lasta () const { return m_unspec == UNSPEC_LASTA; }
+  bool is_lastb () const { return m_unspec == UNSPEC_LASTB; }
+
+  bool vect_all_same (tree v , int step) const
+  {
+int i;
+int nelts = vector_cst_encoded_nelts (v);
+int first_el = 0;
+
+for (i = first_el; i < nelts; i += step)
+  if (VECTOR_CST_ENCODED_ELT (v, i) != VECTOR_CST_ENCODED_ELT (v, 
first_el))
+   return false;
+
+return true;
+  }
+
+  /* Fold a svlast{a/b} call with constant predicate to a BIT_FIELD_REF.
+ BIT_FIELD_REF lowers to a NEON element extract, so we have to make sure
+ the index of the element being accessed is in the range of a NEON vector
+ width.  */
+  gimple *fold (gimple_folder & f) const override
+  {
+tree pred = gimple_call_arg (f.call, 0);
+tree val = gimple_call_arg (f.call, 1);
+
+if (TREE_CODE (pred) == VECTOR_CST)
+  {
+   HOST_WIDE_INT pos;
+   unsigned int const_vg;
+ 

[PATCH] [PR96339] AArch64: Optimise svlast[ab]

2023-03-16 Thread Tejas Belagod via Gcc-patches
From: Tejas Belagod 

  This PR optimizes an SVE intrinsics sequence such as
svlasta (svptrue_pat_b8 (SV_VL1), x)
  where a scalar is selected based on a constant predicate and a variable
  vector. This sequence is optimized to return the corresponding element of
  a NEON vector. For example,
svlasta (svptrue_pat_b8 (SV_VL1), x)
  returns
umov w0, v0.b[1]
  Likewise,
svlastb (svptrue_pat_b8 (SV_VL1), x)
  returns
umov w0, v0.b[0]
  This optimization only works provided the constant predicate maps to a range
  that is within the bounds of a 128-bit NEON register.
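
For concreteness, the sequences being optimized can be written as the
following compilable functions (function names are illustrative; assumes an
SVE-enabled compiler, e.g. -O2 -march=armv8.2-a+sve):

    #include <stdint.h>
    #include <arm_sve.h>

    /* SV_VL1 makes only element 0 active.  LASTA returns the element
       after the last active one, so the expected codegen is the
       element 1 extract: umov w0, v0.b[1].  */
    int8_t
    use_lasta (svint8_t x)
    {
      return svlasta (svptrue_pat_b8 (SV_VL1), x);
    }

    /* LASTB returns the last active element itself, so the expected
       codegen is umov w0, v0.b[0].  */
    int8_t
    use_lastb (svint8_t x)
    {
      return svlastb (svptrue_pat_b8 (SV_VL1), x);
    }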

gcc/ChangeLog:

PR target/96339
* config/aarch64/aarch64-sve-builtins-base.cc (svlast_impl::fold): Fold 
sve
calls that have a constant input predicate vector.
(svlast_impl::is_lasta): Query to check if intrinsic is svlasta.
(svlast_impl::is_lastb): Query to check if intrinsic is svlastb.
(svlast_impl::vect_all_same): Check if all vector elements are equal.

gcc/testsuite/ChangeLog:

PR target/96339
* gcc.target/aarch64/sve/acle/general-c/svlast.c: New.
* gcc.target/aarch64/sve/acle/general-c/svlast128_run.c: New.
* gcc.target/aarch64/sve/acle/general-c/svlast256_run.c: New.
* gcc.target/aarch64/sve/pcs/return_4.c (caller_bf16): Fix asm
to expect optimized code for function body.
* gcc.target/aarch64/sve/pcs/return_4_128.c (caller_bf16): Likewise.
* gcc.target/aarch64/sve/pcs/return_4_256.c (caller_bf16): Likewise.
* gcc.target/aarch64/sve/pcs/return_4_512.c (caller_bf16): Likewise.
* gcc.target/aarch64/sve/pcs/return_4_1024.c (caller_bf16): Likewise.
* gcc.target/aarch64/sve/pcs/return_4_2048.c (caller_bf16): Likewise.
* gcc.target/aarch64/sve/pcs/return_5.c (caller_bf16): Likewise.
* gcc.target/aarch64/sve/pcs/return_5_128.c (caller_bf16): Likewise.
* gcc.target/aarch64/sve/pcs/return_5_256.c (caller_bf16): Likewise.
* gcc.target/aarch64/sve/pcs/return_5_512.c (caller_bf16): Likewise.
* gcc.target/aarch64/sve/pcs/return_5_1024.c (caller_bf16): Likewise.
* gcc.target/aarch64/sve/pcs/return_5_2048.c (caller_bf16): Likewise.
---
 .../aarch64/aarch64-sve-builtins-base.cc  | 124 +++
 .../aarch64/sve/acle/general-c/svlast.c   |  63 
 .../sve/acle/general-c/svlast128_run.c| 313 +
 .../sve/acle/general-c/svlast256_run.c| 314 ++
 .../gcc.target/aarch64/sve/pcs/return_4.c |   2 -
 .../aarch64/sve/pcs/return_4_1024.c   |   2 -
 .../gcc.target/aarch64/sve/pcs/return_4_128.c |   2 -
 .../aarch64/sve/pcs/return_4_2048.c   |   2 -
 .../gcc.target/aarch64/sve/pcs/return_4_256.c |   2 -
 .../gcc.target/aarch64/sve/pcs/return_4_512.c |   2 -
 .../gcc.target/aarch64/sve/pcs/return_5.c |   2 -
 .../aarch64/sve/pcs/return_5_1024.c   |   2 -
 .../gcc.target/aarch64/sve/pcs/return_5_128.c |   2 -
 .../aarch64/sve/pcs/return_5_2048.c   |   2 -
 .../gcc.target/aarch64/sve/pcs/return_5_256.c |   2 -
 .../gcc.target/aarch64/sve/pcs/return_5_512.c |   2 -
 16 files changed, 814 insertions(+), 24 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/svlast.c
 create mode 100644 
gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/svlast128_run.c
 create mode 100644 
gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/svlast256_run.c

diff --git a/gcc/config/aarch64/aarch64-sve-builtins-base.cc 
b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
index cd9cace3c9b..db2b4dcaac9 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins-base.cc
+++ b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
@@ -1056,6 +1056,130 @@ class svlast_impl : public quiet
 public:
   CONSTEXPR svlast_impl (int unspec) : m_unspec (unspec) {}
 
+  bool is_lasta () const { return m_unspec == UNSPEC_LASTA; }
+  bool is_lastb () const { return m_unspec == UNSPEC_LASTB; }
+
+  bool vect_all_same (tree v , int step) const
+  {
+int i;
+int nelts = vector_cst_encoded_nelts (v);
+int first_el = 0;
+
+for (i = first_el; i < nelts; i += step)
+  if (VECTOR_CST_ENCODED_ELT (v, i) != VECTOR_CST_ENCODED_ELT (v, 
first_el))
+   return false;
+
+return true;
+  }
+
+  /* Fold a svlast{a/b} call with constant predicate to a BIT_FIELD_REF.
+ BIT_FIELD_REF lowers to a NEON element extract, so we have to make sure
+ the index of the element being accessed is in the range of a NEON vector
+ width.  */
+  gimple *fold (gimple_folder & f) const override
+  {
+tree pred = gimple_call_arg (f.call, 0);
+tree val = gimple_call_arg (f.call, 1);
+
+if (TREE_CODE (pred) == VECTOR_CST)
+  {
+   HOST_WIDE_INT pos;
+   unsigned int const_vg;
+   int i = 0;
+   int step = f.type_suffix (0).element_bytes;
+   int step_1 = gcd (step, VECTOR_CST_NPATTERNS (pred));
+   int npats = VECTOR_CST_NPATTERNS (pred);
+   unsigned HOST_WIDE_