Re: [PATCH] [PR96339] AArch64: Optimise svlast[ab]
From: Richard Sandiford
Date: Friday, May 19, 2023 at 3:20 PM
To: Tejas Belagod
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [PATCH] [PR96339] AArch64: Optimise svlast[ab]

Tejas Belagod writes:
> Am I correct to understand that we still need to check for the case when
> there's a repeating non-zero element in the case of NELTS_PER_PATTERN == 2?
> E.g. { 0, 0, 1, 1, 1, 1 }, which should be encoded as { 0, 0, 1, 1 } with
> NPATTERNS = 2?

Yeah, that's right. The current handling for NPATTERNS == 2 looked good
to me. It was the other two cases that I was worried about.

Thanks,
Richard

Thanks for all the reviews. I've posted a new version of the patch here:
https://gcc.gnu.org/pipermail/gcc-patches/2023-June/621310.html

Thanks,
Tejas.
Re: [PATCH] [PR96339] AArch64: Optimise svlast[ab]
Tejas Belagod writes:
> Am I correct to understand that we still need to check for the case when
> there's a repeating non-zero element in the case of NELTS_PER_PATTERN == 2?
> E.g. { 0, 0, 1, 1, 1, 1 }, which should be encoded as { 0, 0, 1, 1 } with
> NPATTERNS = 2?

Yeah, that's right. The current handling for NPATTERNS == 2 looked good
to me. It was the other two cases that I was worried about.

Thanks,
Richard
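For readers following the encoding discussion, the case being agreed on
here is easy to model outside GCC. The sketch below is a standalone
illustration only: the struct and the leading_elements_then_zero helper
are hypothetical stand-ins, not GCC's tree representation or API. It
shows why a predicate with NELTS_PER_PATTERN == 2 can be treated as
NPATTERNS leading elements followed by all-inactive elements only if
every encoded element from NPATTERNS onwards is zero, which correctly
rejects { 0, 0, 1, 1, 1, 1, ... }.

#include <cassert>
#include <vector>

/* Illustrative stand-in for GCC's VECTOR_CST encoding (not GCC API).
   The encoded array holds NPATTERNS * NELTS_PER_PATTERN values; for
   NELTS_PER_PATTERN == 2, the values at indices
   [NPATTERNS .. 2 * NPATTERNS - 1] are the ones that repeat to fill
   the rest of the (possibly variable-length) vector.  */
struct encoded_vec
{
  unsigned npatterns;
  unsigned nelts_per_pattern;
  std::vector<int> encoded;

  /* True if the vector can be treated as NPATTERNS leading elements
     followed by all-inactive (zero) elements.  Punt for any other
     NELTS_PER_PATTERN, as agreed in the thread.  */
  bool leading_elements_then_zero () const
  {
    if (nelts_per_pattern != 2)
      return false;
    for (unsigned i = npatterns; i < encoded.size (); ++i)
      if (encoded[i] != 0)
	return false;
    return true;
  }
};

int main ()
{
  /* {0, 1, 0, 0, 0, 0, ...}: encoded {0, 1, 0, 0}, NPATTERNS = 2.
     One active leading element, then all inactive: foldable.  */
  assert (encoded_vec{2, 2, {0, 1, 0, 0}}.leading_elements_then_zero ());

  /* {0, 0, 1, 1, 1, 1, ...}: encoded {0, 0, 1, 1}, NPATTERNS = 2.
     The repeated part contains 1s, so the fold must reject it.  */
  assert (!encoded_vec{2, 2, {0, 0, 1, 1}}.leading_elements_then_zero ());
  return 0;
}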
Re: [PATCH] [PR96339] AArch64: Optimise svlast[ab]
From: Richard Sandiford
Date: Tuesday, May 16, 2023 at 5:36 PM
To: Tejas Belagod
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [PATCH] [PR96339] AArch64: Optimise svlast[ab]

Tejas Belagod writes:
>>> +	  {
>>> +	    b = build3 (BIT_FIELD_REF, TREE_TYPE (f.lhs), val,
>>> +			bitsize_int (step * BITS_PER_UNIT),
>>> +			bitsize_int ((16 - step) * BITS_PER_UNIT));
>>> +
>>> +	    return gimple_build_assign (f.lhs, b);
>>> +	  }
>>> +
>>> +	/* If VECTOR_CST_NELTS_PER_PATTERN (pred) == 2 and every multiple of
>>> +	   'step_1' in
>>> +	   [VECTOR_CST_NPATTERNS .. VECTOR_CST_ENCODED_NELTS - 1]
>>> +	   is zero, then we can treat the vector as VECTOR_CST_NPATTERNS
>>> +	   elements followed by all inactive elements.  */
>>> +	if (!const_vl && VECTOR_CST_NELTS_PER_PATTERN (pred) == 2)
>>
>> Following on from the above, maybe use:
>>
>>   !VECTOR_CST_NELTS (pred).is_constant ()
>>
>> instead of !const_vl here.
>>
>> I have a horrible suspicion that I'm contradicting our earlier discussion
>> here, sorry, but: I think we have to return null if NELTS_PER_PATTERN != 2.
>>
>> IIUC, the NPATTERNS .. ENCODED_ELTS represent the repeated part of the
>> encoded constant. This means the repetition occurs if NELTS_PER_PATTERN
>> == 2, IOW the base1 repeats in the encoding. This loop is checking this
>> condition and looks for a 1 in the repeated part of the
>> NELTS_PER_PATTERN == 2 encoding in a VL vector. Please correct me if I'm
>> misunderstanding here.
>
> NELTS_PER_PATTERN == 1 is also a repeating pattern: it means that the
> entire sequence is repeated to fill a vector. So if an NELTS_PER_PATTERN
> == 1 constant has elements {0, 1, 0, 0}, the vector is:
>
>   {0, 1, 0, 0, 0, 1, 0, 0, ...}
>
> Wouldn't vect_all_same (pred, step) cover this case for a given value of
> step?
>
> and the optimisation can't handle that. NELTS_PER_PATTERN == 3 isn't
> likely to occur for predicates, but in principle it has the same problem.
>
> OK, I had misunderstood the encoding to always make base1 the repeating
> value by adjusting NPATTERNS accordingly -- I didn't know you could also
> have the base2 value and beyond encoding the repeat value. In this case,
> could I just remove the NELTS_PER_PATTERN == 2 condition, so that the
> enclosed loop checks for a repeating '1' in the repeated part of the
> encoded pattern?

But for NELTS_PER_PATTERN == 1, the whole encoded sequence repeats.
So you would have to start the check at element 0 rather than NPATTERNS.
And then (for NELTS_PER_PATTERN == 1) the loop would reject any constant
that has a nonzero element. But all valid zero-vector cases have been
handled by this point, so the effect wouldn't be useful.

It should never be the case that all elements from NPATTERNS onwards
are zero for NELTS_PER_PATTERN == 3; that case should be canonicalised
to NELTS_PER_PATTERN == 2 instead. So in practice it's simpler and more
obviously correct to punt when NELTS_PER_PATTERN != 2.

Thanks for the clarification. I understand all the points about punting
when NELTS_PER_PATTERN != 2 except one: am I correct to understand that
we still need to check for the case where there's a repeating non-zero
element when NELTS_PER_PATTERN == 2? E.g. { 0, 0, 1, 1, 1, 1 }, which
should be encoded as { 0, 0, 1, 1 } with NPATTERNS = 2?

Thanks,
Tejas.

Thanks,
Richard
Re: [PATCH] [PR96339] AArch64: Optimise svlast[ab]
Tejas Belagod writes:
>>> +	  {
>>> +	    b = build3 (BIT_FIELD_REF, TREE_TYPE (f.lhs), val,
>>> +			bitsize_int (step * BITS_PER_UNIT),
>>> +			bitsize_int ((16 - step) * BITS_PER_UNIT));
>>> +
>>> +	    return gimple_build_assign (f.lhs, b);
>>> +	  }
>>> +
>>> +	/* If VECTOR_CST_NELTS_PER_PATTERN (pred) == 2 and every multiple of
>>> +	   'step_1' in
>>> +	   [VECTOR_CST_NPATTERNS .. VECTOR_CST_ENCODED_NELTS - 1]
>>> +	   is zero, then we can treat the vector as VECTOR_CST_NPATTERNS
>>> +	   elements followed by all inactive elements.  */
>>> +	if (!const_vl && VECTOR_CST_NELTS_PER_PATTERN (pred) == 2)
>>
>> Following on from the above, maybe use:
>>
>>   !VECTOR_CST_NELTS (pred).is_constant ()
>>
>> instead of !const_vl here.
>>
>> I have a horrible suspicion that I'm contradicting our earlier discussion
>> here, sorry, but: I think we have to return null if NELTS_PER_PATTERN != 2.
>>
>> IIUC, the NPATTERNS .. ENCODED_ELTS represent the repeated part of the
>> encoded constant. This means the repetition occurs if NELTS_PER_PATTERN
>> == 2, IOW the base1 repeats in the encoding. This loop is checking this
>> condition and looks for a 1 in the repeated part of the
>> NELTS_PER_PATTERN == 2 encoding in a VL vector. Please correct me if I'm
>> misunderstanding here.
>
> NELTS_PER_PATTERN == 1 is also a repeating pattern: it means that the
> entire sequence is repeated to fill a vector. So if an NELTS_PER_PATTERN
> == 1 constant has elements {0, 1, 0, 0}, the vector is:
>
>   {0, 1, 0, 0, 0, 1, 0, 0, ...}
>
> Wouldn't vect_all_same (pred, step) cover this case for a given value of
> step?
>
> and the optimisation can't handle that. NELTS_PER_PATTERN == 3 isn't
> likely to occur for predicates, but in principle it has the same problem.
>
> OK, I had misunderstood the encoding to always make base1 the repeating
> value by adjusting NPATTERNS accordingly -- I didn't know you could also
> have the base2 value and beyond encoding the repeat value. In this case,
> could I just remove the NELTS_PER_PATTERN == 2 condition, so that the
> enclosed loop checks for a repeating '1' in the repeated part of the
> encoded pattern?

But for NELTS_PER_PATTERN == 1, the whole encoded sequence repeats.
So you would have to start the check at element 0 rather than NPATTERNS.
And then (for NELTS_PER_PATTERN == 1) the loop would reject any constant
that has a nonzero element. But all valid zero-vector cases have been
handled by this point, so the effect wouldn't be useful.

It should never be the case that all elements from NPATTERNS onwards
are zero for NELTS_PER_PATTERN == 3; that case should be canonicalised
to NELTS_PER_PATTERN == 2 instead. So in practice it's simpler and more
obviously correct to punt when NELTS_PER_PATTERN != 2.

Thanks,
Richard
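Richard's NELTS_PER_PATTERN == 1 example can be checked by expanding the
encoding by hand. Below is a standalone sketch of the decoding rule as
described in this thread (the types and names are illustrative, not
GCC's implementation): with NELTS_PER_PATTERN == 1 the whole encoded
sequence repeats, so a scan that starts at NPATTERNS would miss the
repeating 1.

#include <cstdio>
#include <vector>

/* Standalone model of the encoding (illustrative, not GCC API).
   A vector is NPATTERNS interleaved patterns; each pattern contributes
   NELTS_PER_PATTERN encoded leading values and then extends itself:
   for NELTS_PER_PATTERN <= 2 by repeating its last encoded value, for
   NELTS_PER_PATTERN == 3 as an arithmetic series.  */
struct encoded_vec
{
  unsigned npatterns;
  unsigned nelts_per_pattern;
  std::vector<int> encoded;	// npatterns * nelts_per_pattern values

  int element (unsigned i) const
  {
    unsigned pattern = i % npatterns;
    unsigned pos = i / npatterns;	// index within the pattern
    if (pos < nelts_per_pattern)
      return encoded[pos * npatterns + pattern];
    int last = encoded[(nelts_per_pattern - 1) * npatterns + pattern];
    if (nelts_per_pattern <= 2)
      return last;
    int prev = encoded[(nelts_per_pattern - 2) * npatterns + pattern];
    return last + (int) (pos - (nelts_per_pattern - 1)) * (last - prev);
  }
};

int main ()
{
  /* NELTS_PER_PATTERN == 1, encoded {0, 1, 0, 0}: the *whole* sequence
     repeats, giving {0, 1, 0, 0, 0, 1, 0, 0, ...} -- which is why the
     fold punts rather than scanning from NPATTERNS onwards.  */
  encoded_vec v{4, 1, {0, 1, 0, 0}};
  for (unsigned i = 0; i < 12; ++i)
    printf ("%d ", v.element (i));
  printf ("\n");	// prints: 0 1 0 0 0 1 0 0 0 1 0 0
  return 0;
}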
Re: [PATCH] [PR96339] AArch64: Optimise svlast[ab]
From: Richard Sandiford
Date: Tuesday, May 16, 2023 at 2:15 PM
To: Tejas Belagod
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [PATCH] [PR96339] AArch64: Optimise svlast[ab]

Tejas Belagod writes:
>> +  {
>> +    int i;
>> +    int nelts = vector_cst_encoded_nelts (v);
>> +    int first_el = 0;
>> +
>> +    for (i = first_el; i < nelts; i += step)
>> +      if (VECTOR_CST_ENCODED_ELT (v, i) != VECTOR_CST_ENCODED_ELT (v, first_el))
>
> I think this should use !operand_equal_p (..., ..., 0).
>
> Oops! I wonder why I thought VECTOR_CST_ENCODED_ELT returned a constant!
> Thanks for spotting that.

It does only return a constant. But there can be multiple trees with
the same constant value, through things like TREE_OVERFLOW (not sure
where things stand on expunging that from gimple) and the fact that
gimple does not maintain a distinction between different types that
have the same mode and signedness. (E.g. on ILP32 hosts, gimple does
not maintain a distinction between int and long, even though int 0 and
long 0 are different trees.)

> Also, should the flags here be OEP_ONLY_CONST?

Nah, just 0 should be fine.

>> +	return false;
>> +
>> +    return true;
>> +  }
>> +
>> +  /* Fold a svlast{a/b} call with constant predicate to a BIT_FIELD_REF.
>> +     BIT_FIELD_REF lowers to a NEON element extract, so we have to make sure
>> +     the index of the element being accessed is in the range of a NEON vector
>> +     width.  */
>
> s/NEON/Advanced SIMD/. Same in later comments.
>
>> +  gimple *fold (gimple_folder & f) const override
>> +  {
>> +    tree pred = gimple_call_arg (f.call, 0);
>> +    tree val = gimple_call_arg (f.call, 1);
>> +
>> +    if (TREE_CODE (pred) == VECTOR_CST)
>> +      {
>> +	HOST_WIDE_INT pos;
>> +	unsigned int const_vg;
>> +	int i = 0;
>> +	int step = f.type_suffix (0).element_bytes;
>> +	int step_1 = gcd (step, VECTOR_CST_NPATTERNS (pred));
>> +	int npats = VECTOR_CST_NPATTERNS (pred);
>> +	unsigned HOST_WIDE_INT nelts = vector_cst_encoded_nelts (pred);
>> +	tree b = NULL_TREE;
>> +	bool const_vl = aarch64_sve_vg.is_constant (&const_vg);
>
> I think this might be left over from previous versions, but:
> const_vg isn't used and const_vl is only used once, so I think it
> would be better to remove them.
>
>> +
>> +	/* We can optimize 2 cases common to variable and fixed-length cases
>> +	   without a linear search of the predicate vector:
>> +	   1. LASTA if predicate is all true, return element 0.
>> +	   2. LASTA if predicate all false, return element 0.  */
>> +	if (is_lasta () && vect_all_same (pred, step_1))
>> +	  {
>> +	    b = build3 (BIT_FIELD_REF, TREE_TYPE (f.lhs), val,
>> +			bitsize_int (step * BITS_PER_UNIT), bitsize_int (0));
>> +	    return gimple_build_assign (f.lhs, b);
>> +	  }
>> +
>> +	/* Handle the all-false case for LASTB where SVE VL == 128b -
>> +	   return the highest numbered element.  */
>> +	if (is_lastb () && known_eq (BYTES_PER_SVE_VECTOR, 16)
>> +	    && vect_all_same (pred, step_1)
>> +	    && integer_zerop (VECTOR_CST_ENCODED_ELT (pred, 0)))
>
> Formatting nit: one condition per line once one line isn't enough.
>
>> +	  {
>> +	    b = build3 (BIT_FIELD_REF, TREE_TYPE (f.lhs), val,
>> +			bitsize_int (step * BITS_PER_UNIT),
>> +			bitsize_int ((16 - step) * BITS_PER_UNIT));
>> +
>> +	    return gimple_build_assign (f.lhs, b);
>> +	  }
>> +
>> +	/* If VECTOR_CST_NELTS_PER_PATTERN (pred) == 2 and every multiple of
>> +	   'step_1' in
>> +	   [VECTOR_CST_NPATTERNS .. VECTOR_CST_ENCODED_NELTS - 1]
>> +	   is zero, then we can treat the vector as VECTOR_CST_NPATTERNS
>> +	   elements followed by all inactive elements.  */
>> +	if (!const_vl && VECTOR_CST_NELTS_PER_PATTERN (pred) == 2)
>
> Following on from the above, maybe use:
>
>   !VECTOR_CST_NELTS (pred).is_constant ()
>
> instead of !const_vl here.
>
> I have a horrible suspicion that I'm contradicting our earlier discussion
> here, sorry, but: I think we have to return null if NELTS_PER_PATTERN != 2.
>
> IIUC, the NPATTERNS .. ENCODED_ELTS represent the repeated part of the
> encoded constant. This means the repetition occurs if NELTS_PER_PATTERN
> == 2, IOW the base1 repeats in the encoding. This loop is checking this
> condition and looks for a 1 in the repeated part of the
> NELTS_PER_PATTERN == 2 encoding in a VL vector. Please correct me if I'm
> misunderstanding here.
Re: [PATCH] [PR96339] AArch64: Optimise svlast[ab]
Tejas Belagod writes:
>> +  {
>> +    int i;
>> +    int nelts = vector_cst_encoded_nelts (v);
>> +    int first_el = 0;
>> +
>> +    for (i = first_el; i < nelts; i += step)
>> +      if (VECTOR_CST_ENCODED_ELT (v, i) != VECTOR_CST_ENCODED_ELT (v, first_el))
>
> I think this should use !operand_equal_p (..., ..., 0).
>
> Oops! I wonder why I thought VECTOR_CST_ENCODED_ELT returned a constant!
> Thanks for spotting that.

It does only return a constant. But there can be multiple trees with
the same constant value, through things like TREE_OVERFLOW (not sure
where things stand on expunging that from gimple) and the fact that
gimple does not maintain a distinction between different types that
have the same mode and signedness. (E.g. on ILP32 hosts, gimple does
not maintain a distinction between int and long, even though int 0 and
long 0 are different trees.)

> Also, should the flags here be OEP_ONLY_CONST?

Nah, just 0 should be fine.

>> +	return false;
>> +
>> +    return true;
>> +  }
>> +
>> +  /* Fold a svlast{a/b} call with constant predicate to a BIT_FIELD_REF.
>> +     BIT_FIELD_REF lowers to a NEON element extract, so we have to make sure
>> +     the index of the element being accessed is in the range of a NEON vector
>> +     width.  */
>
> s/NEON/Advanced SIMD/. Same in later comments.
>
>> +  gimple *fold (gimple_folder & f) const override
>> +  {
>> +    tree pred = gimple_call_arg (f.call, 0);
>> +    tree val = gimple_call_arg (f.call, 1);
>> +
>> +    if (TREE_CODE (pred) == VECTOR_CST)
>> +      {
>> +	HOST_WIDE_INT pos;
>> +	unsigned int const_vg;
>> +	int i = 0;
>> +	int step = f.type_suffix (0).element_bytes;
>> +	int step_1 = gcd (step, VECTOR_CST_NPATTERNS (pred));
>> +	int npats = VECTOR_CST_NPATTERNS (pred);
>> +	unsigned HOST_WIDE_INT nelts = vector_cst_encoded_nelts (pred);
>> +	tree b = NULL_TREE;
>> +	bool const_vl = aarch64_sve_vg.is_constant (&const_vg);
>
> I think this might be left over from previous versions, but:
> const_vg isn't used and const_vl is only used once, so I think it
> would be better to remove them.
>
>> +
>> +	/* We can optimize 2 cases common to variable and fixed-length cases
>> +	   without a linear search of the predicate vector:
>> +	   1. LASTA if predicate is all true, return element 0.
>> +	   2. LASTA if predicate all false, return element 0.  */
>> +	if (is_lasta () && vect_all_same (pred, step_1))
>> +	  {
>> +	    b = build3 (BIT_FIELD_REF, TREE_TYPE (f.lhs), val,
>> +			bitsize_int (step * BITS_PER_UNIT), bitsize_int (0));
>> +	    return gimple_build_assign (f.lhs, b);
>> +	  }
>> +
>> +	/* Handle the all-false case for LASTB where SVE VL == 128b -
>> +	   return the highest numbered element.  */
>> +	if (is_lastb () && known_eq (BYTES_PER_SVE_VECTOR, 16)
>> +	    && vect_all_same (pred, step_1)
>> +	    && integer_zerop (VECTOR_CST_ENCODED_ELT (pred, 0)))
>
> Formatting nit: one condition per line once one line isn't enough.
>
>> +	  {
>> +	    b = build3 (BIT_FIELD_REF, TREE_TYPE (f.lhs), val,
>> +			bitsize_int (step * BITS_PER_UNIT),
>> +			bitsize_int ((16 - step) * BITS_PER_UNIT));
>> +
>> +	    return gimple_build_assign (f.lhs, b);
>> +	  }
>> +
>> +	/* If VECTOR_CST_NELTS_PER_PATTERN (pred) == 2 and every multiple of
>> +	   'step_1' in
>> +	   [VECTOR_CST_NPATTERNS .. VECTOR_CST_ENCODED_NELTS - 1]
>> +	   is zero, then we can treat the vector as VECTOR_CST_NPATTERNS
>> +	   elements followed by all inactive elements.  */
>> +	if (!const_vl && VECTOR_CST_NELTS_PER_PATTERN (pred) == 2)
>
> Following on from the above, maybe use:
>
>   !VECTOR_CST_NELTS (pred).is_constant ()
>
> instead of !const_vl here.
>
> I have a horrible suspicion that I'm contradicting our earlier discussion
> here, sorry, but: I think we have to return null if NELTS_PER_PATTERN != 2.
>
> IIUC, the NPATTERNS .. ENCODED_ELTS represent the repeated part of the
> encoded constant. This means the repetition occurs if NELTS_PER_PATTERN
> == 2, IOW the base1 repeats in the encoding. This loop is checking this
> condition and looks for a 1 in the repeated part of the
> NELTS_PER_PATTERN == 2 encoding in a VL vector. Please correct me if I'm
> misunderstanding here.

NELTS_PER_PATTERN == 1 is also a repeating pattern: it means that the
entire sequence is repeated to fill a vector. So if an NELTS_PER_PATTERN
== 1 constant has elements {0, 1, 0, 0}, the vector is:

  {0, 1, 0, 0, 0, 1, 0, 0, ...}

and the optimisation can't handle that. NELTS_PER_PATTERN == 3 isn't
likely to occur for predicates, but in principle it has the same problem.

Thanks,
Richard
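The point at the top of this message, that two trees can denote the same
constant value while being distinct objects, is exactly why pointer
comparison is wrong in vect_all_same. Here is a toy stand-in (purely
illustrative; toy_cst and value_equal are not GCC's tree or
operand_equal_p):

#include <cassert>

/* Toy stand-in for a GCC integer-constant tree node.  In GCC, distinct
   nodes can carry the same value (e.g. via TREE_OVERFLOW, or int 0 vs
   long 0 on an ILP32 host), so pointer identity is not value identity.  */
struct toy_cst
{
  long value;
};

/* Value comparison, playing the role operand_equal_p (a, b, 0) plays
   for real trees.  */
static bool
value_equal (const toy_cst *a, const toy_cst *b)
{
  return a->value == b->value;
}

int
main ()
{
  toy_cst zero_int = { 0 };	/* e.g. (int) 0 */
  toy_cst zero_long = { 0 };	/* e.g. (long) 0 on an ILP32 host */

  const toy_cst *a = &zero_int;
  const toy_cst *b = &zero_long;

  /* Pointer comparison (what the original vect_all_same did) says the
     nodes differ...  */
  assert (a != b);
  /* ...but a value comparison says they are the same constant.  */
  assert (value_equal (a, b));
  return 0;
}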
Re: [PATCH] [PR96339] AArch64: Optimise svlast[ab]
Thanks for your comments, Richard.

From: Richard Sandiford
Date: Friday, May 12, 2023 at 1:02 AM
To: Tejas Belagod
Cc: gcc-patches@gcc.gnu.org, Tejas Belagod
Subject: Re: [PATCH] [PR96339] AArch64: Optimise svlast[ab]

Tejas Belagod writes:
> From: Tejas Belagod
>
> This PR optimizes an SVE intrinsics sequence such as
>   svlasta (svptrue_pat_b8 (SV_VL1), x)
> where a scalar is selected based on a constant predicate and a variable
> vector. This sequence is optimized to return the corresponding element
> of a NEON vector. For example,
>   svlasta (svptrue_pat_b8 (SV_VL1), x)
> returns
>   umov	w0, v0.b[1]
> Likewise,
>   svlastb (svptrue_pat_b8 (SV_VL1), x)
> returns
>   umov	w0, v0.b[0]
> This optimization only works provided the constant predicate maps to a
> range that is within the bounds of a 128-bit NEON register.
>
> gcc/ChangeLog:
>
>	PR target/96339
>	* config/aarch64/aarch64-sve-builtins-base.cc (svlast_impl::fold):
>	Fold sve calls that have a constant input predicate vector.
>	(svlast_impl::is_lasta): Query to check if intrinsic is svlasta.
>	(svlast_impl::is_lastb): Query to check if intrinsic is svlastb.
>	(svlast_impl::vect_all_same): Check if all vector elements are equal.
>
> gcc/testsuite/ChangeLog:
>
>	PR target/96339
>	* gcc.target/aarch64/sve/acle/general-c/svlast.c: New.
>	* gcc.target/aarch64/sve/acle/general-c/svlast128_run.c: New.
>	* gcc.target/aarch64/sve/acle/general-c/svlast256_run.c: New.
>	* gcc.target/aarch64/sve/pcs/return_4.c (caller_bf16): Fix asm
>	to expect optimized code for function body.
>	* gcc.target/aarch64/sve/pcs/return_4_128.c (caller_bf16): Likewise.
>	* gcc.target/aarch64/sve/pcs/return_4_256.c (caller_bf16): Likewise.
>	* gcc.target/aarch64/sve/pcs/return_4_512.c (caller_bf16): Likewise.
>	* gcc.target/aarch64/sve/pcs/return_4_1024.c (caller_bf16): Likewise.
>	* gcc.target/aarch64/sve/pcs/return_4_2048.c (caller_bf16): Likewise.
>	* gcc.target/aarch64/sve/pcs/return_5.c (caller_bf16): Likewise.
>	* gcc.target/aarch64/sve/pcs/return_5_128.c (caller_bf16): Likewise.
>	* gcc.target/aarch64/sve/pcs/return_5_256.c (caller_bf16): Likewise.
>	* gcc.target/aarch64/sve/pcs/return_5_512.c (caller_bf16): Likewise.
>	* gcc.target/aarch64/sve/pcs/return_5_1024.c (caller_bf16): Likewise.
>	* gcc.target/aarch64/sve/pcs/return_5_2048.c (caller_bf16): Likewise.
>
> ---
>  .../aarch64/aarch64-sve-builtins-base.cc      | 124 +++
>  .../aarch64/sve/acle/general-c/svlast.c       |  63
>  .../sve/acle/general-c/svlast128_run.c        | 313 +
>  .../sve/acle/general-c/svlast256_run.c        | 314 ++
>  .../gcc.target/aarch64/sve/pcs/return_4.c     |   2 -
>  .../aarch64/sve/pcs/return_4_1024.c           |   2 -
>  .../gcc.target/aarch64/sve/pcs/return_4_128.c |   2 -
>  .../aarch64/sve/pcs/return_4_2048.c           |   2 -
>  .../gcc.target/aarch64/sve/pcs/return_4_256.c |   2 -
>  .../gcc.target/aarch64/sve/pcs/return_4_512.c |   2 -
>  .../gcc.target/aarch64/sve/pcs/return_5.c     |   2 -
>  .../aarch64/sve/pcs/return_5_1024.c           |   2 -
>  .../gcc.target/aarch64/sve/pcs/return_5_128.c |   2 -
>  .../aarch64/sve/pcs/return_5_2048.c           |   2 -
>  .../gcc.target/aarch64/sve/pcs/return_5_256.c |   2 -
>  .../gcc.target/aarch64/sve/pcs/return_5_512.c |   2 -
>  16 files changed, 814 insertions(+), 24 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/svlast.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/svlast128_run.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/svlast256_run.c
>
> diff --git a/gcc/config/aarch64/aarch64-sve-builtins-base.cc b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
> index cd9cace3c9b..db2b4dcaac9 100644
> --- a/gcc/config/aarch64/aarch64-sve-builtins-base.cc
> +++ b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
> @@ -1056,6 +1056,130 @@ class svlast_impl : public quiet<function_base>
>  public:
>    CONSTEXPR svlast_impl (int unspec) : m_unspec (unspec) {}
>
> +  bool is_lasta () const { return m_unspec == UNSPEC_LASTA; }
> +  bool is_lastb () const { return m_unspec == UNSPEC_LASTB; }
> +
> +  bool vect_all_same (tree v , int step) const

Nit: stray space after "v".

> +  {
> +    int i;
> +    int nelts = vector_cst_encoded_nelts (v);
> +    int first_el = 0;
> +
> +    for (i = first_el; i < nelts; i += step)
> +      if (VECTOR_CST_ENCODED_ELT (v, i) != VECTOR_CST_ENCODED_ELT (v, first_el))

I think this should use !operand_equal_p (..., ..., 0).
Re: [PATCH] [PR96339] AArch64: Optimise svlast[ab]
Tejas Belagod writes:
> From: Tejas Belagod
>
> This PR optimizes an SVE intrinsics sequence such as
>   svlasta (svptrue_pat_b8 (SV_VL1), x)
> where a scalar is selected based on a constant predicate and a variable
> vector. This sequence is optimized to return the corresponding element
> of a NEON vector. For example,
>   svlasta (svptrue_pat_b8 (SV_VL1), x)
> returns
>   umov	w0, v0.b[1]
> Likewise,
>   svlastb (svptrue_pat_b8 (SV_VL1), x)
> returns
>   umov	w0, v0.b[0]
> This optimization only works provided the constant predicate maps to a
> range that is within the bounds of a 128-bit NEON register.
>
> gcc/ChangeLog:
>
>	PR target/96339
>	* config/aarch64/aarch64-sve-builtins-base.cc (svlast_impl::fold):
>	Fold sve calls that have a constant input predicate vector.
>	(svlast_impl::is_lasta): Query to check if intrinsic is svlasta.
>	(svlast_impl::is_lastb): Query to check if intrinsic is svlastb.
>	(svlast_impl::vect_all_same): Check if all vector elements are equal.
>
> gcc/testsuite/ChangeLog:
>
>	PR target/96339
>	* gcc.target/aarch64/sve/acle/general-c/svlast.c: New.
>	* gcc.target/aarch64/sve/acle/general-c/svlast128_run.c: New.
>	* gcc.target/aarch64/sve/acle/general-c/svlast256_run.c: New.
>	* gcc.target/aarch64/sve/pcs/return_4.c (caller_bf16): Fix asm
>	to expect optimized code for function body.
>	* gcc.target/aarch64/sve/pcs/return_4_128.c (caller_bf16): Likewise.
>	* gcc.target/aarch64/sve/pcs/return_4_256.c (caller_bf16): Likewise.
>	* gcc.target/aarch64/sve/pcs/return_4_512.c (caller_bf16): Likewise.
>	* gcc.target/aarch64/sve/pcs/return_4_1024.c (caller_bf16): Likewise.
>	* gcc.target/aarch64/sve/pcs/return_4_2048.c (caller_bf16): Likewise.
>	* gcc.target/aarch64/sve/pcs/return_5.c (caller_bf16): Likewise.
>	* gcc.target/aarch64/sve/pcs/return_5_128.c (caller_bf16): Likewise.
>	* gcc.target/aarch64/sve/pcs/return_5_256.c (caller_bf16): Likewise.
>	* gcc.target/aarch64/sve/pcs/return_5_512.c (caller_bf16): Likewise.
>	* gcc.target/aarch64/sve/pcs/return_5_1024.c (caller_bf16): Likewise.
>	* gcc.target/aarch64/sve/pcs/return_5_2048.c (caller_bf16): Likewise.
>
> ---
>  .../aarch64/aarch64-sve-builtins-base.cc      | 124 +++
>  .../aarch64/sve/acle/general-c/svlast.c       |  63
>  .../sve/acle/general-c/svlast128_run.c        | 313 +
>  .../sve/acle/general-c/svlast256_run.c        | 314 ++
>  .../gcc.target/aarch64/sve/pcs/return_4.c     |   2 -
>  .../aarch64/sve/pcs/return_4_1024.c           |   2 -
>  .../gcc.target/aarch64/sve/pcs/return_4_128.c |   2 -
>  .../aarch64/sve/pcs/return_4_2048.c           |   2 -
>  .../gcc.target/aarch64/sve/pcs/return_4_256.c |   2 -
>  .../gcc.target/aarch64/sve/pcs/return_4_512.c |   2 -
>  .../gcc.target/aarch64/sve/pcs/return_5.c     |   2 -
>  .../aarch64/sve/pcs/return_5_1024.c           |   2 -
>  .../gcc.target/aarch64/sve/pcs/return_5_128.c |   2 -
>  .../aarch64/sve/pcs/return_5_2048.c           |   2 -
>  .../gcc.target/aarch64/sve/pcs/return_5_256.c |   2 -
>  .../gcc.target/aarch64/sve/pcs/return_5_512.c |   2 -
>  16 files changed, 814 insertions(+), 24 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/svlast.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/svlast128_run.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/svlast256_run.c
>
> diff --git a/gcc/config/aarch64/aarch64-sve-builtins-base.cc b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
> index cd9cace3c9b..db2b4dcaac9 100644
> --- a/gcc/config/aarch64/aarch64-sve-builtins-base.cc
> +++ b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
> @@ -1056,6 +1056,130 @@ class svlast_impl : public quiet<function_base>
>  public:
>    CONSTEXPR svlast_impl (int unspec) : m_unspec (unspec) {}
>
> +  bool is_lasta () const { return m_unspec == UNSPEC_LASTA; }
> +  bool is_lastb () const { return m_unspec == UNSPEC_LASTB; }
> +
> +  bool vect_all_same (tree v , int step) const

Nit: stray space after "v".

> +  {
> +    int i;
> +    int nelts = vector_cst_encoded_nelts (v);
> +    int first_el = 0;
> +
> +    for (i = first_el; i < nelts; i += step)
> +      if (VECTOR_CST_ENCODED_ELT (v, i) != VECTOR_CST_ENCODED_ELT (v, first_el))

I think this should use !operand_equal_p (..., ..., 0).

> +	return false;
> +
> +    return true;
> +  }
> +
> +  /* Fold a svlast{a/b} call with constant predicate to a BIT_FIELD_REF.
> +     BIT_FIELD_REF lowers to a NEON element extract, so we have to make sure
> +     the index of the element being accessed is in the range of a NEON vector
> +     width.  */

s/NEON/Advanced SIMD/. Same in later comments.

> +  gimple *fold (gimple_folder & f) const override
> +  {
> +    tree pred = gimple_call_arg (f.call, 0);
> +    tree val = gimple_call_arg (f.call, 1);
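For reference, the BIT_FIELD_REF built by the two early-out folds in
this patch is fully described by a (size, offset) pair in bits. A
standalone sketch of that arithmetic (the helper names are hypothetical;
BITS_PER_UNIT is 8 on AArch64):

#include <cassert>

/* Bit positions of the BIT_FIELD_REF built by the folds, modelled as
   plain arithmetic.  STEP is the element size in bytes
   (f.type_suffix (0).element_bytes in the patch).  */
const int BITS_PER_UNIT = 8;

struct bit_field { int size_bits; int offset_bits; };

/* LASTA with an all-true or all-false predicate: element 0.  */
static bit_field lasta_fold (int step)
{
  return { step * BITS_PER_UNIT, 0 };
}

/* LASTB with an all-false predicate when the SVE VL is known to be
   128 bits: the highest-numbered element of the 16-byte vector.  */
static bit_field lastb_all_false_fold (int step)
{
  return { step * BITS_PER_UNIT, (16 - step) * BITS_PER_UNIT };
}

int main ()
{
  /* 1-byte elements: lasta -> v0.b[0]; all-false lastb -> v0.b[15].  */
  assert (lasta_fold (1).offset_bits == 0);
  assert (lastb_all_false_fold (1).offset_bits == 15 * 8);
  /* 4-byte elements: all-false lastb -> v0.s[3] (bit offset 96).  */
  assert (lastb_all_false_fold (4).offset_bits == 12 * 8);
  return 0;
}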
Re: [PATCH] [PR96339] AArch64: Optimise svlast[ab]
[Ping]

From: Tejas Belagod
Date: Thursday, March 16, 2023 at 5:09 PM
To: gcc-patches@gcc.gnu.org
Cc: Tejas Belagod, Richard Sandiford
Subject: [PATCH] [PR96339] AArch64: Optimise svlast[ab]

From: Tejas Belagod

This PR optimizes an SVE intrinsics sequence such as
  svlasta (svptrue_pat_b8 (SV_VL1), x)
where a scalar is selected based on a constant predicate and a variable
vector. This sequence is optimized to return the corresponding element
of a NEON vector. For example,
  svlasta (svptrue_pat_b8 (SV_VL1), x)
returns
  umov	w0, v0.b[1]
Likewise,
  svlastb (svptrue_pat_b8 (SV_VL1), x)
returns
  umov	w0, v0.b[0]
This optimization only works provided the constant predicate maps to a
range that is within the bounds of a 128-bit NEON register.

gcc/ChangeLog:

	PR target/96339
	* config/aarch64/aarch64-sve-builtins-base.cc (svlast_impl::fold):
	Fold sve calls that have a constant input predicate vector.
	(svlast_impl::is_lasta): Query to check if intrinsic is svlasta.
	(svlast_impl::is_lastb): Query to check if intrinsic is svlastb.
	(svlast_impl::vect_all_same): Check if all vector elements are equal.

gcc/testsuite/ChangeLog:

	PR target/96339
	* gcc.target/aarch64/sve/acle/general-c/svlast.c: New.
	* gcc.target/aarch64/sve/acle/general-c/svlast128_run.c: New.
	* gcc.target/aarch64/sve/acle/general-c/svlast256_run.c: New.
	* gcc.target/aarch64/sve/pcs/return_4.c (caller_bf16): Fix asm
	to expect optimized code for function body.
	* gcc.target/aarch64/sve/pcs/return_4_128.c (caller_bf16): Likewise.
	* gcc.target/aarch64/sve/pcs/return_4_256.c (caller_bf16): Likewise.
	* gcc.target/aarch64/sve/pcs/return_4_512.c (caller_bf16): Likewise.
	* gcc.target/aarch64/sve/pcs/return_4_1024.c (caller_bf16): Likewise.
	* gcc.target/aarch64/sve/pcs/return_4_2048.c (caller_bf16): Likewise.
	* gcc.target/aarch64/sve/pcs/return_5.c (caller_bf16): Likewise.
	* gcc.target/aarch64/sve/pcs/return_5_128.c (caller_bf16): Likewise.
	* gcc.target/aarch64/sve/pcs/return_5_256.c (caller_bf16): Likewise.
	* gcc.target/aarch64/sve/pcs/return_5_512.c (caller_bf16): Likewise.
	* gcc.target/aarch64/sve/pcs/return_5_1024.c (caller_bf16): Likewise.
	* gcc.target/aarch64/sve/pcs/return_5_2048.c (caller_bf16): Likewise.

---
 .../aarch64/aarch64-sve-builtins-base.cc      | 124 +++
 .../aarch64/sve/acle/general-c/svlast.c       |  63
 .../sve/acle/general-c/svlast128_run.c        | 313 +
 .../sve/acle/general-c/svlast256_run.c        | 314 ++
 .../gcc.target/aarch64/sve/pcs/return_4.c     |   2 -
 .../aarch64/sve/pcs/return_4_1024.c           |   2 -
 .../gcc.target/aarch64/sve/pcs/return_4_128.c |   2 -
 .../aarch64/sve/pcs/return_4_2048.c           |   2 -
 .../gcc.target/aarch64/sve/pcs/return_4_256.c |   2 -
 .../gcc.target/aarch64/sve/pcs/return_4_512.c |   2 -
 .../gcc.target/aarch64/sve/pcs/return_5.c     |   2 -
 .../aarch64/sve/pcs/return_5_1024.c           |   2 -
 .../gcc.target/aarch64/sve/pcs/return_5_128.c |   2 -
 .../aarch64/sve/pcs/return_5_2048.c           |   2 -
 .../gcc.target/aarch64/sve/pcs/return_5_256.c |   2 -
 .../gcc.target/aarch64/sve/pcs/return_5_512.c |   2 -
 16 files changed, 814 insertions(+), 24 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/svlast.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/svlast128_run.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/svlast256_run.c

diff --git a/gcc/config/aarch64/aarch64-sve-builtins-base.cc b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
index cd9cace3c9b..db2b4dcaac9 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins-base.cc
+++ b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
@@ -1056,6 +1056,130 @@ class svlast_impl : public quiet<function_base>
 public:
   CONSTEXPR svlast_impl (int unspec) : m_unspec (unspec) {}

+  bool is_lasta () const { return m_unspec == UNSPEC_LASTA; }
+  bool is_lastb () const { return m_unspec == UNSPEC_LASTB; }
+
+  bool vect_all_same (tree v , int step) const
+  {
+    int i;
+    int nelts = vector_cst_encoded_nelts (v);
+    int first_el = 0;
+
+    for (i = first_el; i < nelts; i += step)
+      if (VECTOR_CST_ENCODED_ELT (v, i) != VECTOR_CST_ENCODED_ELT (v, first_el))
+	return false;
+
+    return true;
+  }
+
+  /* Fold a svlast{a/b} call with constant predicate to a BIT_FIELD_REF.
+     BIT_FIELD_REF lowers to a NEON element extract, so we have to make sure
+     the index of the element being accessed is in the range of a NEON vector
+     width.  */
+  gimple *fold (gimple_folder & f) const override
+  {
+    tree pred = gimple_call_arg (f.call, 0);
+    tree val = gimple_call_arg (f.call, 1);
+
+    if (TREE_CODE (pred) == VECTOR_CST)
+      {
+	HOST_WIDE_INT pos;
+	unsigned int const_vg;
[PATCH] [PR96339] AArch64: Optimise svlast[ab]
From: Tejas Belagod

This PR optimizes an SVE intrinsics sequence such as
  svlasta (svptrue_pat_b8 (SV_VL1), x)
where a scalar is selected based on a constant predicate and a variable
vector. This sequence is optimized to return the corresponding element
of a NEON vector. For example,
  svlasta (svptrue_pat_b8 (SV_VL1), x)
returns
  umov	w0, v0.b[1]
Likewise,
  svlastb (svptrue_pat_b8 (SV_VL1), x)
returns
  umov	w0, v0.b[0]
This optimization only works provided the constant predicate maps to a
range that is within the bounds of a 128-bit NEON register.

gcc/ChangeLog:

	PR target/96339
	* config/aarch64/aarch64-sve-builtins-base.cc (svlast_impl::fold):
	Fold sve calls that have a constant input predicate vector.
	(svlast_impl::is_lasta): Query to check if intrinsic is svlasta.
	(svlast_impl::is_lastb): Query to check if intrinsic is svlastb.
	(svlast_impl::vect_all_same): Check if all vector elements are equal.

gcc/testsuite/ChangeLog:

	PR target/96339
	* gcc.target/aarch64/sve/acle/general-c/svlast.c: New.
	* gcc.target/aarch64/sve/acle/general-c/svlast128_run.c: New.
	* gcc.target/aarch64/sve/acle/general-c/svlast256_run.c: New.
	* gcc.target/aarch64/sve/pcs/return_4.c (caller_bf16): Fix asm
	to expect optimized code for function body.
	* gcc.target/aarch64/sve/pcs/return_4_128.c (caller_bf16): Likewise.
	* gcc.target/aarch64/sve/pcs/return_4_256.c (caller_bf16): Likewise.
	* gcc.target/aarch64/sve/pcs/return_4_512.c (caller_bf16): Likewise.
	* gcc.target/aarch64/sve/pcs/return_4_1024.c (caller_bf16): Likewise.
	* gcc.target/aarch64/sve/pcs/return_4_2048.c (caller_bf16): Likewise.
	* gcc.target/aarch64/sve/pcs/return_5.c (caller_bf16): Likewise.
	* gcc.target/aarch64/sve/pcs/return_5_128.c (caller_bf16): Likewise.
	* gcc.target/aarch64/sve/pcs/return_5_256.c (caller_bf16): Likewise.
	* gcc.target/aarch64/sve/pcs/return_5_512.c (caller_bf16): Likewise.
	* gcc.target/aarch64/sve/pcs/return_5_1024.c (caller_bf16): Likewise.
	* gcc.target/aarch64/sve/pcs/return_5_2048.c (caller_bf16): Likewise.

---
 .../aarch64/aarch64-sve-builtins-base.cc      | 124 +++
 .../aarch64/sve/acle/general-c/svlast.c       |  63
 .../sve/acle/general-c/svlast128_run.c        | 313 +
 .../sve/acle/general-c/svlast256_run.c        | 314 ++
 .../gcc.target/aarch64/sve/pcs/return_4.c     |   2 -
 .../aarch64/sve/pcs/return_4_1024.c           |   2 -
 .../gcc.target/aarch64/sve/pcs/return_4_128.c |   2 -
 .../aarch64/sve/pcs/return_4_2048.c           |   2 -
 .../gcc.target/aarch64/sve/pcs/return_4_256.c |   2 -
 .../gcc.target/aarch64/sve/pcs/return_4_512.c |   2 -
 .../gcc.target/aarch64/sve/pcs/return_5.c     |   2 -
 .../aarch64/sve/pcs/return_5_1024.c           |   2 -
 .../gcc.target/aarch64/sve/pcs/return_5_128.c |   2 -
 .../aarch64/sve/pcs/return_5_2048.c           |   2 -
 .../gcc.target/aarch64/sve/pcs/return_5_256.c |   2 -
 .../gcc.target/aarch64/sve/pcs/return_5_512.c |   2 -
 16 files changed, 814 insertions(+), 24 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/svlast.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/svlast128_run.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/svlast256_run.c

diff --git a/gcc/config/aarch64/aarch64-sve-builtins-base.cc b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
index cd9cace3c9b..db2b4dcaac9 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins-base.cc
+++ b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
@@ -1056,6 +1056,130 @@ class svlast_impl : public quiet<function_base>
 public:
   CONSTEXPR svlast_impl (int unspec) : m_unspec (unspec) {}

+  bool is_lasta () const { return m_unspec == UNSPEC_LASTA; }
+  bool is_lastb () const { return m_unspec == UNSPEC_LASTB; }
+
+  bool vect_all_same (tree v , int step) const
+  {
+    int i;
+    int nelts = vector_cst_encoded_nelts (v);
+    int first_el = 0;
+
+    for (i = first_el; i < nelts; i += step)
+      if (VECTOR_CST_ENCODED_ELT (v, i) != VECTOR_CST_ENCODED_ELT (v, first_el))
+	return false;
+
+    return true;
+  }
+
+  /* Fold a svlast{a/b} call with constant predicate to a BIT_FIELD_REF.
+     BIT_FIELD_REF lowers to a NEON element extract, so we have to make sure
+     the index of the element being accessed is in the range of a NEON vector
+     width.  */
+  gimple *fold (gimple_folder & f) const override
+  {
+    tree pred = gimple_call_arg (f.call, 0);
+    tree val = gimple_call_arg (f.call, 1);
+
+    if (TREE_CODE (pred) == VECTOR_CST)
+      {
+	HOST_WIDE_INT pos;
+	unsigned int const_vg;
+	int i = 0;
+	int step = f.type_suffix (0).element_bytes;
+	int step_1 = gcd (step, VECTOR_CST_NPATTERNS (pred));
+	int npats = VECTOR_CST_NPATTERNS (pred);
+	unsigned HOST_WIDE_INT nelts = vector_cst_encoded_nelts (pred);
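To see the intended effect of the patch as a user, here is a small
hypothetical example (my own sketch, not one of the patch's new tests),
using the ACLE intrinsics from the description. Per the commit message,
with this patch each call should compile to a single umov lane extract
rather than an SVE lasta/lastb instruction when built with, e.g.,
-O2 -march=armv8.2-a+sve:

#include <stdint.h>
#include <arm_sve.h>

/* The predicate is a compile-time constant: SV_VL1 activates only
   element 0.  */
uint8_t
first_inactive_lane (svuint8_t x)
{
  /* LASTA returns the element after the last active one, so per the
     patch description this should become: umov w0, v0.b[1].  */
  return svlasta (svptrue_pat_b8 (SV_VL1), x);
}

uint8_t
last_active_lane (svuint8_t x)
{
  /* LASTB returns the last active element itself, so this should
     become: umov w0, v0.b[0].  */
  return svlastb (svptrue_pat_b8 (SV_VL1), x);
}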