Re: [PATCH] arm: Auto-vectorization for MVE and Neon: vhadd/vrhadd

Richard Sandiford via Gcc-patches Tue, 08 Jun 2021 04:50:24 -0700

Christophe Lyon <christophe.l...@linaro.org> writes:
> On Wed, 2 Jun 2021 at 20:19, Richard Sandiford
> <richard.sandif...@arm.com> wrote:
>>
>> Christophe Lyon <christophe.l...@linaro.org> writes:
>> > This patch adds support for auto-vectorization of average value
>> > computation using vhadd or vrhadd, for both MVE and Neon.
>> >
>> > The patch adds the needed [u]avg<mode>3_[floor|ceil] patterns to
>> > vec-common.md, I'm not sure how to factorize them without introducing
>> > an unspec iterator?
>>
>> Yeah, an int iterator would be one way, but I'm not sure it would
>> make things better given the differences in how Neon and MVE handle
>> their unspecs.
>>
>> > It also adds tests for 'floor' and for 'ceil', each for MVE and Neon.
>> >
>> > Vectorization works with 8-bit and 16 bit input/output vectors, but
>> > not with 32-bit ones because the vectorizer expects wider types
>> > availability for the intermediate values, but int32_t + int32_t does
>> > not involve wider types in the IR.
>>
>> Right.  Like you say, it's only valid to use V(R)HADD if, in the source
>> code, the addition and shift have a wider precision than the operands.
>> That happens naturally for 8-bit and 16-bit operands, since C arithmetic
>> promotes them to "int" first.  But for 32-bit operands, the C code needs
>> to do the addition and shift in 64 bits.  Doing them in 64 bits should
>> be fine for narrower operands too.
>>
>> So:
>>
>> > diff --git a/gcc/testsuite/gcc.target/arm/simd/mve-vhadd-1.c 
>> > b/gcc/testsuite/gcc.target/arm/simd/mve-vhadd-1.c
>> > new file mode 100644
>> > index 00000000000..40489ecc67d
>> > --- /dev/null
>> > +++ b/gcc/testsuite/gcc.target/arm/simd/mve-vhadd-1.c
>> > @@ -0,0 +1,31 @@
>> > +/* { dg-do compile } */
>> > +/* { dg-require-effective-target arm_v8_1m_mve_ok } */
>> > +/* { dg-add-options arm_v8_1m_mve } */
>> > +/* { dg-additional-options "-O3" } */
>> > +
>> > +#include <stdint.h>
>> > +
>> > +#define FUNC(SIGN, TYPE, BITS, OP, NAME)                             \
>> > +  void test_ ## NAME ##_ ## SIGN ## BITS (TYPE##BITS##_t * __restrict__ 
>> > dest, \
>> > +                                       TYPE##BITS##_t *a, TYPE##BITS##_t 
>> > *b) { \
>> > +    int i;                                                           \
>> > +    for (i=0; i < (128 / BITS); i++) {                                    
>> >    \
>> > +      dest[i] = (a[i] OP b[i]) >> 1;                                 \
>> > +    }                                                                     
>> >    \
>> > +}
>> > +
>>
>> …it should work if you make this "((int64_t) a[i] OP b[i]) >> 1".
>
> Indeed. However, this may not be obvious for end-users :-(
>
> I've updated my patch as attached: added the (int64_t) cast and
> removed the xfail clauses.
>
> OK for trunk?
>
> Thanks,
>
> Christophe
>
>>
>> > As noted in neon-vhadd-1.c, I couldn't write a test able to use Neon
>> > vectorization with 64-bit vectors: we default to
>> > -mvectorize-with-neon-quad, and attempts to use
>> > -mvectorize-with-neon-double resulted in much worse code, which this
>> > patch does not aim at improving.
>>
>> I guess this is because the MVE_2 mode iterators only include 128-bit types.
>> Leaving Neon double as future work sounds good though.
> Note that I am focusing on MVE enablement at the moment.


Right.  I meant “possible future work by someone somewhere”. :-)

>> And yeah, the code for V(R)HADD-equivalent operations is much worse when
>> V(R)HADD isn't available, since the compiler really does need to double
>> the precision of the operands, do double-precision addition,
>> do double-precision shifts, and then truncate back.  So this looks
>> like the expected behaviour.
>>
>> Thanks,
>> Richard
>
> From 493693b5c2f4e5fee7408062785930f723f2bd85 Mon Sep 17 00:00:00 2001
> From: Christophe Lyon <christophe.l...@linaro.org>
> Date: Thu, 27 May 2021 20:11:28 +0000
> Subject: [PATCH v2] arm: Auto-vectorization for MVE and Neon: vhadd/vrhadd
>
> This patch adds support for auto-vectorization of average value
> computation using vhadd or vrhadd, for both MVE and Neon.
>
> The patch adds the needed [u]avg<mode>3_[floor|ceil] patterns to
> vec-common.md, I'm not sure how to factorize them without introducing
> an unspec iterator?
>
> It also adds tests for 'floor' and for 'ceil', each for MVE and Neon.
>
> Vectorization works with 8-bit and 16 bit input/output vectors, but
> not with 32-bit ones because the vectorizer expects wider types
> availability for the intermediate values, but int32_t + int32_t does
> not involve wider types in the IR.
>
> As noted in neon-vhadd-1.c, I couldn't write a test able to use Neon
> vectorization with 64-bit vectors: we default to
> -mvectorize-with-neon-quad, and attempts to use
> -mvectorize-with-neon-double resulted in much worse code, which this
> patch does not aim at improving.

The above needs updating.

> 2021-05-31  Christophe Lyon  <christophe.l...@linaro.org>
>
>       gcc/
>       * gcc/config/arm/mve.md (mve_vhaddq_<supf><mode>): Prefix with '@'.
>       (@mve_vrhaddq_<supf><mode): Likewise.
>       * gcc/config/arm/neon.md (neon_v<r>hadd<sup><mode>): Likewise.
>       * config/arm/vec-common.md (avg<mode>3_floor, uavg<mode>3_floor)
>       (avg<mode>3_ceil", uavg<mode>3_ceil): New patterns.
>
>       gcc/testsuite/
>       * gcc.target/arm/simd/mve-vhadd-1.c: New test.
>       * gcc.target/arm/simd/mve-vhadd-2.c: New test.
>       * gcc.target/arm/simd/neon-vhadd-1.c: New test.
>       * gcc.target/arm/simd/neon-vhadd-2.c: New test.
> ---
>  gcc/config/arm/mve.md                         |  4 +-
>  gcc/config/arm/neon.md                        |  2 +-
>  gcc/config/arm/vec-common.md                  | 60 ++++++++++++
>  .../gcc.target/arm/simd/mve-vhadd-1.c         | 31 +++++++
>  .../gcc.target/arm/simd/mve-vhadd-2.c         | 33 +++++++
>  .../gcc.target/arm/simd/neon-vhadd-1.c        | 93 +++++++++++++++++++
>  .../gcc.target/arm/simd/neon-vhadd-2.c        | 33 +++++++
>  7 files changed, 253 insertions(+), 3 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/arm/simd/mve-vhadd-1.c
>  create mode 100644 gcc/testsuite/gcc.target/arm/simd/mve-vhadd-2.c
>  create mode 100644 gcc/testsuite/gcc.target/arm/simd/neon-vhadd-1.c
>  create mode 100644 gcc/testsuite/gcc.target/arm/simd/neon-vhadd-2.c
>
> diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
> index 0bfa6a91d55..04aa612331a 100644
> --- a/gcc/config/arm/mve.md
> +++ b/gcc/config/arm/mve.md
> @@ -1030,7 +1030,7 @@ (define_insn "mve_vhaddq_n_<supf><mode>"
>  ;;
>  ;; [vhaddq_s, vhaddq_u])
>  ;;
> -(define_insn "mve_vhaddq_<supf><mode>"
> +(define_insn "@mve_vhaddq_<supf><mode>"
>    [
>     (set (match_operand:MVE_2 0 "s_register_operand" "=w")
>       (unspec:MVE_2 [(match_operand:MVE_2 1 "s_register_operand" "w")
> @@ -1652,7 +1652,7 @@ (define_insn "mve_vqsubq_<supf><mode>"
>  ;;
>  ;; [vrhaddq_s, vrhaddq_u])
>  ;;
> -(define_insn "mve_vrhaddq_<supf><mode>"
> +(define_insn "@mve_vrhaddq_<supf><mode>"
>    [
>     (set (match_operand:MVE_2 0 "s_register_operand" "=w")
>       (unspec:MVE_2 [(match_operand:MVE_2 1 "s_register_operand" "w")
> diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md
> index 077c62ffd20..18571d819eb 100644
> --- a/gcc/config/arm/neon.md
> +++ b/gcc/config/arm/neon.md
> @@ -1488,7 +1488,7 @@ (define_insn "neon_vaddw<sup><mode>"
>  
>  ; vhadd and vrhadd.
>  
> -(define_insn "neon_v<r>hadd<sup><mode>"
> +(define_insn "@neon_v<r>hadd<sup><mode>"
>    [(set (match_operand:VDQIW 0 "s_register_operand" "=w")
>          (unspec:VDQIW [(match_operand:VDQIW 1 "s_register_operand" "w")
>                      (match_operand:VDQIW 2 "s_register_operand" "w")]
> diff --git a/gcc/config/arm/vec-common.md b/gcc/config/arm/vec-common.md
> index 80b273229f5..2779c1a8aaa 100644
> --- a/gcc/config/arm/vec-common.md
> +++ b/gcc/config/arm/vec-common.md
> @@ -565,3 +565,63 @@ (define_expand "reduc_plus_scal_<mode>"
>  
>    DONE;
>  })
> +
> +(define_expand "avg<mode>3_floor"
> +  [(match_operand:MVE_2 0 "s_register_operand")
> +   (match_operand:MVE_2 1 "s_register_operand")
> +   (match_operand:MVE_2 2 "s_register_operand")]
> +  "ARM_HAVE_<MODE>_ARITH"
> +{
> +  if (TARGET_HAVE_MVE)
> +    emit_insn (gen_mve_vhaddq (VHADDQ_S, <MODE>mode,
> +                            operands[0], operands[1], operands[2]));
> +  else
> +    emit_insn (gen_neon_vhadd (UNSPEC_VHADD_S, UNSPEC_VHADD_S, <MODE>mode,
> +                            operands[0], operands[1], operands[2]));
> +  DONE;
> +})
> +
> +(define_expand "uavg<mode>3_floor"
> +  [(match_operand:MVE_2 0 "s_register_operand")
> +   (match_operand:MVE_2 1 "s_register_operand")
> +   (match_operand:MVE_2 2 "s_register_operand")]
> +  "ARM_HAVE_<MODE>_ARITH"
> +{
> +  if (TARGET_HAVE_MVE)
> +    emit_insn (gen_mve_vhaddq (VHADDQ_U, <MODE>mode,
> +                            operands[0], operands[1], operands[2]));
> +  else
> +    emit_insn (gen_neon_vhadd (UNSPEC_VHADD_U, UNSPEC_VHADD_U, <MODE>mode,
> +                            operands[0], operands[1], operands[2]));
> +  DONE;
> +})
> +
> +(define_expand "avg<mode>3_ceil"
> +  [(match_operand:MVE_2 0 "s_register_operand")
> +   (match_operand:MVE_2 1 "s_register_operand")
> +   (match_operand:MVE_2 2 "s_register_operand")]
> +  "ARM_HAVE_<MODE>_ARITH"
> +{
> +  if (TARGET_HAVE_MVE)
> +    emit_insn (gen_mve_vrhaddq (VRHADDQ_S, <MODE>mode,
> +                             operands[0], operands[1], operands[2]));
> +  else
> +    emit_insn (gen_neon_vhadd (UNSPEC_VRHADD_S, UNSPEC_VRHADD_S, <MODE>mode,
> +                            operands[0], operands[1], operands[2]));
> +  DONE;
> +})
> +
> +(define_expand "uavg<mode>3_ceil"
> +  [(match_operand:MVE_2 0 "s_register_operand")
> +   (match_operand:MVE_2 1 "s_register_operand")
> +   (match_operand:MVE_2 2 "s_register_operand")]
> +  "ARM_HAVE_<MODE>_ARITH"
> +{
> +  if (TARGET_HAVE_MVE)
> +    emit_insn (gen_mve_vrhaddq (VRHADDQ_U, <MODE>mode,
> +                             operands[0], operands[1], operands[2]));
> +  else
> +    emit_insn (gen_neon_vhadd (UNSPEC_VRHADD_U, UNSPEC_VRHADD_U, <MODE>mode,
> +                            operands[0], operands[1], operands[2]));
> +  DONE;
> +})
> diff --git a/gcc/testsuite/gcc.target/arm/simd/mve-vhadd-1.c 
> b/gcc/testsuite/gcc.target/arm/simd/mve-vhadd-1.c
> new file mode 100644
> index 00000000000..19d5f5aa44f
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/arm/simd/mve-vhadd-1.c
> @@ -0,0 +1,31 @@
> +/* { dg-do compile } */
> +/* { dg-require-effective-target arm_v8_1m_mve_ok } */
> +/* { dg-add-options arm_v8_1m_mve } */
> +/* { dg-additional-options "-O3" } */
> +
> +#include <stdint.h>
> +
> +/* We force a cast to int64_t to enable the vectorizer when dealing with 
> 32-bit
> +   inputs.  */
> +#define FUNC(SIGN, TYPE, BITS, OP, NAME)                             \
> +  void test_ ## NAME ##_ ## SIGN ## BITS (TYPE##BITS##_t * __restrict__ 
> dest, \
> +                                       TYPE##BITS##_t *a, TYPE##BITS##_t *b) 
> { \
> +    int i;                                                           \
> +    for (i=0; i < (128 / BITS); i++) {                                       
> \
> +      dest[i] = ((int64_t)a[i] OP b[i]) >> 1;                                
> \
> +    }                                                                        
> \
> +}
> +
> +FUNC(s, int, 32, +, vhadd)
> +FUNC(u, uint, 32, +, vhadd)
> +FUNC(s, int, 16, +, vhadd)
> +FUNC(u, uint, 16, +, vhadd)
> +FUNC(s, int, 8, +, vhadd)
> +FUNC(u, uint, 8, +, vhadd)
> +
> +/* { dg-final { scan-assembler-times {vhadd\.s32\tq[0-9]+, q[0-9]+, q[0-9]+} 
> 1 } } */
> +/* { dg-final { scan-assembler-times {vhadd\.u32\tq[0-9]+, q[0-9]+, q[0-9]+} 
> 1 } } */
> +/* { dg-final { scan-assembler-times {vhadd\.s16\tq[0-9]+, q[0-9]+, q[0-9]+} 
> 1 } } */
> +/* { dg-final { scan-assembler-times {vhadd\.u16\tq[0-9]+, q[0-9]+, q[0-9]+} 
> 1 } } */
> +/* { dg-final { scan-assembler-times {vhadd\.s8\tq[0-9]+, q[0-9]+, q[0-9]+} 
> 1 } } */
> +/* { dg-final { scan-assembler-times {vhadd\.u8\tq[0-9]+, q[0-9]+, q[0-9]+} 
> 1 } } */
> diff --git a/gcc/testsuite/gcc.target/arm/simd/mve-vhadd-2.c 
> b/gcc/testsuite/gcc.target/arm/simd/mve-vhadd-2.c
> new file mode 100644
> index 00000000000..b19d4253b54
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/arm/simd/mve-vhadd-2.c
> @@ -0,0 +1,33 @@
> +/* { dg-do compile } */
> +/* { dg-require-effective-target arm_v8_1m_mve_ok } */
> +/* { dg-add-options arm_v8_1m_mve } */
> +/* { dg-additional-options "-O3" } */
> +
> +#include <stdint.h>
> +
> +/* We force a cast to int64_t to enable the vectorizer when dealing with 
> 32-bit
> +   inputs.  */
> +#define FUNC(SIGN, TYPE, BITS, OP, NAME)                             \
> +  void test_ ## NAME ##_ ## SIGN ## BITS (TYPE##BITS##_t * __restrict__ 
> dest, \
> +                                       TYPE##BITS##_t *a, TYPE##BITS##_t *b) 
> { \
> +    int i;                                                           \
> +    for (i=0; i < (128 / BITS); i++) {                                       
> \
> +      dest[i] = ((int64_t)a[i] OP b[i] + 1) >> 1;                    \
> +    }                                                                        
> \
> +}
> +
> +FUNC(s, int, 32, +, vrhadd)
> +FUNC(u, uint, 32, +, vrhadd)
> +FUNC(s, int, 16, +, vrhadd)
> +FUNC(u, uint, 16, +, vrhadd)
> +FUNC(s, int, 8, +, vrhadd)
> +FUNC(u, uint, 8, +, vrhadd)
> +
> +/* We cannot make use of the 32 bits version because the vectorizer needs an
> +   extended precision for the intermediate computations.  */

This comment no longer applies.

> +/* { dg-final { scan-assembler-times {vrhadd\.s32\tq[0-9]+, q[0-9]+, 
> q[0-9]+} 1 } } */
> +/* { dg-final { scan-assembler-times {vrhadd\.u32\tq[0-9]+, q[0-9]+, 
> q[0-9]+} 1 } } */
> +/* { dg-final { scan-assembler-times {vrhadd\.s16\tq[0-9]+, q[0-9]+, 
> q[0-9]+} 1 } } */
> +/* { dg-final { scan-assembler-times {vrhadd\.u16\tq[0-9]+, q[0-9]+, 
> q[0-9]+} 1 } } */
> +/* { dg-final { scan-assembler-times {vrhadd\.s8\tq[0-9]+, q[0-9]+, q[0-9]+} 
> 1 } } */
> +/* { dg-final { scan-assembler-times {vrhadd\.u8\tq[0-9]+, q[0-9]+, q[0-9]+} 
> 1 } } */
> diff --git a/gcc/testsuite/gcc.target/arm/simd/neon-vhadd-1.c 
> b/gcc/testsuite/gcc.target/arm/simd/neon-vhadd-1.c
> new file mode 100644
> index 00000000000..cd00ca81c6d
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/arm/simd/neon-vhadd-1.c
> @@ -0,0 +1,93 @@
> +/* { dg-do compile } */
> +/* { dg-require-effective-target arm_neon_ok } */
> +/* { dg-add-options arm_neon } */
> +/* { dg-additional-options "-O3" } */
> +
> +#include <stdint.h>
> +
> +/* Since we default to -mvectorize-with-neon-quad, use enough iterations so 
> that
> +   we can vectorize using 128-bit vectors.  */
> +/* For instance with int16_t:
> +   8 iterations and -mvectorize-with-neon-quad, we generate:
> +test_vhadd_s16:
> +        vld1.16 {q8}, [r1]
> +        vld1.16 {q9}, [r2]
> +        vhadd.s16       q8, q8, q9
> +        vst1.16 {q8}, [r0]
> +        bx      lr
> +
> +   8 iterations and -mvectorize-with-neon-double, we generate:
> +test_vhadd_s16:
> +        vld1.16 {d24}, [r1]!
> +        vld1.16 {d26}, [r2]!
> +        vld1.16 {d22}, [r1]
> +        vmovl.s16 q12, d24
> +        vld1.16 {d20}, [r2]
> +        vmovl.s16 q13, d26
> +        vmovl.s16 q11, d22
> +        vmovl.s16 q10, d20
> +        vadd.i32        d28, d26, d24
> +        vadd.i32        d24, d27, d25
> +        vadd.i32        d25, d22, d20
> +        vadd.i32        d20, d23, d21
> +        vshr.s32        d18, d28, #1
> +        vshr.s32        d19, d24, #1
> +        vshr.s32        d16, d25, #1
> +        vshr.s32        d17, d20, #1
> +        vmovn.i32       d18, q9
> +        vmovn.i32       d16, q8
> +        vst1.16 {d18}, [r0]!
> +        vst1.16 {d16}, [r0]
> +        bx      lr
> +
> +  Adding a cast to avoid integer promotion:
> +  dest[i] = (int16_t)(a[i] + b[i]) >> 1
> +
> +  8 iterations and -mvectorize-with-neon-quad, we generate:
> +test_vhadd_s16:
> +        vld1.16 {q8}, [r2]
> +        vld1.16 {q9}, [r1]
> +        vadd.i16        q8, q8, q9
> +        vshr.s16        q8, q8, #1
> +        vst1.16 {q8}, [r0]
> +        bx      lr
> +
> +  8 iterations and -mvectorize-with-neon-double, we generate:
> +test_vhadd_s16:
> +        vld1.16 {d17}, [r1]!
> +        vld1.16 {d19}, [r2]!
> +        vld1.16 {d18}, [r1]
> +        vld1.16 {d16}, [r2]
> +        vadd.i16        d17, d17, d19
> +        vadd.i16        d16, d16, d18
> +        vshr.s16        d17, d17, #1
> +        vshr.s16        d16, d16, #1
> +        vst1.16 {d17}, [r0]!
> +        vst1.16 {d16}, [r0]
> +        bx      lr
> +
> + */

I think we should remove this and say that, at the moment, we've only
implemented the optabs for 128-bit vectors, so that's all that we
test here.

OK with those changes, thanks.

Richard

> +/* We force a cast to int64_t to enable the vectorizer when dealing with 
> 32-bit
> +   inputs.  */
> +#define FUNC(SIGN, TYPE, BITS, OP, NAME)                             \
> +  void test_ ## NAME ##_ ## SIGN ## BITS (TYPE##BITS##_t * __restrict__ 
> dest, \
> +                                       TYPE##BITS##_t *a, TYPE##BITS##_t *b) 
> { \
> +    int i;                                                           \
> +    for (i=0; i < (128 / BITS); i++) {                                       
> \
> +      dest[i] = ((int64_t)a[i] OP b[i]) >> 1;                                
> \
> +    }                                                                        
> \
> +}
> +
> +FUNC(s, int, 32, +, vhadd)
> +FUNC(u, uint, 32, +, vhadd)
> +FUNC(s, int, 16, +, vhadd)
> +FUNC(u, uint, 16, +, vhadd)
> +FUNC(s, int, 8, +, vhadd)
> +FUNC(u, uint, 8, +, vhadd)
> +  
> +/* { dg-final { scan-assembler-times {vhadd\.s32\tq[0-9]+, q[0-9]+, q[0-9]+} 
> 1 } } */
> +/* { dg-final { scan-assembler-times {vhadd\.u32\tq[0-9]+, q[0-9]+, q[0-9]+} 
> 1 } } */
> +/* { dg-final { scan-assembler-times {vhadd\.s16\tq[0-9]+, q[0-9]+, q[0-9]+} 
> 1 } } */
> +/* { dg-final { scan-assembler-times {vhadd\.u16\tq[0-9]+, q[0-9]+, q[0-9]+} 
> 1 } } */
> +/* { dg-final { scan-assembler-times {vhadd\.s8\tq[0-9]+, q[0-9]+, q[0-9]+} 
> 1 } } */
> +/* { dg-final { scan-assembler-times {vhadd\.u8\tq[0-9]+, q[0-9]+, q[0-9]+} 
> 1 } } */
> diff --git a/gcc/testsuite/gcc.target/arm/simd/neon-vhadd-2.c 
> b/gcc/testsuite/gcc.target/arm/simd/neon-vhadd-2.c
> new file mode 100644
> index 00000000000..f2692542f9b
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/arm/simd/neon-vhadd-2.c
> @@ -0,0 +1,33 @@
> +/* { dg-do compile } */
> +/* { dg-require-effective-target arm_neon_ok } */
> +/* { dg-add-options arm_neon } */
> +/* { dg-additional-options "-O3" } */
> +
> +#include <stdint.h>
> +
> +/* Since we default to -mvectorize-with-neon-quad, use enough iterations so 
> that
> +   we can vectorize using 128-bit vectors.  */
> +/* We force a cast to int64_t to enable the vectorizer when dealing with 
> 32-bit
> +   inputs.  */
> +#define FUNC(SIGN, TYPE, BITS, OP, NAME)                             \
> +  void test_ ## NAME ##_ ## SIGN ## BITS (TYPE##BITS##_t * __restrict__ 
> dest, \
> +                                       TYPE##BITS##_t *a, TYPE##BITS##_t *b) 
> { \
> +    int i;                                                           \
> +    for (i=0; i < (128 / BITS); i++) {                                       
> \
> +      dest[i] = ((int64_t)a[i] OP b[i] + 1) >> 1;                    \
> +    }                                                                        
> \
> +}
> +
> +FUNC(s, int, 32, +, vrhadd)
> +FUNC(u, uint, 32, +, vrhadd)
> +FUNC(s, int, 16, +, vrhadd)
> +FUNC(u, uint, 16, +, vrhadd)
> +FUNC(s, int, 8, +, vrhadd)
> +FUNC(u, uint, 8, +, vrhadd)
> +
> +/* { dg-final { scan-assembler-times {vrhadd\.s32\tq[0-9]+, q[0-9]+, 
> q[0-9]+} 1 } } */
> +/* { dg-final { scan-assembler-times {vrhadd\.u32\tq[0-9]+, q[0-9]+, 
> q[0-9]+} 1 } } */
> +/* { dg-final { scan-assembler-times {vrhadd\.s16\tq[0-9]+, q[0-9]+, 
> q[0-9]+} 1 } } */
> +/* { dg-final { scan-assembler-times {vrhadd\.u16\tq[0-9]+, q[0-9]+, 
> q[0-9]+} 1 } } */
> +/* { dg-final { scan-assembler-times {vrhadd\.s8\tq[0-9]+, q[0-9]+, q[0-9]+} 
> 1 } } */
> +/* { dg-final { scan-assembler-times {vrhadd\.u8\tq[0-9]+, q[0-9]+, q[0-9]+} 
> 1 } } */

Re: [PATCH] arm: Auto-vectorization for MVE and Neon: vhadd/vrhadd

Reply via email to