On Sun, Nov 11, 2018 at 11:28 AM Tamar Christina
<tamar.christ...@arm.com> wrote:
>
> Hi All,
>
> This patch adds the expander support for autovectorization of complex
> number operations, such as complex addition with a rotation along the
> Argand plane.  It also adds support for complex FMA.
>
> The instructions are described in the ArmARM [1] and are available from 
> Armv8.3-a onwards.
>
> Concretely, this generates
>
> f90:
>         add     ip, r1, #15
>         add     r3, r0, #15
>         sub     r3, r3, r2
>         sub     ip, ip, r2
>         cmp     ip, #30
>         cmphi   r3, #30
>         add     r3, r0, #1600
>         bls     .L5
> .L3:
>         vld1.32 {q8}, [r0]!
>         vld1.32 {q9}, [r1]!
>         vcadd.f32       q8, q8, q9, #90
>         vst1.32 {q8}, [r2]!
>         cmp     r0, r3
>         bne     .L3
>         bx      lr
> .L5:
>         vld1.32 {d16}, [r0]!
>         vld1.32 {d17}, [r1]!
>         vcadd.f32       d16, d16, d17, #90
>         vst1.32 {d16}, [r2]!
>         cmp     r0, r3
>         bne     .L5
>         bx      lr
>
>
>
> now instead of
>
> f90:
>         add     ip, r1, #31
>         add     r3, r0, #31
>         sub     r3, r3, r2
>         sub     ip, ip, r2
>         cmp     ip, #62
>         cmphi   r3, #62
>         add     r3, r0, #1600
>         bls     .L2
> .L3:
>         vld2.32 {d20-d23}, [r0]!
>         vld2.32 {d24-d27}, [r1]!
>         cmp     r0, r3
>         vsub.f32        q8, q10, q13
>         vadd.f32        q9, q12, q11
>         vst2.32 {d16-d19}, [r2]!
>         bne     .L3
>         bx      lr
> .L2:
>         vldr    d19, .L10
> .L5:
>         vld1.32 {d16}, [r1]!
>         vld1.32 {d18}, [r0]!
>         vrev64.32       d16, d16
>         cmp     r0, r3
>         vsub.f32        d17, d18, d16
>         vadd.f32        d16, d16, d18
>         vswp    d16, d17
>         vtbl.8  d16, {d16, d17}, d19
>         vst1.32 {d16}, [r2]!
>         bne     .L5
>         bx      lr
> .L11:
>         .align  3
> .L10:
>         .byte   0
>         .byte   1
>         .byte   2
>         .byte   3
>         .byte   12
>         .byte   13
>         .byte   14
>         .byte   15
>
>
> For complex additions with a 90 degree rotation along the Argand plane.
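
For reference, here is a minimal sketch of the kind of source loop that
produces the vcadd.f32 sequence above.  It is reconstructed from the
assembly, so the element count of 200 and the exact signature are my
assumptions rather than the actual testcase:

  #include <complex.h>

  /* Assumed: 200 * sizeof (_Complex float) == 1600, matching the
     "add r3, r0, #1600" loop bound in the assembly above.  */
  #define N 200

  void
  f90 (_Complex float a[restrict N], _Complex float b[restrict N],
       _Complex float c[restrict N])
  {
    for (int i = 0; i < N; i++)
      /* Adding I * b[i] rotates b[i] by 90 degrees in the Argand plane:
         c[i] = (a_re - b_im) + (a_im + b_re) * I.  */
      c[i] = a[i] + b[i] * I;
  }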
>
> [1] 
> https://developer.arm.com/docs/ddi0487/latest/arm-architecture-reference-manual-armv8-for-armv8-a-architecture-profile
>
> Bootstrap and regtest on aarch64-none-linux-gnu, arm-none-gnueabihf and
> x86_64-pc-linux-gnu are still ongoing, but the previous patch showed no
> regressions.
>
> The instructions have also been tested on aarch64-none-elf and arm-none-eabi
> on an Armv8.3-a model with -march=Armv8.3-a+fp16, and all tests pass.
>
> Ok for trunk?

+;; The complex mla operations always need to expand to two instructions.
+;; The first operation does half the computation and the second does the
+;; remainder.  Because of this, expand early.
+(define_expand "fcmla<rot><mode>4"
+  [(set (match_operand:VF 0 "register_operand")
+       (plus:VF (match_operand:VF 1 "register_operand")
+                (unspec:VF [(match_operand:VF 2 "register_operand")
+                            (match_operand:VF 3 "register_operand")]
+                            VCMLA)))]
+  "TARGET_COMPLEX"
+{
+  emit_insn (gen_neon_vcmla<rotsplit1><mode> (operands[0], operands[1],
+                                             operands[2], operands[3]));
+  emit_insn (gen_neon_vcmla<rotsplit2><mode> (operands[0], operands[0],
+                                             operands[2], operands[3]));
+  DONE;
+})

What are the two halves?  Why hide this from the vectorizer if you go down to
all that detail and expose the rotation to it?
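
For context, my reading of the FCMLA semantics in the ArmARM (an assumption
on my part, not something stated in the patch) is that the two rotation
halves together form one full complex multiply-accumulate.  Per complex
lane, roughly:

  /* Scalar model of one complex lane, assuming rotsplit1 is #0 and
     rotsplit2 is #90 as in the expander above.  */
  static void
  fcmla_lane (float *acc_re, float *acc_im,
              float a_re, float a_im, float b_re, float b_im)
  {
    /* First half (rotation #0): only the real part of a contributes.  */
    *acc_re += a_re * b_re;
    *acc_im += a_re * b_im;

    /* Second half (rotation #90): only the imaginary part of a contributes.  */
    *acc_re += -a_im * b_im;
    *acc_im +=  a_im * b_re;

    /* Combined: acc += a * b, i.e. a complex multiply-accumulate.  */
  }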

+;; The vcadd and vcmla patterns are explicitly made UNSPECs due to the fact
+;; that their use needs to guarantee that the source vectors are contiguous.
+;; It would be wrong to describe the operation without being able to describe
+;; the permute that is also required, but even if that is done the permute
+;; would have been created as a LOAD_LANES, which means the values in the
+;; registers are in the wrong order.

Hmm, it's totally non-obvious to me how this relates to loads or what a
"non-contiguous" register would be.  That is, once you make this an unspec,
combine will never be able to synthesize this from intrinsics code that
doesn't use this form.

+(define_insn "neon_vcadd<rot><mode>"
+  [(set (match_operand:VF 0 "register_operand" "=w")
+       (unspec:VF [(match_operand:VF 1 "register_operand" "w")
+                   (match_operand:VF 2 "register_operand" "w")]
+                   VCADD))]
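
As far as I understand the ArmARM, VCADD operates on adjacent real/imaginary
lane pairs of interleaved complex data, which is presumably what the
"contiguous" wording is getting at.  A scalar sketch of the #90 form, my own
reconstruction rather than anything taken from the patch:

  /* d, a and b hold interleaved complex values: even lanes are real parts,
     odd lanes are imaginary parts (nlanes is even).  */
  static void
  vcadd_rot90 (float *d, const float *a, const float *b, int nlanes)
  {
    for (int i = 0; i < nlanes; i += 2)
      {
        d[i]     = a[i]     - b[i + 1];  /* re = a.re - b.im */
        d[i + 1] = a[i + 1] + b[i];      /* im = a.im + b.re */
      }
  }

A LOAD_LANES-style load (vld2, as in the "instead of" code above) would
instead put all real parts in one register and all imaginary parts in
another, which is not the layout the instruction consumes.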


> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
> 2018-11-11  Tamar Christina  <tamar.christ...@arm.com>
>
>         * config/arm/arm.c (arm_arch8_3, arm_arch8_4): New.
>         * config/arm/arm.h (TARGET_COMPLEX, arm_arch8_3, arm_arch8_4): New.
>         (arm_option_reconfigure_globals): Use them.
>         * config/arm/iterators.md (VDF, VQ_HSF): New.
>         (VCADD, VCMLA): New.
>         (VF_constraint, rot, rotsplit1, rotsplit2): Add V4HF and V8HF.
>         * config/arm/neon.md (neon_vcadd<rot><mode>, fcadd<rot><mode>3,
>         neon_vcmla<rot><mode>, fcmla<rot><mode>4): New.
>         * config/arm/unspecs.md (UNSPEC_VCADD90, UNSPEC_VCADD270,
>         UNSPEC_VCMLA, UNSPEC_VCMLA90, UNSPEC_VCMLA180, UNSPEC_VCMLA270): New.
>
> gcc/testsuite/ChangeLog:
>
> 2018-11-11  Tamar Christina  <tamar.christ...@arm.com>
>
>         * gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_1.c: Add Arm 
> support.
>         * gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_2.c: Likewise.
>         * gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_3.c: Likewise.
>         * gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_4.c: Likewise.
>         * gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_5.c: Likewise.
>         * gcc.target/aarch64/advsimd-intrinsics/vcadd-arrays_6.c: Likewise.
>         * gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_1.c: Likewise.
>         * gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_2.c: Likewise.
>         * gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_3.c: Likewise.
>         * gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_4.c: Likewise.
>         * gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_5.c: Likewise.
>         * gcc.target/aarch64/advsimd-intrinsics/vcadd-complex_6.c: Likewise.
>         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_1.c: Likewise.
>         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_180_1.c: 
> Likewise.
>         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_180_2.c: 
> Likewise.
>         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_180_3.c: 
> Likewise.
>         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_2.c: Likewise.
>         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_270_1.c: 
> Likewise.
>         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_270_2.c: 
> Likewise.
>         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_270_3.c: 
> Likewise.
>         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_3.c: Likewise.
>         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_90_1.c: 
> Likewise.
>         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_90_2.c: 
> Likewise.
>         * gcc.target/aarch64/advsimd-intrinsics/vcmla-complex_90_3.c: 
> Likewise.
>
> --
