Hi,

As subject, this patch uses __builtin_memcpy to copy vector structures
instead of using a union - or constructing a new opaque structure one
vector at a time - in each of the vst[234][q] and vst1[q]_x[234] bfloat
Neon intrinsics in arm_neon.h.
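
For readers who have not seen the pattern, the following is a minimal, portable sketch of the three ways of filling an opaque structure that the paragraph above contrasts. It is not the arm_neon.h code: vec4_t, vec4x2_t, opaque2_t, set_vec and the pack_* functions are hypothetical stand-ins (roughly playing the roles of bfloat16x4_t, bfloat16x4x2_t and __builtin_aarch64_simd_oi), and it assumes a compiler that provides __builtin_memcpy and __builtin_memcmp, such as GCC or Clang.

```c
#include <stdint.h>

/* Hypothetical stand-ins for illustration only; these are NOT the
   types used in arm_neon.h.  */
typedef struct { uint16_t lane[4]; } vec4_t;    /* one 64-bit vector */
typedef struct { vec4_t val[2]; } vec4x2_t;     /* user-visible pair */
typedef struct { unsigned char bytes[sizeof (vec4x2_t)]; } opaque2_t;

_Static_assert (sizeof (opaque2_t) == sizeof (vec4x2_t),
                "opaque and user-visible layouts must match in size");

/* Hypothetical helper: return a copy of O with vector V placed in
   slot IDX.  */
opaque2_t
set_vec (opaque2_t o, vec4_t v, int idx)
{
  __builtin_memcpy (o.bytes + idx * sizeof (vec4_t), &v, sizeof (vec4_t));
  return o;
}

/* Style 1: build the opaque structure one vector at a time.  */
opaque2_t
pack_per_vector (vec4x2_t v)
{
  opaque2_t o = { { 0 } };
  o = set_vec (o, v.val[0], 0);
  o = set_vec (o, v.val[1], 1);
  return o;
}

/* Style 2: reinterpret the bytes through a union.  */
opaque2_t
pack_union (vec4x2_t v)
{
  union { vec4x2_t s; opaque2_t o; } u;
  u.s = v;
  return u.o;
}

/* Style 3: copy the whole structure in one step with
   __builtin_memcpy.  */
opaque2_t
pack_memcpy (vec4x2_t v)
{
  opaque2_t o;
  __builtin_memcpy (&o, &v, sizeof (o));
  return o;
}

/* Return 1 if A and B hold identical bytes, 0 otherwise.  */
int
same_bytes (opaque2_t a, opaque2_t b)
{
  return __builtin_memcmp (&a, &b, sizeof (a)) == 0;
}
```

All three functions produce the same bytes; the difference lies only in how much intermediate work the compiler has to see through. Whether a given compiler emits extra moves for any of them depends on the target and optimisation level, so this sketch shows the shape of the change rather than its code generation effect.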

It also adds new code generation tests to verify that superfluous move
instructions are not generated for the vst[234]q or vst1q_x[234] bfloat
intrinsics.

Regression tested and bootstrapped on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-07-30  Jonathan Wright  <jonathan.wri...@arm.com>

	* config/aarch64/arm_neon.h (vst1_bf16_x2): Use
	__builtin_memcpy instead of constructing an additional
	__builtin_aarch64_simd_oi one vector at a time.
	(vst1q_bf16_x2): Likewise.
	(vst1_bf16_x3): Use __builtin_memcpy instead of constructing
	an additional __builtin_aarch64_simd_ci one vector at a time.
	(vst1q_bf16_x3): Likewise.
	(vst1_bf16_x4): Use __builtin_memcpy instead of a union.
	(vst1q_bf16_x4): Likewise.
	(vst2_bf16): Use __builtin_memcpy instead of constructing an
	additional __builtin_aarch64_simd_oi one vector at a time.
	(vst2q_bf16): Likewise.
	(vst3_bf16): Use __builtin_memcpy instead of constructing an
	additional __builtin_aarch64_simd_ci mode one vector at a
	time.
	(vst3q_bf16): Likewise.
	(vst4_bf16): Use __builtin_memcpy instead of constructing an
	additional __builtin_aarch64_simd_xi one vector at a time.
	(vst4q_bf16): Likewise.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/vector_structure_intrinsics.c: Add new
	tests.

rb14731.patch