Hi all,

Turns out __builtin_convertvector is not as good a fit for the widening and narrowing intrinsics as I had hoped. During the veclower phase we lower most of it to bitfield operations and hope DCE cleans it back up into vector pack/unpack and extend operations. I received reports that in more complex cases GCC fails to do that and we're left with many vector extract operations that clutter the output.
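For reference, a minimal sketch of the __builtin_convertvector style of widening and narrowing discussed above, written against GCC's generic vector extensions so that it builds on any target. The type names (v8qi_t, v8hi_t) and helper names (widen_s8, narrow_s16) are illustrative assumptions and are not the definitions used in arm_neon.h.

```c
#include <stdint.h>

/* Hypothetical typedefs for illustration: 8 x int8 and 8 x int16 vectors. */
typedef int8_t  v8qi_t __attribute__ ((vector_size (8)));
typedef int16_t v8hi_t __attribute__ ((vector_size (16)));

/* Widen each signed 8-bit lane to 16 bits (sign-extending),
   the operation a vmovl_s8-style intrinsic performs.  */
static inline v8hi_t
widen_s8 (v8qi_t a)
{
  return __builtin_convertvector (a, v8hi_t);
}

/* Narrow each 16-bit lane to 8 bits, keeping the low byte,
   the operation a vmovn_s16-style intrinsic performs.  */
static inline v8qi_t
narrow_s16 (v8hi_t a)
{
  return __builtin_convertvector (a, v8qi_t);
}
```

Each conversion is element-wise, so the quality of the generated code depends on the compiler recognising the whole-vector pattern after lowering.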
I think veclower can be improved on that front, but for GCC 10 I'd like to just implement these builtins with a good old RTL builtin rather than inline asm.

Bootstrapped and tested on aarch64-none-linux-gnu.
Pushing to trunk.

Thanks,
Kyrill

gcc/
	* config/aarch64/aarch64-simd.md (aarch64_<su>xtl<mode>): Define.
	(aarch64_xtn<mode>): Likewise.
	* config/aarch64/aarch64-simd-builtins.def (sxtl, uxtl, xtn): Define
	builtins.
	* config/aarch64/arm_neon.h (vmovl_s8): Reimplement using builtin.
	(vmovl_s16): Likewise.
	(vmovl_s32): Likewise.
	(vmovl_u8): Likewise.
	(vmovl_u16): Likewise.
	(vmovl_u32): Likewise.
	(vmovn_s16): Likewise.
	(vmovn_s32): Likewise.
	(vmovn_s64): Likewise.
	(vmovn_u16): Likewise.
	(vmovn_u32): Likewise.
	(vmovn_u64): Likewise.

vmovnl.patch