https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68793
Bug ID: 68793
Summary: Bad optimization by split-wide-type on NEON
Product: gcc
Version: 5.2.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: linux at carewolf dot com
Target Milestone: ---

Enabling the optimization 'split-wide-types' causes worse code for NEON intrinsics than disabling it, and it is enabled by default by -O1.

It is triggered by multi-register intrinsics such as vst4 and vld4, and using a NEON-width wider than the native registers (128bit on aarch64 for instance).

    uint8x16x4_t tmp = vld4q_u8((const uint8_t*)(src + i));
    vst4q_u8((uint8_t*)(dst + i), {tmp.val[2], tmp.val[1], tmp.val[0], tmp.val[3]});

with -fno-split-wide-types generates

    ld4     {v4.16b - v7.16b}, [x5]
    orr     v0.16b, v6.16b, v6.16b
    orr     v1.16b, v5.16b, v5.16b
    orr     v2.16b, v4.16b, v4.16b
    orr     v3.16b, v7.16b, v7.16b
    st4     {v0.16b - v3.16b}, [x4]

But by default -O1 (with split-wide-types):

    ld4     {v0.16b - v3.16b}, [x5]
    umov    x14, v2.d[0]
    umov    x15, v2.d[1]
    umov    x12, v1.d[0]
    umov    x13, v1.d[1]
    umov    x10, v0.d[0]
    umov    x11, v0.d[1]
    stp     x14, x15, [sp]
    str     q3, [sp, 48]
    str     x12, [sp, 16]
    stp     x13, x10, [sp, 24]
    str     x11, [sp, 40]
    ld1     {v0.16b - v3.16b}, [sp]
    st4     {v0.16b - v3.16b}, [x8]
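For anyone trying to reproduce this, the two-line snippet from the report could be wrapped in a self-contained function along the following lines. This is only a sketch under assumptions, not the reporter's actual test case: the function name swap_channels, the uint32_t pixel type for src and dst, and the scalar fallback are all hypothetical, and the named temporary `out` replaces the braced argument list from the report (which may or may not affect the generated code). Compiling it for aarch64 once with -O1 and once with -O1 -fno-split-wide-types, then comparing the assembly, should show whether the difference described above appears.

```cpp
#include <cstddef>
#include <cstdint>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

// Hypothetical reproducer: swaps byte 0 and byte 2 of every 4-byte pixel.
// Assumes src and dst point to non-overlapping arrays of 32-bit pixels,
// which is a guess based on the casts in the report.
void swap_channels(const uint32_t *src, uint32_t *dst, size_t count)
{
    size_t i = 0;

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    // Same load/permute/store pattern as in the report, 16 pixels at a time.
    for (; i + 16 <= count; i += 16) {
        uint8x16x4_t tmp = vld4q_u8(reinterpret_cast<const uint8_t *>(src + i));
        uint8x16x4_t out = {{tmp.val[2], tmp.val[1], tmp.val[0], tmp.val[3]}};
        vst4q_u8(reinterpret_cast<uint8_t *>(dst + i), out);
    }
#endif

    // Scalar tail (and full fallback when NEON is unavailable).
    for (; i < count; ++i) {
        const uint8_t *s = reinterpret_cast<const uint8_t *>(src + i);
        uint8_t *d = reinterpret_cast<uint8_t *>(dst + i);
        const uint8_t b0 = s[0], b1 = s[1], b2 = s[2], b3 = s[3];
        d[0] = b2;
        d[1] = b1;
        d[2] = b0;
        d[3] = b3;
    }
}
```

The scalar loop is there so the function handles counts that are not a multiple of 16 and still builds on non-NEON targets; it is not part of the reported problem.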