https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109153
Bug ID: 109153
Summary: missed vector constructor optimizations
Product: gcc
Version: 13.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: middle-end
Assignee: unassigned at gcc dot gnu.org
Reporter: tnfchris at gcc dot gnu.org
Blocks: 47562
Target Milestone: ---
Target: aarch64*

The following example:

    #include <arm_neon.h>
    #include <string.h>

    uint32x2_t foo(const uint8_t *buf, int stride)
    {
        uint32_t a0, a1;
        memcpy(&a0, buf, 4);
        memcpy(&a1, buf + stride, 4);
        uint32x2_t a_u32 = vdup_n_u32(a0);
        return vset_lane_u32(a1, a_u32, 1);
    }

generates:

    foo:
            add     x1, x0, w1, sxtw
            ld1r    {v0.2s}, [x0]
            ld1     {v0.s}[1], [x1]
            ret

Here the ld1r replicates the first value into both lanes, and the ld1 then immediately overwrites lane 1.

At the GIMPLE level we have:

    _9 = {_4, _4};
    __vec_10 = BIT_INSERT_EXPR <_9, _7, 32 (32 bits)>;

which should have been optimized to:

    _9 = {_4, _7};

However, this cannot be done blindly, because the rewrite only pays off if the resulting constructor is cheaper to build. For instance, suppose the input had been

    _9 = {_4, _4, _4, _4};
    __vec_10 = BIT_INSERT_EXPR <_9, _7, 32 (32 bits)>;

In that case the equivalent constructor {_4, _7, _4, _4} would not have been cheaper.

Similarly, another testcase gives:

    <bb 2> [local count: 1073741824]:
    _4 = VEC_PERM_EXPR <a_2(D), b_3(D), { 0, 8, 1, 9, 2, 10, 3, 11 }>;
    _5 = VEC_PERM_EXPR <a_2(D), b_3(D), { 4, 12, 5, 13, 6, 14, 7, 15 }>;
    _6 = {_4, _5};
    return _6;

This should have been a single VEC_PERM_EXPR whose mask is the two masks above concatenated.

Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=47562
[Bug 47562] [meta-bug] keep track of Neon Intrinsics enhancements