https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99195
            Bug ID: 99195
           Summary: Optimise away vec_concat of 64-bit AdvancedSIMD operations
                    with zeroes in aarch64
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: ktkachov at gcc dot gnu.org
  Target Milestone: ---
            Target: aarch64

Motivating testcases:

#include <arm_neon.h>

/* Define foo_<OP>_<S>: apply the 64-bit intrinsic v<OP>_<S> to A and B,
   then widen the result to 128 bits by combining it with zeros.  */
#define ONE(OT,IT,OP,S)                                 \
OT                                                      \
foo_##OP##_##S (IT a, IT b)                             \
{                                                       \
  IT zeros = vcreate_##S (0);                           \
  return vcombine_##S (v##OP##_##S (a, b), zeros);      \
}

/* Build the output (OS lanes) and input (IS lanes) vector type names.  */
#define FUNC(T,IS,OS,OP,S) ONE (T##x##OS##_t, T##x##IS##_t, OP, S)

/* Instantiate FUNC for two to six operations of one element type.  */
#define OPTWO(T,IS,OS,S,OP1,OP2)        \
FUNC (T, IS, OS, OP1, S)                \
FUNC (T, IS, OS, OP2, S)

#define OPTHREE(T, IS, OS, S, OP1, OP2, OP3)    \
FUNC (T, IS, OS, OP1, S)                        \
OPTWO (T, IS, OS, S, OP2, OP3)

#define OPFOUR(T,IS,OS,S,OP1,OP2,OP3,OP4)       \
FUNC (T, IS, OS, OP1, S)                        \
OPTHREE (T, IS, OS, S, OP2, OP3, OP4)

#define OPFIVE(T,IS,OS,S,OP1,OP2,OP3,OP4, OP5)  \
FUNC (T, IS, OS, OP1, S)                        \
OPFOUR (T, IS, OS, S, OP2, OP3, OP4, OP5)

#define OPSIX(T,IS,OS,S,OP1,OP2,OP3,OP4,OP5,OP6)        \
FUNC (T, IS, OS, OP1, S)                                \
OPFIVE (T, IS, OS, S, OP2, OP3, OP4, OP5, OP6)

/* The 64-bit element types have no mul, hence OPFIVE for s64 and u64.  */
OPSIX (int8, 8, 16, s8, add, sub, mul, and, orr, eor)
OPSIX (int16, 4, 8, s16, add, sub, mul, and, orr, eor)
OPSIX (int32, 2, 4, s32, add, sub, mul, and, orr, eor)
OPFIVE (int64, 1, 2, s64, add, sub, and, orr, eor)

OPSIX (uint8, 8, 16, u8, add, sub, mul, and, orr, eor)
OPSIX (uint16, 4, 8, u16, add, sub, mul, and, orr, eor)
OPSIX (uint32, 2, 4, u32, add, sub, mul, and, orr, eor)
OPFIVE (uint64, 1, 2, u64, add, sub, and, orr, eor)

With this, foo_add_s8 for example generates:

foo_add_s8:
        add     v0.8b, v0.8b, v1.8b
        mov     v0.8b, v0.8b
        ret

The 64-bit V8QI ADD instruction implicitly zeroes the top 64 bits of the
128-bit destination register, so the vec_concat with zeroes is easy to
represent in a pattern and the MOV is redundant.
However, we don't have such patterns for all the AdvancedSIMD operations that
we support. Adding one by hand for each operation would bloat the MD files
quite a bit.
Can we come up with a define_subst scheme to auto-generate the patterns to
match things like:

(set (reg:V16QI 93 [ <retval> ])
    (vec_concat:V16QI (plus:V8QI (reg:V8QI 98)
            (reg:V8QI 99))
        (const_vector:V8QI [
                (const_int 0 [0]) repeated x8
            ])))

?

Then we should be able to just generate:

foo_add_s8:
        add     v0.8b, v0.8b, v1.8b
        ret

etc. The testcase above shows the problem for some of the simple binary ops,
but there are many more instructions that can benefit from this.
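
To illustrate why the extra MOV is redundant, here is a small portable C model of the register behaviour described above. It is only a sketch for illustration, not GCC code. The names vreg128, model_fill, model_add_8b, model_concat_zero, concat_is_redundant and self_check are hypothetical and do not come from GCC or arm_neon.h. The model assumes little-endian lane numbering, with lanes 0-7 forming the low 64 bits.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of one 128-bit vector register as 16 byte lanes.  */
typedef struct { uint8_t lane[16]; } vreg128;

/* Build a register whose lanes are seed, seed+1, ... so that the upper
   eight lanes hold non-zero "garbage" before the operation.  */
vreg128 model_fill (uint8_t seed)
{
  vreg128 r;
  for (int i = 0; i < 16; i++)
    r.lane[i] = (uint8_t) (seed + i);
  return r;
}

/* Model of "add vd.8b, vn.8b, vm.8b": lanes 0-7 receive the sums and
   the write to the 64-bit view clears lanes 8-15 of the destination.  */
vreg128 model_add_8b (vreg128 n, vreg128 m)
{
  vreg128 d;
  memset (&d, 0, sizeof d);
  for (int i = 0; i < 8; i++)
    d.lane[i] = (uint8_t) (n.lane[i] + m.lane[i]);
  return d;
}

/* Model of the vec_concat: keep the low 64 bits, zero the high 64.  */
vreg128 model_concat_zero (vreg128 low)
{
  vreg128 d;
  memset (&d, 0, sizeof d);
  memcpy (d.lane, low.lane, 8);
  return d;
}

/* Return 1 if the explicit concat leaves the ADD result unchanged.  */
int concat_is_redundant (vreg128 n, vreg128 m)
{
  vreg128 sum = model_add_8b (n, m);
  vreg128 combined = model_concat_zero (sum);
  return memcmp (&sum, &combined, sizeof sum) == 0;
}

/* Run the model on one sample input; aborts if the claim fails.  */
void self_check (void)
{
  vreg128 a = model_fill (1);
  vreg128 b = model_fill (200);
  assert (concat_is_redundant (a, b));
}
```

In this model the concat step never changes the result of the 64-bit ADD, which is the property a combined pattern would let the compiler exploit.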