https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98877

            Bug ID: 98877
           Summary: [AArch64] Inefficient code generated for tbl NEON
                    intrinsics
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: spop at gcc dot gnu.org
  Target Milestone: ---

The use of NEON intrinsics is inefficient and leads developers to prefer inline
assembly instead of intrinsics. A similar performance bug for vmlal intrinsics
was reported in https://gcc.gnu.org/PR92665

The code generated by GCC for table lookups is also inefficient:

$ cat red.c
#include "arm_neon.h"

uint8x16_t fun(uint8x16_t lo, uint8x16_t hi, uint8x16_t idx) {
  uint8x16x2_t tab = { .val = {lo, hi} };
  uint8x16_t res = vqtbl2q_u8(tab, idx);
  return res;
}

$ gcc -O3 -S -o- red.c
fun:
        mov     v4.16b, v0.16b
        mov     v5.16b, v1.16b
        tbl     v0.16b, {v4.16b - v5.16b}, v2.16b
        ret

$ clang -O3 -S -o- red.c
fun:
        tbl     v0.16b, { v0.16b, v1.16b }, v2.16b
        ret
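
For reference, both sequences compute the same lookup. The two mov
instructions in the GCC output only copy lo and hi into v4/v5, although the
arguments already arrive in the consecutive registers v0 and v1. Below is a
minimal scalar sketch of what the two-register tbl computes, so the result can
be checked on any host. The names tbl2_model and tbl2_model_demo are
hypothetical and are not part of arm_neon.h or GCC; the sketch assumes the
table bytes are laid out as lo in bytes 0..15 and hi in bytes 16..31.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of a two-register TBL lookup (not an ACLE API).
   table32 holds the 32 table bytes: lo in bytes 0..15, hi in bytes 16..31. */
static void tbl2_model(const uint8_t table32[32], const uint8_t idx[16],
                       uint8_t out[16])
{
    for (int i = 0; i < 16; i++) {
        /* TBL writes 0 to a lane whose index falls outside the table. */
        out[i] = (idx[i] < 32) ? table32[idx[i]] : 0;
    }
}

/* Fills out[] from a fixed example: table[i] = 100 + i, idx[i] = 3 * i. */
static void tbl2_model_demo(uint8_t out[16])
{
    uint8_t table[32];
    uint8_t idx[16];

    for (int i = 0; i < 32; i++)
        table[i] = (uint8_t)(100 + i);
    for (int i = 0; i < 16; i++)
        idx[i] = (uint8_t)(3 * i);   /* lanes 11..15 index past the table */

    tbl2_model(table, idx, out);

    /* Lane 0 uses index 0, so it must return the first table byte. */
    assert(out[0] == table[0]);
}
```

On an AArch64 target the same inputs could be loaded with vld1q_u8 and passed
to fun() from red.c to compare against this model.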