https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98877

            Bug ID: 98877
           Summary: [AArch64] Inefficient code generated for tbl NEON
                    intrinsics
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: spop at gcc dot gnu.org
  Target Milestone: ---

The code GCC generates for NEON intrinsics is inefficient, which leads
developers to prefer inline assembly over intrinsics.

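A similar performance bug for vmlal intrinsics was reported in
https://gcc.gnu.org/PR92665

The code generated by GCC for table lookups is also inefficient: it copies the
table into v4-v5 with two extra mov instructions before the tbl, whereas clang
uses the argument registers v0-v1 directly: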

$ cat red.c
#include "arm_neon.h"

uint8x16_t fun(uint8x16_t lo, uint8x16_t hi, uint8x16_t idx) {
  uint8x16x2_t tab = { .val = {lo, hi} };
  uint8x16_t res = vqtbl2q_u8(tab, idx);
  return res;
}

$ gcc -O3 -S -o- red.c
fun:
        mov     v4.16b, v0.16b
        mov     v5.16b, v1.16b
        tbl     v0.16b, {v4.16b - v5.16b}, v2.16b
        ret

$ clang -O3 -S -o- red.c
fun:
        tbl     v0.16b, { v0.16b, v1.16b }, v2.16b
        ret
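
For reference, the lookup that vqtbl2q_u8 performs in the test case above can
be modelled in scalar C. This is only an illustrative sketch: tbl2_ref is a
hypothetical helper, not part of arm_neon.h, and it assumes the architectural
behaviour of the two-register TBL form, where an index of 32 or more produces
a zero byte.

```c
#include <stdint.h>

/* Hypothetical scalar model of a two-register table lookup (not a GCC or
   ACLE function): indices 0..15 select a byte from lo, indices 16..31
   select a byte from hi, and any larger index yields 0. */
static void tbl2_ref(const uint8_t lo[16], const uint8_t hi[16],
                     const uint8_t idx[16], uint8_t res[16]) {
  for (int i = 0; i < 16; i++) {
    uint8_t j = idx[i];
    if (j < 16)
      res[i] = lo[j];        /* first table register */
    else if (j < 32)
      res[i] = hi[j - 16];   /* second table register */
    else
      res[i] = 0;            /* out-of-range index */
  }
}
```

Both compilers implement this with a single tbl instruction. The difference
reported here is only in the register copies that GCC emits before it.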
