https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68793

            Bug ID: 68793
           Summary: Bad optimization by split-wide-type on NEON
           Product: gcc
           Version: 5.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: linux at carewolf dot com
  Target Milestone: ---

Enabling the optimization -fsplit-wide-types produces worse code for NEON
intrinsics than disabling it, and it is enabled by default at -O1.

It is triggered by multi-register intrinsics such as vst4 and vld4, when the
combined NEON type is wider than a native register (128 bits on aarch64, for
instance).

uint8x16x4_t tmp = vld4q_u8((const uint8_t*)(src + i));
vst4q_u8((uint8_t*)(dst + i),
         {tmp.val[2], tmp.val[1], tmp.val[0], tmp.val[3]});

With -fno-split-wide-types, GCC generates:
        ld4     {v4.16b - v7.16b}, [x5]
        orr     v0.16b, v6.16b, v6.16b
        orr     v1.16b, v5.16b, v5.16b
        orr     v2.16b, v4.16b, v4.16b
        orr     v3.16b, v7.16b, v7.16b
        st4     {v0.16b - v3.16b}, [x4]

But with the default -O1 settings (split-wide-types enabled), GCC generates:
        ld4     {v0.16b - v3.16b}, [x5]
        umov    x14, v2.d[0]
        umov    x15, v2.d[1]
        umov    x12, v1.d[0]
        umov    x13, v1.d[1]
        umov    x10, v0.d[0]
        umov    x11, v0.d[1]
        stp     x14, x15, [sp]
        str     q3, [sp, 48]
        str     x12, [sp, 16]
        stp     x13, x10, [sp, 24]
        str     x11, [sp, 40]
        ld1     {v0.16b - v3.16b}, [sp]
        st4     {v0.16b - v3.16b}, [x8]
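
For anyone trying to reproduce this, the fragment from the report can be wrapped
in a complete function. The following is only a sketch under assumptions that
are not in the report: the function name swap_bytes_0_and_2 is hypothetical, src
and dst are assumed to be uint32_t pixel buffers, and the element count is
assumed to be a multiple of 16 (one vld4q_u8 covers 64 bytes). The shuffled
vectors are stored in a named variable rather than a braced temporary so that
the code also builds with compilers that implement the intrinsics as macros; it
has not been checked that this form produces exactly the assembly shown above. A
scalar fallback is included so the behaviour can be checked on non-NEON hosts.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical wrapper around the fragment from the report: for every
// 4-byte pixel, swap byte 0 and byte 2 and leave bytes 1 and 3 unchanged.
// Assumption: n (the pixel count) is a multiple of 16.
void swap_bytes_0_and_2(const uint32_t* src, uint32_t* dst, std::size_t n)
{
    assert(n % 16 == 0);
#if defined(__ARM_NEON)
    for (std::size_t i = 0; i < n; i += 16) {
        // De-interleaving load: val[k] holds byte k of each of 16 pixels.
        uint8x16x4_t tmp = vld4q_u8(reinterpret_cast<const uint8_t*>(src + i));
        // Reorder the four registers: channels 0 and 2 trade places.
        uint8x16x4_t out = {{tmp.val[2], tmp.val[1], tmp.val[0], tmp.val[3]}};
        // Interleaving store writes the pixels back in the new byte order.
        vst4q_u8(reinterpret_cast<uint8_t*>(dst + i), out);
    }
#else
    // Scalar fallback with the same byte-level result.
    const uint8_t* s = reinterpret_cast<const uint8_t*>(src);
    uint8_t* d = reinterpret_cast<uint8_t*>(dst);
    for (std::size_t p = 0; p < n; ++p) {
        d[4 * p + 0] = s[4 * p + 2];
        d[4 * p + 1] = s[4 * p + 1];
        d[4 * p + 2] = s[4 * p + 0];
        d[4 * p + 3] = s[4 * p + 3];
    }
#endif
}
```

Compiling this file for aarch64 at -O1 with -S, once with and once without
-fno-split-wide-types, should allow the two variants to be compared.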
