https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109153

            Bug ID: 109153
           Summary: missed vector constructor optimizations
           Product: gcc
           Version: 13.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: tnfchris at gcc dot gnu.org
            Blocks: 47562
  Target Milestone: ---
            Target: aarch64*

The following example:

#include <arm_neon.h>
#include <string.h>

uint32x2_t foo(const uint8_t *buf, int stride) {
  uint32_t a0, a1;
  memcpy(&a0, buf, 4);
  memcpy(&a1, buf + stride, 4);
  uint32x2_t a_u32 = vdup_n_u32(a0);
  return vset_lane_u32(a1, a_u32, 1);
}

generates

foo:
        add     x1, x0, w1, sxtw
        ld1r    {v0.2s}, [x0]
        ld1     {v0.s}[1], [x1]
        ret

where the initial value is replicated into both lanes, only for lane 1 to be
overwritten immediately afterwards.

At the gimple level we have:

  _9 = {_4, _4};
  __vec_10 = BIT_INSERT_EXPR <_9, _7, 32 (32 bits)>;

which should have been optimized to:

  _9 = {_4, _7};

but this cannot be done blindly: the rewrite is only worthwhile when the
resulting vector constructor is cheaper than the original sequence.

For instance, if it were

  _9 = {_4, _4, _4, _4};
  __vec_10 = BIT_INSERT_EXPR <_9, _7, 32 (32 bits)>;

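the rewrite would not have been cheaper, since three of the four lanes still
hold _4 and a duplicate followed by a single insert is already a good way to
build that vector.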
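
To make the two-lane case concrete, here is a minimal source-level sketch of
the two forms. It is not part of the original testcase: it uses GCC's generic
vector extension rather than arm_neon.h so that it builds on any target, and
the names v2u32, foo_dup_then_insert, foo_direct and check_equivalence are
hypothetical. It only illustrates that both forms compute the same vector; it
says nothing about which one a given target considers cheaper.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Two 32-bit lanes; a stand-in for uint32x2_t (assumption, not from the PR). */
typedef uint32_t v2u32 __attribute__((vector_size(8)));

/* Shape of the original code: broadcast a0, then overwrite lane 1. */
v2u32 foo_dup_then_insert(const uint8_t *buf, int stride) {
  uint32_t a0, a1;
  memcpy(&a0, buf, 4);
  memcpy(&a1, buf + stride, 4);
  v2u32 v = {a0, a0}; /* corresponds to _9 = {_4, _4} */
  v[1] = a1;          /* corresponds to the BIT_INSERT_EXPR at bit 32 */
  return v;
}

/* Shape after the proposed fold: both lanes come straight from the scalars. */
v2u32 foo_direct(const uint8_t *buf, int stride) {
  uint32_t a0, a1;
  memcpy(&a0, buf, 4);
  memcpy(&a1, buf + stride, 4);
  v2u32 v = {a0, a1}; /* corresponds to _9 = {_4, _7} */
  return v;
}

/* Hypothetical helper: checks that both forms agree on one input. */
void check_equivalence(void) {
  const uint8_t buf[12] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};
  v2u32 x = foo_dup_then_insert(buf, 8);
  v2u32 y = foo_direct(buf, 8);
  assert(x[0] == y[0]);
  assert(x[1] == y[1]);
}
```

Comparing the code generated for the two functions at -O2 is one way to see
whether the duplicate is still emitted.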

Similarly, another testcase gives

  <bb 2> [local count: 1073741824]:
  _4 = VEC_PERM_EXPR <a_2(D), b_3(D), { 0, 8, 1, 9, 2, 10, 3, 11 }>;
  _5 = VEC_PERM_EXPR <a_2(D), b_3(D), { 4, 12, 5, 13, 6, 14, 7, 15 }>;
  _6 = {_4, _5};
  return _6;

which should have been just a single VEC_PERM_EXPR.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=47562
[Bug 47562] [meta-bug] keep track of Neon Intrinsics enhancements
