https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82189

--- Comment #4 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Note that starting with GCC 14 on aarch64 we get:

        ldp     s31, s30, [x1]
        add     x1, x2, 4
        dup     v0.4s, v0.s[0]
        ld1     {v30.s}[1], [x1]
        ld1     {v31.s}[1], [x2]
        zip1    v30.4s, v31.4s, v30.4s
        fdiv    v0.4s, v30.4s, v0.4s
        str     q0, [x0]

And on the trunk we get:
        ldp     s31, s30, [x1]
        dup     v0.4s, v0.s[0]
        ldr     s29, [x2, 4]
        ld1     {v31.s}[1], [x2]
        uzp1    v30.2s, v30.2s, v29.2s
        zip1    v30.4s, v31.4s, v30.4s
        fdiv    v0.4s, v30.4s, v0.4s
        str     q0, [x0]

The trunk sequence is arguably slightly worse: the second lane load (`ld1 {v30.s}[1]`) has been replaced by a scalar `ldr` plus an extra `uzp1` to merge it back in.

This is all from:
```
  _1 = *b_9(D);
  _3 = MEM[(float *)b_9(D) + 4B];
  _5 = *c_15(D);
  _7 = MEM[(float *)c_15(D) + 4B];
  _18 = {_1, _3, _5, _7};
```

```
#define vec8 __attribute__((vector_size(8)))
#define vec16 __attribute__((vector_size(16)))

vec16 float f1(float *restrict a, float *restrict b)
{
  vec8 float t = {a[0], a[1]};
  vec8 float t1 = {b[0], b[1]};
  return __builtin_shufflevector(t, t1, 0, 1, 2, 3);
}
vec16 float f2(float *restrict a, float *restrict b)
{
  vec16 float t = {a[0], a[1], b[0], b[1]};
  return t;
}
```

We can optimize f1 but not f2.
