https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125303

            Bug ID: 125303
           Summary: vector operation before shuffle produces unvectorized
                    shuffle
           Product: gcc
           Version: 16.1.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: lists at coryfields dot com
  Target Milestone: ---

I'm not sure exactly what the precondition is for breaking the vectorized
shuffle, but the following illustrates the issue by doing an xor first:

typedef unsigned vec256 __attribute__((__vector_size__(32)));
void vec_xor(vec256& x)
{
    x ^= 1;
}
void vec_shuf(vec256& x)
{
    x = (vec256){x[4], x[0], x[5], x[1], x[6], x[2], x[7], x[3]};
}
void vec_xor_shuf(vec256& x)
{
    x ^= 1;
    x = (vec256){x[4], x[0], x[5], x[1], x[6], x[2], x[7], x[3]};
}

Godbolt link: https://godbolt.org/z/TEKWjx8jf

vec_xor and vec_shuf look as expected.

But vec_xor_shuf breaks down into a mess of loads and non-vectorized
operations.

aarch64 is perhaps the worst offender.
On aarch64, clang produces:

vec_xor_shuf(unsigned int vector[8]&):
        movi    v0.4s, #1
        ldp     q1, q2, [x0]
        eor     v4.16b, v1.16b, v0.16b
        eor     v3.16b, v2.16b, v0.16b
        st2     { v3.4s, v4.4s }, [x0]
        ret
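
The clang output works because this particular shuffle is an interleave of the
upper and lower 128-bit halves, which st2 performs as part of the store. A
scalar model of that reading, as a sketch only (interleave_store and
xor_shuf_as_interleave are hypothetical names, and plain arrays stand in for
the vector registers):

```cpp
#include <cstddef>

// Models what "st2 { a.4s, b.4s }, [out]" writes to memory:
// a[0], b[0], a[1], b[1], ...
void interleave_store(const unsigned (&a)[4], const unsigned (&b)[4],
                      unsigned (&out)[8])
{
    for (std::size_t i = 0; i < 4; ++i) {
        out[2 * i]     = a[i];
        out[2 * i + 1] = b[i];
    }
}

// vec_xor_shuf expressed as: xor both halves, then store them interleaved
// with the upper half first.
void xor_shuf_as_interleave(unsigned (&x)[8])
{
    unsigned lo[4];
    unsigned hi[4];
    for (std::size_t i = 0; i < 4; ++i) {
        lo[i] = x[i] ^ 1u;      // lanes 0..3, like q1 in the listing above
        hi[i] = x[i + 4] ^ 1u;  // lanes 4..7, like q2 in the listing above
    }
    interleave_store(hi, lo, x);
}
```

The result matches the {x[4], x[0], x[5], x[1], x[6], x[2], x[7], x[3]} order
in the source.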

While gcc16 produces:

vec_xor_shuf(unsigned int __vector(8)&):
        ldp     q30, q31, [x0]
        mov     x3, 0
        movi    v29.4s, 0x1
        mov     x1, 0
        mov     x2, 0
        eor     v30.16b, v30.16b, v29.16b
        eor     v29.16b, v31.16b, v29.16b
        movi    v31.2d, #0
        dup     s28, v29.s[1]
        ins     v31.s[0], v29.s[0]
        fmov    x4, d28
        dup     s28, v30.s[1]
        ins     v31.s[1], v30.s[0]
        bfi     x3, x4, 0, 32
        fmov    x4, d28
        dup     s28, v29.s[2]
        dup     s29, v29.s[3]
        bfi     x3, x4, 32, 32
        fmov    x4, d28
        dup     s28, v30.s[2]
        dup     s30, v30.s[3]
        bfi     x1, x4, 0, 32
        fmov    x4, d28
        bfi     x1, x4, 32, 32
        fmov    x4, d29
        bfi     x2, x4, 0, 32
        fmov    x4, d30
        bfi     x2, x4, 32, 32
        fmov    x4, d31
        stp     x1, x2, [x0, 16]
        stp     x4, x3, [x0]
        ret

x86_64 with -mavx fares poorly as well.

This causes a vectorized impl of chacha20 to be _MUCH_ slower with gcc than
clang.
