https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88013

ktkachov at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |ktkachov at gcc dot gnu.org

--- Comment #5 from ktkachov at gcc dot gnu.org ---
I see vectorisation for arm (and aarch64, FWIW) with:
-O3 -march=armv8-a -mfpu=neon-fp-armv8 -mfloat-abi=hard

gives the loop:
.L4:
        mov     r3, lr
        add     lr, lr, #48
        vld3.8  {d16, d18, d20}, [r3]!
        vld3.8  {d17, d19, d21}, [r3]
        vmull.u8 q12, d16, d30
        vmull.u8 q1, d18, d28
        vmull.u8 q2, d19, d29
        vmull.u8 q11, d17, d31
        vmull.u8 q3, d20, d26
        vadd.i16        q12, q12, q1
        vmull.u8 q10, d21, d27
        vadd.i16        q8, q11, q2
        vadd.i16        q9, q12, q3
        vadd.i16        q8, q8, q10
        vshr.u16        q9, q9, #8
        vshr.u16        q8, q8, #8
        vmovn.i16       d20, q9
        vmovn.i16       d21, q8
        vst1.8  {q10}, [ip]!
        cmp     ip, r4
        bne     .L4

Though of course it's not as tight as the assembly given in the link.
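
For reference, a minimal sketch of the kind of scalar loop that could produce the sequence above, assuming the source is a weighted sum of three interleaved byte channels (which is what the vld3.8 / vmull.u8 / vadd.i16 / vshr.u16 #8 / vmovn.i16 pattern suggests). The function name rgb_to_gray and the weights 77/151/28 are hypothetical and not taken from the report; in the assembly the weights live in the loop-invariant registers d26-d31. Each vector iteration consumes 48 input bytes (the "add lr, lr, #48"), i.e. 16 pixels.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical weights (not taken from the bug report). They sum to 256,
   so the 16-bit accumulator cannot overflow and >> 8 maps back to 0..255. */
enum { W_R = 77, W_G = 151, W_B = 28 };

/* Convert n packed RGB pixels (3 bytes each) into n grayscale bytes. */
void rgb_to_gray(const uint8_t *restrict rgb, uint8_t *restrict gray, size_t n)
{
    assert(n == 0 || (rgb != NULL && gray != NULL));
    for (size_t i = 0; i < n; i++) {
        /* Stride-3 loads: the access pattern vld3.8 de-interleaves. */
        unsigned r = rgb[3 * i + 0];
        unsigned g = rgb[3 * i + 1];
        unsigned b = rgb[3 * i + 2];
        /* Widening multiplies, adds, shift right by 8, narrow to a byte:
           the vmull.u8 / vadd.i16 / vshr.u16 / vmovn.i16 steps. */
        gray[i] = (uint8_t)((W_R * r + W_G * g + W_B * b) >> 8);
    }
}
```

The restrict qualifiers are a choice made for this sketch: they tell the compiler the input and output do not overlap, which can save it from emitting a runtime alias check before the vector loop.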
