https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88013

ktkachov at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |ktkachov at gcc dot gnu.org

--- Comment #5 from ktkachov at gcc dot gnu.org ---
I see vectorisation for arm (and aarch64 FWIW):
-O3 -march=armv8-a -mfpu=neon-fp-armv8 -mfloat-abi=hard gives the loop:
.L4:
        mov     r3, lr
        add     lr, lr, #48
        vld3.8  {d16, d18, d20}, [r3]!
        vld3.8  {d17, d19, d21}, [r3]
        vmull.u8        q12, d16, d30
        vmull.u8        q1, d18, d28
        vmull.u8        q2, d19, d29
        vmull.u8        q11, d17, d31
        vmull.u8        q3, d20, d26
        vadd.i16        q12, q12, q1
        vmull.u8        q10, d21, d27
        vadd.i16        q8, q11, q2
        vadd.i16        q9, q12, q3
        vadd.i16        q8, q8, q10
        vshr.u16        q9, q9, #8
        vshr.u16        q8, q8, #8
        vmovn.i16       d20, q9
        vmovn.i16       d21, q8
        vst1.8  {q10}, [ip]!
        cmp     ip, r4
        bne     .L4

Though of course it's not as tight as the assembly given in the link.
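
For readers without the test case to hand: the instruction mix in that loop (vld3.8 de-interleaving three byte channels, widening vmull.u8 multiplies, 16-bit adds, a shift right by 8 and a narrowing store of 16 bytes per iteration) is consistent with an RGB-to-grayscale style conversion. Below is a minimal sketch of a scalar loop of that shape. The function name, the weights and the exact types are assumptions made for illustration and are not taken from the bug report, so the real test case may well differ.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 8-bit fixed-point weights that sum to 256.
 * These are NOT from the bug report; they only illustrate the pattern. */
enum { W_R = 77, W_G = 150, W_B = 29 };

/* Convert n interleaved RGB pixels (3 bytes each) to n gray bytes.
 * Each output is a weighted sum of the three channels, scaled down by 256. */
void rgb_to_gray(const uint8_t *restrict rgb, uint8_t *restrict gray, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        unsigned r = rgb[3 * i + 0];
        unsigned g = rgb[3 * i + 1];
        unsigned b = rgb[3 * i + 2];

        /* The largest possible sum is 255 * 256 = 65280, which fits in
         * 16 bits, so a vectoriser may keep the arithmetic in 16-bit lanes. */
        gray[i] = (uint8_t)((W_R * r + W_G * g + W_B * b) >> 8);
    }
}
```

The restrict qualifiers are there so that the compiler need not assume the input and output buffers overlap. Whether a particular GCC version vectorises this exact sketch with the flags above has not been checked here.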