https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82989
--- Comment #13 from Matthijs van Duin <matthijsvanduin at gmail dot com> ---
In case it's of interest, I did a quick benchmark of my testcase executed in a
loop on a cortex-a8:

Without neon:  12 instructions/iteration  14 cycles/iteration
With neon:     14 instructions/iteration  35.2-35.3 cycles/iteration

(This includes 4 instructions for the loop itself.)

When using neon, the majority of the time is spent in a nasty pipeline stall
for moving data from neon registers to arm registers, which takes a minimum of
20 cycles according to the cortex-a8 TRM.
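
As a rough sanity check on those figures, the arithmetic can be sketched as below. This is only an illustration: the variable names are made up for this note, the inputs are the numbers quoted in the comment, and the 20-cycle value is the minimum transfer cost cited from the TRM, not a separately measured stall time.

```python
# Figures quoted in the comment (cortex-a8, per loop iteration).
cycles_without_neon = 14.0
cycles_with_neon = (35.2, 35.3)   # measured range
min_transfer_stall = 20.0         # minimum neon -> arm transfer cost cited from the TRM

# Extra cycles per iteration attributable to the neon version.
extra_cycles = tuple(c - cycles_without_neon for c in cycles_with_neon)

# Lower bound on the share of each neon iteration spent in the transfer stall,
# assuming (hypothetically) exactly one minimum-length stall per iteration.
stall_share = tuple(min_transfer_stall / c for c in cycles_with_neon)

# Overall slowdown of the neon version relative to the non-neon version.
slowdown = tuple(c / cycles_without_neon for c in cycles_with_neon)

print("extra cycles/iteration: %.1f to %.1f" % extra_cycles)
print("stall share of neon iteration: at least %.0f%%" % (100 * min(stall_share)))
print("slowdown: %.2fx to %.2fx" % slowdown)
```

Under that one-stall-per-iteration assumption, the extra cost per iteration is slightly above the 20-cycle minimum, and the stall alone accounts for more than half of each neon iteration, which is consistent with the "majority of the time" observation.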