https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82989

--- Comment #13 from Matthijs van Duin <matthijsvanduin at gmail dot com> ---
In case it's of interest, I did a quick benchmark of my testcase executed in a
loop on a Cortex-A8:

Without NEON:
    12 instructions/iteration
    14 cycles/iteration

With NEON:
    14 instructions/iteration
    35.2-35.3 cycles/iteration

(This includes 4 instructions for the loop itself.)

When using NEON, the majority of the time is spent in a nasty pipeline stall
caused by moving data from NEON registers to ARM registers, which takes a
minimum of 20 cycles according to the Cortex-A8 TRM.
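
As a rough sanity check of the figures above, the arithmetic can be sketched
in a few lines of Python. This is only an illustration: the helper names are
hypothetical, and it assumes a single NEON-to-ARM transfer stall per iteration,
which the benchmark numbers suggest but do not prove.

```python
def extra_cycles(base_cycles, neon_cycles):
    """Cycles per iteration added by the NEON variant over the non-NEON one."""
    return neon_cycles - base_cycles


def stall_share(stall_cycles, total_cycles):
    """Fraction of one iteration's cycles spent in the stall."""
    return stall_cycles / total_cycles


# Measured figures quoted in the comment (cycles per iteration).
WITHOUT_NEON = 14.0
WITH_NEON = (35.2, 35.3)

# Minimum NEON -> ARM transfer cost quoted from the TRM.
# Assumption: exactly one such stall occurs per iteration.
MIN_TRANSFER_STALL = 20

low, high = (extra_cycles(WITHOUT_NEON, c) for c in WITH_NEON)
print(f"extra cycles per iteration: {low:.1f} to {high:.1f}")
print(f"minimum stall share of an iteration: "
      f"{stall_share(MIN_TRANSFER_STALL, max(WITH_NEON)):.0%}")
```

Under that assumption, the extra cost of the NEON variant is slightly above the
20-cycle minimum, and the stall alone accounts for more than half of each
iteration, which matches the "majority of the time" observation.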
