ni...@lysator.liu.se (Niels Möller) writes:

> For processors that can issue two instructions per cycle, and with
> shorter latency, scalar code (i.e., code using only the general purpose
> 32-bit registers) could get more or less the same throughput. The scalar
> code also gets the advantage that there's a handy rotate instruction
> (instead of the shift right + shift left + combine used in the Neon
> code), but it has the disadvantage of register shortage, and will need a
> bunch of load and store instructions to access the state.
>
> That doesn't quite explain why I saw a 45% speedup with Neon in 2013,
> which has now disappeared. But maybe current gcc has good enough
> instruction scheduling to produce code that can issue 2 instructions per
> cycle on Cortex-A9 (which has quite limited out-of-order capabilities),
> and gcc back then couldn't do that?
>
> So what's next? Should the old code just be deleted? 
>
> With the new 2-way or 3-way functions, performance of the single-block
> functions isn't that critical, so deletion may be ok even if it causes
> some small regression on some processors (e.g., single-block chacha
> getting 13% slower on the old Cortex-A9)
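
To make the rotate point above concrete, here is a minimal sketch in C.
It is not taken from the Nettle sources, and the names rotl32 and
rotl32_lanes are hypothetical. The scalar function uses the usual
shift-or idiom, which compilers for 32-bit ARM typically turn into a
single rotate instruction. The lane-wise function models the three
separate steps (shift left, shift right, combine) that the quoted text
describes for the Neon code, with a plain four-element array standing
in for a 128-bit vector register.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar rotate left by n bits, 0 < n < 32. Written as the common
   shift-or idiom; on scalar ARM this is usually compiled to one
   rotate instruction (an assumption about the compiler, not a
   guarantee). */
static uint32_t
rotl32 (uint32_t x, unsigned n)
{
  assert (n > 0 && n < 32);
  return (x << n) | (x >> (32 - n));
}

/* Lane-wise rotate of four 32-bit words, spelled out as the three
   separate steps a vector unit without a rotate instruction needs.
   The array is only a stand-in for a vector register. */
static void
rotl32_lanes (uint32_t dst[4], const uint32_t src[4], unsigned n)
{
  uint32_t left[4], right[4];
  unsigned i;

  assert (n > 0 && n < 32);
  for (i = 0; i < 4; i++)
    left[i] = src[i] << n;           /* step 1: shift left */
  for (i = 0; i < 4; i++)
    right[i] = src[i] >> (32 - n);   /* step 2: shift right */
  for (i = 0; i < 4; i++)
    dst[i] = left[i] | right[i];     /* step 3: combine */
}
```

Both functions compute the same result per word; the difference is only
in how many instructions each rotate is expected to cost.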

I've made a branch, "delete-1-way-neon", that deletes this code. Any
comments before I merge it to master?

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
_______________________________________________
nettle-bugs mailing list
nettle-bugs@lists.lysator.liu.se
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
