ni...@lysator.liu.se (Niels Möller) writes:

> For processors that can issue two instructions per cycle, and with
> shorter latency, scalar code (i.e., code using only the general purpose
> 32-bit registers) could get more or less the same throughput. The scalar
> code also gets the advantage that there's a handy rotate instruction
> (instead of the shift right + shift left + combine used in the Neon
> code), but it has the disadvantage of register shortage, and will need a
> bunch of load and store instructions to access the state.
>
> That doesn't quite explain why I saw a 45% speedup with Neon in 2013,
> which has now disappeared. But maybe current gcc has good enough
> instruction scheduling to produce code that can issue 2 instructions per
> cycle on Cortex-A9 (which has quite limited out-of-order capabilities),
> and gcc back then couldn't do that?
>
> So what's next? Should the old code just be deleted?
>
> With the new 2-way or 3-way functions, performance of the single-block
> functions isn't that critical, so deletion may be ok even if it causes
> some small regression on some processors (e.g., single-block chacha
> getting 13% slower on the old Cortex-A9).
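
For reference, the rotate point above can be sketched in plain C. This is only an illustration, not the code in Nettle; the names rotl32 and quarter_round are made up for the example. The rotate is written as shift left + shift right + combine, which is the form the Neon code has to spell out, while a compiler targeting the scalar 32-bit registers can typically turn the same expression into a single rotate instruction.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate x left by n bits, written as shift + shift + combine.
   Hypothetical helper for illustration; requires 0 < n < 32 so
   that neither shift count reaches the word size. */
static inline uint32_t
rotl32 (uint32_t x, unsigned n)
{
  assert (n > 0 && n < 32);
  return (x << n) | (x >> (32 - n));
}

/* One ChaCha quarter round on four state words. Each of the four
   steps is add, xor, rotate, so rotate cost matters. */
static void
quarter_round (uint32_t *a, uint32_t *b, uint32_t *c, uint32_t *d)
{
  *a += *b; *d ^= *a; *d = rotl32 (*d, 16);
  *c += *d; *b ^= *c; *b = rotl32 (*b, 12);
  *a += *b; *d ^= *a; *d = rotl32 (*d, 8);
  *c += *d; *b ^= *c; *b = rotl32 (*b, 7);
}
```

The full state is 16 such words, which is where the register shortage for scalar code comes from: they don't all fit in the general purpose registers together with temporaries, hence the extra loads and stores.
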
I've made a branch with deletion of this code, "delete-1-way-neon". Any
comments before I merge to master?

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.