Re: [Arm64, PowerPC64, S390x] Optimize Poly1305

2022-01-27 Thread Niels Möller
Maamoun TK writes:

> Great! I believe this is the best we can get for processing one block.

One may be able to squeeze out one or two cycles more using the mulx extension, which should make it possible to eliminate some of the move instructions (I don't think moves cost any execution unit

Re: [Arm64, PowerPC64, S390x] Optimize Poly1305

2022-01-27 Thread Maamoun TK
On Thu, Jan 27, 2022 at 11:28 PM Niels Möller wrote:

> ni...@lysator.liu.se (Niels Möller) writes:
>
> >> Radix 64: 2.75 GByte/s, i.e., faster than current x86_64 asm version.
> >
> > And I've now tried the same method for the x86_64 implementation. See
> > attached file + needed patch to

Re: [Arm64, PowerPC64, S390x] Optimize Poly1305

2022-01-27 Thread Niels Möller
ni...@lysator.liu.se (Niels Möller) writes:

>> Radix 64: 2.75 GByte/s, i.e., faster than current x86_64 asm version.
>
> And I've now tried the same method for the x86_64 implementation. See
> attached file + needed patch to asm.m4. This gives 2.9 GByte/s.
>
> I'm not entirely sure cycle numbers
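
For readers following the "radix 64" figures quoted in this thread, here is a minimal Python sketch of one Poly1305 block step, assuming "radix 64" means the 130-bit state is held in 64-bit limbs (two full limbs plus a small top limb). It is not the code attached to these messages. The names `block_step_radix64`, `to_limbs` and `from_limbs` are hypothetical, and Python integers stand in for the 64x64->128-bit multiplies an assembly version would use.

```python
# Sketch only: one Poly1305 block step, h = (h + m) * r mod 2^130 - 5,
# with the state kept as limbs (h0, h1, h2) in radix 2^64.
# All function names here are illustrative assumptions.

MASK64 = (1 << 64) - 1
P = (1 << 130) - 5
# Standard Poly1305 clamp applied to the 128-bit r half of the key.
CLAMP = 0x0ffffffc0ffffffc0ffffffc0fffffff


def to_limbs(x):
    """Split an integer into (h0, h1, h2) with h0, h1 below 2^64."""
    return x & MASK64, (x >> 64) & MASK64, x >> 128


def from_limbs(h):
    """Recombine (h0, h1, h2) into one integer."""
    return h[0] + (h[1] << 64) + (h[2] << 128)


def block_step_radix64(h, r, m):
    """h: limb triple, r: clamped 128-bit int, m: block with its pad bit set."""
    r0, r1 = r & MASK64, r >> 64
    # 2^130 = 5 (mod p), so 2^128 * r1 = 5 * (r1 / 4) (mod p).
    # The clamp clears the low two bits of r1, so the division is exact.
    s1 = r1 + (r1 >> 2)

    h0, h1, h2 = h
    # Add the message block, propagating carries limb by limb.
    t = h0 + (m & MASK64)
    h0 = t & MASK64
    t = h1 + ((m >> 64) & MASK64) + (t >> 64)
    h1 = t & MASK64
    h2 = h2 + (m >> 128) + (t >> 64)

    # Schoolbook multiply, with the high partial products pre-reduced via s1.
    d0 = h0 * r0 + h1 * s1
    d1 = h0 * r1 + h1 * r0 + h2 * s1
    d2 = h2 * r0

    # Carry propagation back to 64-bit limbs.
    h0 = d0 & MASK64
    d1 += d0 >> 64
    h1 = d1 & MASK64
    d2 += d1 >> 64

    # Fold everything at or above bit 130 back in, multiplied by 5.
    c = (d2 >> 2) * 5
    h2 = d2 & 3
    t = h0 + c
    h0 = t & MASK64
    t = h1 + (t >> 64)
    h1 = t & MASK64
    h2 += t >> 64
    return h0, h1, h2
```

The result is only partially reduced (it may exceed p by a small amount), which is the usual choice for the per-block step; a full reduction is needed only once, when the tag is produced.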