Re: [Arm64, PowerPC64, S390x] Optimize Poly1305

2022-01-23 Thread David Edelsohn
On Sun, Jan 23, 2022 at 4:41 PM Maamoun TK wrote:
> On Sun, Jan 23, 2022 at 9:10 PM Niels Möller wrote:
> > ni...@lysator.liu.se (Niels Möller) writes:
> > > The current C implementation uses radix 26, and 25 multiplies (32x32
> > > --> 64) per block. And quite a lot of shifts. A radix

Re: [Arm64, PowerPC64, S390x] Optimize Poly1305

2022-01-23 Thread Maamoun TK
On Sun, Jan 23, 2022 at 9:10 PM Niels Möller wrote:
> ni...@lysator.liu.se (Niels Möller) writes:
> > The current C implementation uses radix 26, and 25 multiplies (32x32
> > --> 64) per block. And quite a lot of shifts. A radix 32 variant
> > analogous to the above would need 16 long

Re: [Arm64, PowerPC64, S390x] Optimize Poly1305

2022-01-23 Thread Niels Möller
ni...@lysator.liu.se (Niels Möller) writes:
> The current C implementation uses radix 26, and 25 multiplies (32x32
> --> 64) per block. And quite a lot of shifts. A radix 32 variant
> analogous to the above would need 16 long multiplies and 4 short. I'd
> expect that to be faster on most
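
For readers following the multiply count quoted in this thread, here is a minimal sketch of one radix 2^26 multiplication step, h <- h * r mod 2^130 - 5, written out so the 25 32x32 --> 64 products are visible. It is not taken from Nettle; the function name poly1305_mul_radix26, the LIMB_MASK macro, and the limb layout (five 26-bit limbs, least significant first) are assumptions made for illustration. It also assumes r is clamped as Poly1305 requires, and it leaves h only partially reduced.

```c
#include <assert.h>
#include <stdint.h>

#define LIMB_MASK 0x3ffffffu  /* low 26 bits */

/* Hypothetical helper, not Nettle's code: h <- h * r mod 2^130 - 5.
   h and r hold five limbs each, least significant first.
   Assumes r limbs are < 2^26 and h limbs are < 2^27. */
static void
poly1305_mul_radix26 (uint32_t h[5], const uint32_t r[5])
{
  uint32_t s1, s2, s3, s4;
  uint64_t d0, d1, d2, d3, d4, c;
  int i;

  for (i = 0; i < 5; i++)
    assert ((h[i] >> 27) == 0 && r[i] <= LIMB_MASK);

  /* 2^130 = 5 (mod 2^130 - 5), so partial products that wrap past
     2^130 are folded back with a factor of 5.  These four small
     multiplies depend only on the key and can be done once. */
  s1 = r[1] * 5;
  s2 = r[2] * 5;
  s3 = r[3] * 5;
  s4 = r[4] * 5;

  /* 5 x 5 = 25 multiplies of 32-bit operands with 64-bit results. */
  d0 = (uint64_t) h[0] * r[0] + (uint64_t) h[1] * s4 + (uint64_t) h[2] * s3
     + (uint64_t) h[3] * s2 + (uint64_t) h[4] * s1;
  d1 = (uint64_t) h[0] * r[1] + (uint64_t) h[1] * r[0] + (uint64_t) h[2] * s4
     + (uint64_t) h[3] * s3 + (uint64_t) h[4] * s2;
  d2 = (uint64_t) h[0] * r[2] + (uint64_t) h[1] * r[1] + (uint64_t) h[2] * r[0]
     + (uint64_t) h[3] * s4 + (uint64_t) h[4] * s3;
  d3 = (uint64_t) h[0] * r[3] + (uint64_t) h[1] * r[2] + (uint64_t) h[2] * r[1]
     + (uint64_t) h[3] * r[0] + (uint64_t) h[4] * s4;
  d4 = (uint64_t) h[0] * r[4] + (uint64_t) h[1] * r[3] + (uint64_t) h[2] * r[2]
     + (uint64_t) h[3] * r[1] + (uint64_t) h[4] * r[0];

  /* Carry propagation: the shifts and masks that radix 2^26 needs. */
  c = d0 >> 26; h[0] = (uint32_t) (d0 & LIMB_MASK);
  d1 += c; c = d1 >> 26; h[1] = (uint32_t) (d1 & LIMB_MASK);
  d2 += c; c = d2 >> 26; h[2] = (uint32_t) (d2 & LIMB_MASK);
  d3 += c; c = d3 >> 26; h[3] = (uint32_t) (d3 & LIMB_MASK);
  d4 += c; c = d4 >> 26; h[4] = (uint32_t) (d4 & LIMB_MASK);

  /* The carry out of the top limb has weight 2^130, so it re-enters
     at the bottom multiplied by 5. */
  d0 = (uint64_t) h[0] + c * 5;
  h[0] = (uint32_t) (d0 & LIMB_MASK);
  h[1] += (uint32_t) (d0 >> 26);
}
```

As a quick check of the wrap-around, setting h = 2^104 (only h[4] = 1) and r = 2^26 (only r[1] = 1) gives the product 2^130, which should reduce to 5 in the lowest limb.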