On Sun, Oct 11, 2020 at 1:42 PM Niels Möller <ni...@lysator.liu.se> wrote:
>
> ni...@lysator.liu.se (Niels Möller) writes:
>
> > So if we have the input in register A (loaded from memory with no
> > processing besides ensuring proper *byte* order), and precompute two
> > values, M representing b_1(x) x^64 + c_1(x), and L representing b_0(x)
> > x^64 + d_1(x), then we get the two halves above with two vpmsumd,
> >
> >   vpmsumd R, M, A
> >   vpmsumd F, L, A
> >
> > When doing more than one block at a time, I think it's easiest to
> > accumulate the R and F values separately.
>
> BTW, I wonder if similar organization would make sense for Arm Neon.
> Now, Neon doesn't have vpmsumd, the widest carryless multiplication
> available is vmull.p8, which is an 8-bit to 15-bit multiply, 8 in
> parallel...

I may be mistaken, but I believe 64-bit poly multiplies are available,
at least on AArch64 with the Crypto extensions (PMULL/PMULL2).

I'm not aware of 64-bit poly multiplies on other ARM arches, like ARMv6
or ARMv7 with NEON.
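
For reference, here is a rough portable sketch of what a 64-bit by
64-bit carry-less (polynomial) multiply computes, i.e. the operation a
single PMULL would perform. The names clmul128, clmul64 and
clmul64_selfcheck are made up for illustration and are not from Nettle.
On AArch64 with the Crypto extensions I would expect the vmull_p64
intrinsic from arm_neon.h to give the same 128-bit result, but I have
not tested that here.

```c
#include <assert.h>
#include <stdint.h>

/* 128-bit carry-less product, split into two 64-bit halves.
   (Hypothetical type, for illustration only.) */
typedef struct { uint64_t lo, hi; } clmul128;

/* Portable bit-by-bit reference: multiply the polynomials a(x) and
   b(x) over GF(2). Slow, but shows what the hardware instruction
   is expected to compute. */
static clmul128
clmul64 (uint64_t a, uint64_t b)
{
  clmul128 r = { 0, 0 };
  unsigned i;

  for (i = 0; i < 64; i++)
    if ((b >> i) & 1)
      {
        /* Add (xor) a(x) * x^i into the 128-bit accumulator. */
        r.lo ^= a << i;
        if (i > 0)
          r.hi ^= a >> (64 - i);  /* i == 0 would be a shift by 64 */
      }
  return r;
}

/* Small sanity check: (x + 1)^2 = x^2 + 1 over GF(2). */
static void
clmul64_selfcheck (void)
{
  clmul128 r = clmul64 (3, 3);
  assert (r.lo == 5 && r.hi == 0);
}
```

The result is kept as two 64-bit halves rather than a compiler-specific
128-bit type, so the sketch should build with any C99 compiler.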

Jeff
_______________________________________________
nettle-bugs mailing list
nettle-bugs@lists.lysator.liu.se
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
