Maamoun TK writes:
> Great! I believe this is the best we can get for processing one block.
One may be able to squeeze out one or two more cycles using the mulx
extension, which should make it possible to eliminate some of the move
instructions (I don't think moves cost any execution unit

On Thu, Jan 27, 2022 at 11:28 PM Niels Möller wrote:
> ni...@lysator.liu.se (Niels Möller) writes:
>
> >> Radix 64: 2.75 GByte/s, i.e., faster than current x86_64 asm version.
> >
> > And I've now tried the same method for the x86_64 implementation. See
> > attached file + needed patch to

ni...@lysator.liu.se (Niels Möller) writes:
>> Radix 64: 2.75 GByte/s, i.e., faster than current x86_64 asm version.
>
> And I've now tried the same method for the x86_64 implementation. See
> attached file + needed patch to asm.m4. This gives 2.9 GByte/s.
>
> I'm not entirely sure cycle numbers