Hi everyone,

This past week I've been building on Rika's work by adding an x86_64 assembly version of SHA-1 to complement Rika's i386 version.  So far I have a version that runs twice as fast as the Pascal code.  I hoped to go faster still by making use of the SSE2 instruction set, but the end result is currently slower, even though computing the common parts of 4 rounds simultaneously should be much faster.  This happens even when I avoid writing to the stack and keep almost all of the state in registers.  Preliminary investigation suggests that the slowdown comes from using MOVD/MOVQ to transfer data between the XMM registers and the general-purpose registers, since the two register sets belong to different parts of the CPU.  I'm still amazed that it causes this much latency, though.
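
To make the pattern concrete, here is a rough C sketch of what I mean. It is not my actual assembly, and it assumes that the "common parts" are the message word plus the round constant (W[t] + K), which do not depend on the running hash state. The helper name add_k_x4 is made up for illustration, and a compiler is free to emit different instructions from the ones named in the comments.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics */
#endif

/* Round constant for SHA-1 rounds 0..19, as given in the SHA-1 specification. */
#define SHA1_K0 0x5A827999u

/* Hypothetical helper: out[i] = w[i] + k (mod 2^32) for four consecutive rounds. */
static void add_k_x4(const uint32_t w[4], uint32_t k, uint32_t out[4])
{
#if defined(__SSE2__)
    __m128i vw  = _mm_loadu_si128((const __m128i *)w);  /* four message words */
    __m128i vk  = _mm_set1_epi32((int)k);               /* constant in every lane */
    __m128i sum = _mm_add_epi32(vw, vk);                /* one PADDD covers four rounds */

    /* The scalar part of each round needs its lane in a general-purpose
       register: lane 0 is a plain MOVD, the other lanes need a byte shift
       (PSRLDQ) first. These are the transfers I suspect of being slow. */
    out[0] = (uint32_t)_mm_cvtsi128_si32(sum);
    out[1] = (uint32_t)_mm_cvtsi128_si32(_mm_srli_si128(sum, 4));
    out[2] = (uint32_t)_mm_cvtsi128_si32(_mm_srli_si128(sum, 8));
    out[3] = (uint32_t)_mm_cvtsi128_si32(_mm_srli_si128(sum, 12));
#else
    /* Portable fallback with the same result when SSE2 is not available. */
    for (int i = 0; i < 4; i++) {
        out[i] = w[i] + k;
    }
#endif
}
```

The vector add itself is a single instruction, so the saving is real on paper; the cost is in the four register-file crossings that follow it, one per round.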

I'll keep investigating to see if I can squeeze out more performance; otherwise I may just have to fall back on a non-SIMD-optimised implementation.

Kit

_______________________________________________
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
