Hi everyone,
This past week I've been building on Rika's work by adding an
assembly version of SHA-1 for x86_64 to complement Rika's i386 version.
So far I have a version that runs twice as fast as the
Pascal code. I had hoped to go even faster by using the SSE2
instruction set, but currently the SSE2 version is slower, even though
computing the common parts of 4 rounds simultaneously should be much
faster. This happens even when I avoid writing to the stack and keep
pretty much all of the state in registers. Preliminary
investigation suggests that the slowdown comes from using MOVD/MOVQ to
transfer data between the XMM registers and the general-purpose registers,
since they belong to different parts of the CPU. I'm still amazed it
causes this much latency, though.
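
For anyone who wants to picture the idea, here is a minimal C sketch.
It is not the actual assembly, and every name in it is hypothetical.
It assumes that "the common parts of 4 rounds" means the W[t] + K term,
which does not depend on the a..e working state and so can be computed
for four rounds with a single PADDD. The second function pulls each lane
back into a general-purpose register with MOVD, which is the transfer
suspected of causing the slowdown. A compiler may well choose different
instructions than the comments suggest.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* SHA-1 round constant for rounds 0..19 (from the SHA-1 specification). */
#define SHA1_K0 0x5A827999u

/* Hypothetical helper: scalar reference, one addition per round. */
static void wk_scalar(const uint32_t w[4], uint32_t out[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = w[i] + SHA1_K0;
}

/* Hypothetical helper: W[t] + K for four rounds at once, with each
 * lane then moved from the XMM register to a general-purpose one. */
static void wk_via_movd(const uint32_t w[4], uint32_t out[4])
{
#if defined(__SSE2__)
    __m128i v = _mm_loadu_si128((const __m128i *)w);
    v = _mm_add_epi32(v, _mm_set1_epi32((int)SHA1_K0)); /* PADDD */
    for (int i = 0; i < 4; i++) {
        out[i] = (uint32_t)_mm_cvtsi128_si32(v); /* MOVD r32, xmm */
        v = _mm_srli_si128(v, 4);                /* PSRLDQ: next lane */
    }
#else
    wk_scalar(w, out); /* fallback when SSE2 is unavailable */
#endif
}

/* Hypothetical helper: one SHA-1 round for rounds 0..19, taking the
 * precomputed W[t] + K. s holds the working state a, b, c, d, e. */
static void sha1_round_0_19(uint32_t s[5], uint32_t wk)
{
    uint32_t f = (s[1] & s[2]) | (~s[1] & s[3]);           /* Ch(b,c,d) */
    uint32_t t = ((s[0] << 5) | (s[0] >> 27)) + f + s[4] + wk;
    s[4] = s[3];
    s[3] = s[2];
    s[2] = (s[1] << 30) | (s[1] >> 2);
    s[1] = s[0];
    s[0] = t;
}

/* Hypothetical self-check: both paths must give the same W[t] + K. */
static void wk_check(const uint32_t w[4])
{
    uint32_t a[4], b[4];
    wk_scalar(w, a);
    wk_via_movd(w, b);
    for (int i = 0; i < 4; i++)
        assert(a[i] == b[i]);
}
```

Each of the four values produced by wk_via_movd would feed one call to
sha1_round_0_19. The rounds themselves remain a serial dependency chain
on the working state, so only the W[t] + K part benefits from SIMD.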
I'll keep investigating to see if I can squeeze out more
performance, but otherwise I may just have to fall back on a
non-SIMD-optimised implementation.
Kit
_______________________________________________
fpc-devel maillist - fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel