On Mon, 20 Dec 2004, Marc Bevand wrote:

> SHA-1: Dean already worked on this, using SSE2.

it looks like the openssl cvs HEAD generally beats my sha1 code for 32-bit
x86 platforms in most cases now, and generally ties my sha256 code when
compiled with gcc... nice work Andy.

here's some data i collected saturday -- it's all in cycles per byte
(lower is better) for 8192-byte buffer:

                openssl   SSE2      SSE2      SSE2
                cvs head  gcc-cvs   gcc34     icc71
sha1:
  p4 model 3    9.59      16.9      21.4      14.2
  p4 model 2    10.6      15.4      28.4      13.5
  p-m           10.3      15.0      14.4      13.3
  k8            8.18      10.4      10.3      8.70
  efficeon      9.40      7.1       7.04      6.20
sha256:
  p4 model 3    31.8      51.8      38.8      31.3
  p4 model 2    38.6      46.5      38.1      39.2
  p-m           32.7      34.3      32.0      29.0
  k8            25.9      29.2      22.2      21.6
  efficeon      27.9      20.9      15.4      16.4

notes:

- openssl cvs head as of 20041218, 32-bit x86 only (even on k8 and p4-3)
- gcc cvs head as of 20041218
- gcc34 is the gcc-3.4.2-3 debian package
- icc71 is the intel C compiler 7.1 build 20030307Z
- p4 model 2 is the most widespread p4; p4 model 3 covers the recent chips
  only -- all EM64T parts are p4 model 3
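
For readers who want to collect comparable numbers, here is a minimal sketch of a cycles-per-byte measurement in C. It is not the harness behind the table above (the message does not say how the figures were gathered). The names `min_cycles_per_byte`, `cycle_counter_fn`, `hash_fn`, `fake_counter` and `fake_hash` are hypothetical, and taking the minimum over several trials is an assumption made here to reduce noise. On x86 with gcc, the counter argument would typically be a thin wrapper around `__rdtsc()` from `<x86intrin.h>`; the stand-in counter below only exists so the sketch runs anywhere.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback types: a cycle counter and a hash under test. */
typedef uint64_t (*cycle_counter_fn)(void);
typedef void (*hash_fn)(const unsigned char *buf, size_t len);

/*
 * Run `hash` over `buf` `trials` times and return the best (lowest)
 * cycles-per-byte figure observed.  Returns -1.0 on bad arguments.
 */
double min_cycles_per_byte(hash_fn hash, cycle_counter_fn now,
                           const unsigned char *buf, size_t len, int trials)
{
    if (trials <= 0 || len == 0)
        return -1.0;

    uint64_t best = UINT64_MAX;
    for (int i = 0; i < trials; i++) {
        uint64_t start = now();
        hash(buf, len);
        uint64_t elapsed = now() - start;
        if (elapsed < best)
            best = elapsed;
    }
    return (double)best / (double)len;
}

/* Stand-ins for illustration only: a counter that advances by a fixed
 * 81920 ticks per call, and a hash that does nothing. */
static uint64_t fake_ticks;

uint64_t fake_counter(void)
{
    uint64_t t = fake_ticks;
    fake_ticks += 81920;
    return t;
}

void fake_hash(const unsigned char *buf, size_t len)
{
    (void)buf;
    (void)len;
}
```

With the stand-ins, every trial appears to take 81920 ticks, so an 8192-byte buffer comes out at 10 cycles per byte. A real run would pass the actual hash routine and a real cycle counter instead.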

notice how all over the map gcc is. efficeon generally doesn't care much
because we reschedule/reoptimize everything anyhow. but lacking any sort
of sane results from gcc i'm not going to advocate even my sha256 code for
general consumption yet.

in case anyone wants the code it's at
<http://arctic.org/~dean/crypto/sha-sse2-20041218.tar.bz2>

> RSA: The compiler already does a good job with 64-bit arithmetic.

there's potential to improve DSA on 32-bit x86 with further application of
the sse2 techniques already committed to 32-bit x86 for RSA ... basically
the special-purpose bn squaring code can be extended as well. it'll still
be nowhere near as good as 64-bit native though, interesting only to folks
still on 32-bit platforms.
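
To show why a special-purpose squaring routine pays off at all, here is a minimal sketch in plain C with 32-bit limbs. It is not the openssl code and uses no SSE2; `bn_sqr_sketch` is a hypothetical name and the little-endian limb layout is an assumption. The idea is that each cross product a[i]*a[j] with i < j is computed once and then doubled, so squaring needs roughly half the limb multiplies of a general multiply.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * r[0..2n-1] = a[0..n-1] squared, limbs stored least significant first.
 * Schoolbook squaring: cross products once, doubled, plus the diagonal.
 */
void bn_sqr_sketch(uint32_t *r, const uint32_t *a, size_t n)
{
    for (size_t i = 0; i < 2 * n; i++)
        r[i] = 0;

    /* Step 1: sum of a[i]*a[j] for i < j, each product computed once. */
    for (size_t i = 0; i < n; i++) {
        uint64_t carry = 0;
        for (size_t j = i + 1; j < n; j++) {
            uint64_t t = (uint64_t)a[i] * a[j] + r[i + j] + carry;
            r[i + j] = (uint32_t)t;
            carry = t >> 32;
        }
        r[i + n] = (uint32_t)carry;
    }

    /* Step 2: double the cross-product sum (shift left by one bit). */
    uint32_t top = 0;
    for (size_t i = 0; i < 2 * n; i++) {
        uint32_t next = r[i] >> 31;
        r[i] = (r[i] << 1) | top;
        top = next;
    }

    /* Step 3: add the diagonal squares a[i]^2 at limb position 2*i. */
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t sq = (uint64_t)a[i] * a[i];
        uint64_t lo = (uint64_t)r[2 * i] + (uint32_t)sq + carry;
        r[2 * i] = (uint32_t)lo;
        uint64_t hi = (uint64_t)r[2 * i + 1] + (sq >> 32) + (lo >> 32);
        r[2 * i + 1] = (uint32_t)hi;
        carry = hi >> 32;
    }
}
```

For n limbs this does n(n-1)/2 cross multiplies plus n diagonal ones, against n*n for a general multiply. A vectorized version would keep the same structure and change how the partial products are formed and accumulated.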

> My first step will be to study the only existing AMD64 implementation of
> AES: loop-aes, merged in Linux kernel 2.6.8-rc3 by Brian Gladman.

yeah gladman aes is the way to go ... the gladman code and linux-kernel
variations on it (including x86-64 port) are really well balanced across
cpus, and don't do anything obscene like use huge tables or use %esp as a
temp register (ScienceMark AES windoze benchmark uses %esp as a temp
register! it saves %esp off in a global for the duration... definitely
not safe in portable code, and only works on windoze 'cause they don't
have signal frames on the stack).
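
The signal-frame point can be observed directly. The sketch below assumes a POSIX system with no alternate signal stack configured (no `sigaltstack`); `signal_frame_distance` and `on_signal` are hypothetical names. It raises a signal and compares the address of a local variable in the handler with one in the caller. The handler's frame lands on the same stack, at whatever the stack pointer was when the signal arrived, which is why a stack pointer holding scratch data at that moment would send the frame to an arbitrary address.

```c
#include <signal.h>
#include <stdint.h>

/* Address of a local variable inside the handler, recorded when it runs. */
static volatile uintptr_t handler_local_addr;

static void on_signal(int signo)
{
    volatile int marker = signo;
    handler_local_addr = (uintptr_t)&marker;
}

/*
 * Returns the distance in bytes between a local in this function and a
 * local in the signal handler.  A small distance means the handler ran
 * on the caller's stack.
 */
uintptr_t signal_frame_distance(void)
{
    volatile int caller_marker = 0;
    uintptr_t caller_addr = (uintptr_t)&caller_marker;

    handler_local_addr = 0;
    signal(SIGUSR1, on_signal);
    raise(SIGUSR1);            /* handler returns before raise() does */
    signal(SIGUSR1, SIG_DFL);

    uintptr_t handler_addr = handler_local_addr;
    return caller_addr > handler_addr ? caller_addr - handler_addr
                                      : handler_addr - caller_addr;
}
```

On a typical linux/x86 build the distance is a few kilobytes at most, which is roughly the size of the saved register context plus the handler's own frame.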

i know how i could do a better job natively on efficeon using tables only
twice as large as gladman, but that's also "breaking the rules" :)

-dean