Re: MD5 optimized for AMD64 (+65% speedup)

dean gaudet Mon, 20 Dec 2004 17:56:22 -0800

On Mon, 20 Dec 2004, Marc Bevand wrote:

> SHA-1: Dean already worked on this, using SSE2.


it looks like openssl cvs HEAD now beats my sse2 sha1 code on most 32-bit 
x86 platforms, and roughly ties my sha256 code when that is compiled with 
gcc... nice work Andy.

here's some data i collected saturday -- it's all in cycles per byte 
(lower is better) for an 8192-byte buffer:

                openssl         SSE2            SSE2            SSE2
                cvs head        gcc-cvs         gcc34           icc71

sha1:

p4 model 3       9.59           16.9            21.4            14.2
p4 model 2      10.6            15.4            28.4            13.5
p-m             10.3            15.0            14.4            13.3
k8               8.18           10.4            10.3             8.70
efficeon         9.40            7.1             7.04            6.20

sha256:

p4 model 3      31.8            51.8            38.8            31.3
p4 model 2      38.6            46.5            38.1            39.2
p-m             32.7            34.3            32.0            29.0
k8              25.9            29.2            22.2            21.6
efficeon        27.9            20.9            15.4            16.4

notes:

- openssl cvs head as of 20041218, 32-bit x86 only (even on k8 and p4-3)
- gcc cvs head as of 20041218
- gcc-3.4.2-3 debian package
- intel C compiler 7.1 build 20030307Z
- p4 model 2 is the most widespread p4; p4 model 3 covers only the recent
  chips -- all EM64T parts are p4 model 3
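For readers who think in throughput rather than cycles per byte, the conversion is a single division. The sketch below is only illustrative: the post gives no clock speeds, so the 2 GHz value and the names `throughput_mib_per_s` and `ASSUMED_CLOCK_HZ` are assumptions, not measurements.

```python
def throughput_mib_per_s(cycles_per_byte, clock_hz):
    """Bytes hashed per second at the given clock rate, expressed in MiB/s."""
    bytes_per_second = clock_hz / cycles_per_byte
    return bytes_per_second / (1024 * 1024)


# Hypothetical clock rate, chosen only for illustration (not from the post).
ASSUMED_CLOCK_HZ = 2.0e9

# k8 sha1 figures copied from the table above.
for label, cpb in [("openssl cvs head", 8.18), ("SSE2 icc71", 8.70)]:
    rate = throughput_mib_per_s(cpb, ASSUMED_CLOCK_HZ)
    print(f"{label}: {rate:.0f} MiB/s at the assumed clock")
```

Because the clock rate cancels out when two results from the same machine are compared, the cycles-per-byte ratios in the table hold whatever the real frequency was.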

notice how all over the map gcc is.  efficeon generally doesn't care much 
because we reschedule/reoptimize everything anyhow.  but lacking any sort 
of sane results from gcc i'm not going to advocate even my sha256 code for 
general consumption yet.
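One way to see the gcc spread is to divide the gcc34 column by the gcc-cvs column for each row. The sketch below does that with the numbers copied from the table; the names `SSE2_GCC` and `gcc34_vs_gcc_cvs` are made up for this illustration.

```python
# Cycles per byte for the SSE2 code, copied from the table: (gcc-cvs, gcc34).
SSE2_GCC = {
    ("sha1", "p4 model 3"): (16.9, 21.4),
    ("sha1", "p4 model 2"): (15.4, 28.4),
    ("sha1", "p-m"): (15.0, 14.4),
    ("sha1", "k8"): (10.4, 10.3),
    ("sha1", "efficeon"): (7.1, 7.04),
    ("sha256", "p4 model 3"): (51.8, 38.8),
    ("sha256", "p4 model 2"): (46.5, 38.1),
    ("sha256", "p-m"): (34.3, 32.0),
    ("sha256", "k8"): (29.2, 22.2),
    ("sha256", "efficeon"): (20.9, 15.4),
}


def gcc34_vs_gcc_cvs(results):
    """Ratio gcc34 / gcc-cvs per row; above 1.0 means the gcc-cvs build is faster."""
    return {key: gcc34 / gcc_cvs for key, (gcc_cvs, gcc34) in results.items()}


ratios = gcc34_vs_gcc_cvs(SSE2_GCC)
for (algo, cpu), ratio in ratios.items():
    faster = "gcc-cvs" if ratio > 1.0 else "gcc34"
    print(f"{algo:8s}{cpu:12s}{ratio:5.2f}  ({faster} build is faster)")
```

The ratio lands on both sides of 1.0 depending on the cpu and the hash, so neither gcc version is a consistent winner.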

in case anyone wants the code it's at 
<http://arctic.org/~dean/crypto/sha-sse2-20041218.tar.bz2>


> RSA: The compiler already does a good job with 64-bit arithmetic.

there's potential to improve DSA on 32-bit x86 with further application of 
the sse2 techniques already committed to 32-bit x86 for RSA ... basically 
the special-purpose bn squaring code can be extended as well.  it'll still 
be nowhere near as good as 64-bit native though, interesting only to folks 
still on 32-bit platforms.
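As background on why bignum libraries carry a special-purpose squaring routine at all: when squaring, each cross product a[i]*a[j] equals a[j]*a[i], so it can be computed once and added twice. The sketch below shows only that general idea with 32-bit limbs. It is not OpenSSL's bn code and uses no SSE2; `LIMB_BITS`, `to_limbs` and `square_limbs` are hypothetical names.

```python
LIMB_BITS = 32
LIMB_MASK = (1 << LIMB_BITS) - 1


def to_limbs(x, n):
    """Split a non-negative integer into n little-endian 32-bit limbs."""
    return [(x >> (LIMB_BITS * i)) & LIMB_MASK for i in range(n)]


def square_limbs(a):
    """Square a limb array; return (result as int, limb multiplications used)."""
    n = len(a)
    acc = 0
    mults = 0
    for i in range(n):
        # Diagonal term a[i]^2 appears exactly once.
        acc += (a[i] * a[i]) << (2 * LIMB_BITS * i)
        mults += 1
        for j in range(i + 1, n):
            # Cross term: one multiplication, counted twice in the sum.
            acc += (2 * a[i] * a[j]) << (LIMB_BITS * (i + j))
            mults += 1
    return acc, mults


x = 0x0123456789ABCDEF0FEDCBA987654321  # arbitrary 128-bit test value
result, mults = square_limbs(to_limbs(x, 4))
print(result == x * x, mults)
```

For n limbs this uses n(n+1)/2 limb multiplications instead of the n*n a general multiply needs, which is the saving a dedicated squaring path is after.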


> My first step will be to study the only existing AMD64 implementation of
> AES: loop-aes, merged in Linux kernel 2.6.8-rc3 by Brian Gladman.

yeah gladman aes is the way to go ... the gladman code and linux-kernel 
variations on it (including x86-64 port) are really well balanced across 
cpus, and don't do anything obscene like use huge tables or use %esp as a 
temp register (ScienceMark AES windoze benchmark uses %esp as a temp 
register!  it saves %esp off in a global for the duration... definitely 
not safe in portable code, and only works on windoze 'cause they don't 
have signal frames on the stack).

i know how i could do a better job natively on efficeon using tables only 
twice as large as gladman, but that's also "breaking the rules" :)

-dean
______________________________________________________________________
OpenSSL Project                                 http://www.openssl.org
Development Mailing List                       [EMAIL PROTECTED]
Automated List Manager                           [EMAIL PROTECTED]
