- versions with SSE2/SSE3 support (if OPENSSL_ia32cap signals a valid processor), reducing the number of asm instructions within a loop to 16 with a 4k table, and to 26 with a 256byte table (w/o: 34 and 62),

Are these numbers really per loop spin, i.e. per every 8 and 4 bits respectively, or per byte for either? In another message you wrote that you observe "over 50% improvement" with SSE2 code. On which platform? Which compiler? Etc.

I've sketched *32-bit* integer and MMX (yes, pure MMX) versions of gcm_gmult_4bit, i.e. the one operating with a 256-byte table. The MMX code was observed to process one byte in ~35 cycles on P4(*) and in ~22 cycles on Core2 and Opteron, which is ~2-3x faster than the code generated by gcc. Compared to the integer assembler, the MMX code was observed to be ~35% faster on Core2 and Opteron and 2.5x faster on P4. The latter is because I chose shrd for the integer assembler, and it just kills P4.

(*) CPU oscillator *cycles* per byte, not instructions. In case you wonder, I have 25 instructions per *byte* in the MMX loop and 27 in the integer loop.
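
For readers who want to see what the 4-bit method amounts to, here is a minimal portable C sketch. It is not the MMX or integer assembler mentioned above, only an illustration of the same idea: a table of 16 entries of 16 bytes each (256 bytes), consumed one nibble at a time, so two table steps per byte. The helper names (load_be64, store_be64, shift1, gcm_init_4bit, gcm_gmult_1bit) and the byte-array interface are assumptions made for this example. gcm_gmult_1bit is a plain bit-by-bit multiply included only for cross-checking.

```c
#include <stdint.h>

typedef struct { uint64_t hi, lo; } u128;

/* Reduction terms for the 4 bits shifted out per step; they are XORed
 * into the top 16 bits of the high word (0xE100 is the GCM polynomial). */
static const uint16_t rem_4bit[16] = {
    0x0000, 0x1C20, 0x3840, 0x2460, 0x7080, 0x6CA0, 0x48C0, 0x54E0,
    0xE100, 0xFD20, 0xD940, 0xC560, 0x9180, 0x8DA0, 0xA9C0, 0xB5E0
};

static uint64_t load_be64(const uint8_t *p) {
    uint64_t v = 0;
    for (int i = 0; i < 8; i++) v = (v << 8) | p[i];
    return v;
}

static void store_be64(uint8_t *p, uint64_t v) {
    for (int i = 7; i >= 0; i--) { p[i] = (uint8_t)v; v >>= 8; }
}

/* Multiply by x: shift right one bit, folding a dropped bit back in. */
static u128 shift1(u128 v) {
    uint64_t t = 0xE100000000000000ULL & ((uint64_t)0 - (v.lo & 1));
    v.lo = (v.hi << 63) | (v.lo >> 1);
    v.hi = (v.hi >> 1) ^ t;
    return v;
}

/* Build the 16-entry table: T[8] = H, T[4] = H*x, T[2] = H*x^2,
 * T[1] = H*x^3, and every other entry is the XOR of those. */
static void gcm_init_4bit(u128 T[16], const uint8_t H[16]) {
    u128 v = { load_be64(H), load_be64(H + 8) };
    T[0].hi = T[0].lo = 0;
    T[8] = v;
    v = shift1(v); T[4] = v;
    v = shift1(v); T[2] = v;
    v = shift1(v); T[1] = v;
    for (int i = 2; i < 16; i <<= 1)
        for (int j = 1; j < i; j++) {
            T[i + j].hi = T[i].hi ^ T[j].hi;
            T[i + j].lo = T[i].lo ^ T[j].lo;
        }
}

/* Xi = Xi * H, walking Xi from the last byte to the first,
 * low nibble before high nibble (Horner's rule in x^4). */
static void gcm_gmult_4bit(uint8_t Xi[16], const u128 T[16]) {
    u128 z = { 0, 0 };
    for (int i = 15; i >= 0; i--) {
        for (int half = 0; half < 2; half++) {
            unsigned nib = half ? (unsigned)(Xi[i] >> 4) : (unsigned)(Xi[i] & 0xf);
            unsigned rem = (unsigned)(z.lo & 0xf);
            z.lo = (z.hi << 60) | (z.lo >> 4);
            z.hi = (z.hi >> 4) ^ ((uint64_t)rem_4bit[rem] << 48);
            z.hi ^= T[nib].hi;
            z.lo ^= T[nib].lo;
        }
    }
    store_be64(Xi, z.hi);
    store_be64(Xi + 8, z.lo);
}

/* Bit-by-bit multiply, for cross-checking the table version only. */
static void gcm_gmult_1bit(uint8_t Xi[16], const uint8_t H[16]) {
    u128 z = { 0, 0 }, v = { load_be64(H), load_be64(H + 8) };
    for (int i = 0; i < 128; i++) {
        if ((Xi[i / 8] >> (7 - i % 8)) & 1) { z.hi ^= v.hi; z.lo ^= v.lo; }
        v = shift1(v);
    }
    store_be64(Xi, z.hi);
    store_be64(Xi + 8, z.lo);
}
```

Typical use would be one gcm_init_4bit call per key and one gcm_gmult_4bit call per 16-byte block. The per-byte cost being compared in this thread is then the two inner-loop iterations executed for each byte of Xi.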

A.
______________________________________________________________________
OpenSSL Project                                 http://www.openssl.org
Development Mailing List                       [email protected]
Automated List Manager                           [email protected]
