> But I'm curious, why there is such a drop in performance of asm code and

In the C case the unrolled loop is entered for lengths of 8 bytes and
beyond. In assembler the optimized loop is engaged for lengths larger
than 32. One should keep in mind that RC4 is very sensitive to
architectural characteristics. The fact that the C code was faster had
less to do with the quality of compiler-generated code than with the
fact that pre-Sandy Bridge hardware was "confusing" itself on long
blocks. As mentioned in rc4-586.pl, performance vs. block size had
[quite a] maximum at 64 bytes. The assembler then performed worse
because it was using a compact byte-based key schedule, while C used a
word-based one, and the degree of "confusion" was inversely
proportional to key schedule size.
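
To make the byte-based vs. word-based distinction concrete, here is a
minimal sketch of RC4 with both state layouts. This is not the OpenSSL
code: the names rc4_byte_key, rc4_word_key and the functions below are
hypothetical, and the loops are deliberately not unrolled. It only shows
that the two layouts produce the same keystream while differing in
key schedule size; it does not reproduce any timing behaviour.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layouts: same algorithm, different element width. */
typedef struct { uint8_t  x, y, S[256]; } rc4_byte_key; /* compact */
typedef struct { uint32_t x, y, S[256]; } rc4_word_key; /* 4x larger */

/* Standard RC4 key setup on the byte-based state. */
static void rc4_byte_setup(rc4_byte_key *k, const uint8_t *key, size_t len)
{
    unsigned i, j = 0;

    for (i = 0; i < 256; i++)
        k->S[i] = (uint8_t)i;
    for (i = 0; i < 256; i++) {
        uint8_t t = k->S[i];
        j = (j + t + key[i % len]) & 0xff;
        k->S[i] = k->S[j];
        k->S[j] = t;
    }
    k->x = k->y = 0;
}

/* Encrypt/decrypt n bytes with the byte-based state. */
static void rc4_byte_crypt(rc4_byte_key *k, const uint8_t *in,
                           uint8_t *out, size_t n)
{
    unsigned x = k->x, y = k->y;

    while (n--) {
        uint8_t tx, ty;
        x = (x + 1) & 0xff;
        tx = k->S[x];
        y = (y + tx) & 0xff;
        ty = k->S[y];
        k->S[x] = ty;               /* swap S[x] and S[y] */
        k->S[y] = tx;
        *out++ = *in++ ^ k->S[(tx + ty) & 0xff];
    }
    k->x = (uint8_t)x;
    k->y = (uint8_t)y;
}

/* Same key setup, but every state element occupies a 32-bit word. */
static void rc4_word_setup(rc4_word_key *k, const uint8_t *key, size_t len)
{
    uint32_t i, j = 0;

    for (i = 0; i < 256; i++)
        k->S[i] = i;
    for (i = 0; i < 256; i++) {
        uint32_t t = k->S[i];
        j = (j + t + key[i % len]) & 0xff;
        k->S[i] = k->S[j];
        k->S[j] = t;
    }
    k->x = k->y = 0;
}

/* Encrypt/decrypt n bytes with the word-based state. */
static void rc4_word_crypt(rc4_word_key *k, const uint8_t *in,
                           uint8_t *out, size_t n)
{
    uint32_t x = k->x, y = k->y;

    while (n--) {
        uint32_t tx, ty;
        x = (x + 1) & 0xff;
        tx = k->S[x];
        y = (y + tx) & 0xff;
        ty = k->S[y];
        k->S[x] = ty;               /* swap S[x] and S[y] */
        k->S[y] = tx;
        *out++ = (uint8_t)(*in++ ^ k->S[(tx + ty) & 0xff]);
    }
    k->x = x;
    k->y = y;
}
```

The byte-based state occupies 258 bytes and the word-based one 1032
bytes, which is the "key schedule size" difference referred to above.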

> what can be done to address that issue?

As Peter implied, the question is whether it's worth the effort. Say you
improve small block performance by 60%. But if the operation in question
takes only 10% of the total time, then the net effect would be 6%. Well,
I can have a look, but please don't hold your breath.
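
The arithmetic behind that estimate can be sketched as follows. The
function names are hypothetical and the two readings of "60%" are my
assumption: if it means the operation takes 60% less time, the saving
is the plain product, 0.10 * 0.60 = 6% of total time; if it means 60%
more throughput, the saving comes out somewhat smaller, about 3.75%.

```c
/* Fraction of total run time saved when an operation that accounts for
 * `share` of the run time has its own time cut by `cut` (0.60 = 60%). */
static double saved_if_time_cut(double share, double cut)
{
    return share * cut;
}

/* Same question when the operation's throughput rises by `gain`
 * (0.60 = 60% faster), so its time shrinks to 1 / (1 + gain). */
static double saved_if_throughput_up(double share, double gain)
{
    return share * (1.0 - 1.0 / (1.0 + gain));
}
```

Either way the overall gain is bounded by the 10% share of the
operation itself, no matter how much it is sped up.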
______________________________________________________________________
OpenSSL Project                                 http://www.openssl.org
Development Mailing List                       openssl-dev@openssl.org
Automated List Manager                           majord...@openssl.org
