> But I'm curious, why there is such a drop in performance of asm code and
In the C case the unrolled loop is entered for lengths of 8 bytes and beyond. In assembler the optimized loop is engaged for lengths larger than 32. One should keep in mind that RC4 is very sensitive to architectural characteristics. The fact that the C code was faster had less to do with the quality of compiler-generated code than with the fact that pre-Sandy Bridge hardware was "confusing" itself on long blocks. As mentioned in rc4-586.pl, performance vs. block size had [quite a] maximum at 64 bytes. The assembler then performed worse because it was using a compact byte-based key schedule while C used a word-based one, and the degree of "confusion" was in inverse proportion to key schedule size.

> what can be done to address that issue?

As Peter implied, the question is whether it's worth the effort. Say you improve small-block performance by 60%. But if the operation in question takes only 10% of the total time, then the net effect would be 6%. Well, I can have a look, but please don't hold your breath.

______________________________________________________________________
OpenSSL Project                                 http://www.openssl.org
Development Mailing List                       openssl-dev@openssl.org
Automated List Manager                           majord...@openssl.org
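
For reference, the back-of-the-envelope estimate in the reply (a 60% gain on an operation that accounts for 10% of the total) can be checked with a few lines of Python. This is only a sketch: the function names are hypothetical, and the 10% and 60% figures are the illustrative numbers from the message, not measurements. The result depends on how "60%" is read. If the operation's time is cut by 60%, the saving is the quoted 6%. If the operation runs at 1.6 times its old speed, the saving is smaller.

```python
def saving_from_time_cut(fraction: float, time_cut: float) -> float:
    """Share of total run time saved when a part that takes `fraction`
    of the total has its own run time reduced by `time_cut` (0..1)."""
    return fraction * time_cut


def saving_from_speedup(fraction: float, speedup: float) -> float:
    """Share of total run time saved when a part that takes `fraction`
    of the total runs `speedup` times faster (Amdahl-style estimate)."""
    new_total = (1.0 - fraction) + fraction / speedup
    return 1.0 - new_total


if __name__ == "__main__":
    # Illustrative figures from the message: 10% of the time, 60% better.
    print(f"time cut by 60%:  {saving_from_time_cut(0.10, 0.60):.2%}")
    print(f"1.6x throughput:  {saving_from_speedup(0.10, 1.6):.2%}")
```

Either way the overall gain stays in the low single digits, which is the point of asking whether the effort is worth it.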