>> In case you consider submitting assembler code, there are a couple of
>> requirements that have to be met. Inline assembler (or exotic intrinsics)
>> is not considered a viable option for MMX/SSE (or any code bigger than
>> a couple of instructions); perlasm code is.
> 
> As they are available in the MSC, Intel and GCC compilers, I have
> implemented it with intrinsics.

Well, they are not available in *all* MSC and GCC versions, are they?
Another reason for favoring real assembler is that it's not uncommon
to find yourself at the compiler's mercy to produce efficient code, and
performance can vary significantly from version to version. As we aim to
support quite a range of developer environments, there is no reason to
"punish" users with "wrong" compiler versions. Frankly, it's troublesome
enough to even identify "wrong"/"right" compiler versions.

> Sorry, did not know of this OpenSSL policy. The current 
> construction is:
> 
> #if defined(_MSC_VER) && (defined(_M_IX86)  || defined(_M_AMD64)  || defined(_M_X64))     || \
>     defined(__GNUC__) && (defined(__i386__) || defined(__amd64__) || defined(__x86_64__)) || \
>     defined(__ICL)    ||  defined(_EMM_FUNCTIONALITY)
> #define EMM_INTRINSICS
> #endif
> 
> #define X86_CPUID_BIT_SSE2 0x04000000
> 
> #ifdef EMM_INTRINSICS
>   if(X86_CPUID_BIT_SSE2 & OPENSSL_ia32cap)
>   {
>     /* SSE2 version */
>   }
>   else
> #endif
>   {
>     /* classic version */
>   }
> 
> This should compile and execute on any system.

"Should" is not part of the vocabulary in this context; "does" is, and
the only way to ensure "does" in the long run is perlasm.
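
To make the gap between "should" and "does" concrete, here is a minimal
sketch of the quoted dispatch as compilable C. It is an illustration under
stated assumptions, not the submitted patch: `xor_block16` and its `caps`
parameter are hypothetical stand-ins (`caps` plays the role of
OPENSSL_ia32cap), and the guard is keyed on whether the compiler actually
has SSE2 enabled (`__SSE2__`, `_M_X64`, `_M_IX86_FP`) rather than on the
compiler's identity alone. A 32-bit GCC build without -msse2 would match
the quoted `__GNUC__ && __i386__` test and still may not be able to compile
the intrinsics.

```c
#include <stddef.h>
#include <stdint.h>

/* Enable the intrinsic path only when the compiler itself targets SSE2. */
#if defined(__SSE2__) || defined(_M_X64) || defined(_M_AMD64) || \
    (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
# include <emmintrin.h>
# define EMM_INTRINSICS
#endif

#define X86_CPUID_BIT_SSE2 0x04000000u  /* CPUID leaf 1, EDX bit 26 */

/* XOR the 16-byte block src into dst.
 * `caps` is a hypothetical capability word supplied by the caller; it
 * stands in for OPENSSL_ia32cap in the quoted construction. */
static void xor_block16(uint8_t dst[16], const uint8_t src[16], uint32_t caps)
{
#ifdef EMM_INTRINSICS
    if (caps & X86_CPUID_BIT_SSE2) {
        /* SSE2 version: one unaligned 128-bit load/xor/store. */
        __m128i a = _mm_loadu_si128((const __m128i *)dst);
        __m128i b = _mm_loadu_si128((const __m128i *)src);
        _mm_storeu_si128((__m128i *)dst, _mm_xor_si128(a, b));
    } else
#endif
    {
        /* Classic version: portable byte loop. */
        for (size_t i = 0; i < 16; i++)
            dst[i] ^= src[i];
    }
#ifndef EMM_INTRINSICS
    (void)caps;  /* unused when the intrinsic path is compiled out */
#endif
}
```

Calling `xor_block16(buf, key, 0)` forces the classic path, while passing
`X86_CPUID_BIT_SSE2` selects the intrinsic path where it was compiled in.
Both must produce identical bytes, and how fast the intrinsic path runs is
still up to the compiler.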

>>> - replacement of allocated tables by local (stack) tables (as the table 
>>> generation is now faster than the overhead for an alloc),
>> Good idea if we settle for "one-shot" interface...
> 
> Even for one block, a local 256-byte table is much faster than IBM's 
> GCM_mult_noaccel() or B.Gladman's "slow field multiplier", both with 128 
> unpredictable branches.

But we have to weigh it against our own implementation(s), not somebody
else's...
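
For readers following along, the two approaches being compared can be
sketched in plain C. This is a rough illustration only and neither the
proposed patch nor OpenSSL's code: the type `gf128` and the functions
`gf128_mul_x`, `gf128_mul_bitwise` and `gf128_mul_table4` are hypothetical
names, no SSE2 is used, and the multiply-by-x^4 step is done as four
one-bit shifts where a tuned implementation would presumably use a small
reduction lookup. The bitwise routine has one data-dependent branch per bit
(128 per block); the table routine builds a 16-entry, 256-byte table on the
stack and then consumes the input four bits at a time.

```c
#include <stdint.h>

/* A GF(2^128) element in GCM bit order: the most significant bit of hi
 * is the coefficient of x^0, the least significant bit of lo is x^127. */
typedef struct { uint64_t hi, lo; } gf128;

/* Multiply by x: shift right one bit and fold the dropped x^127 term
 * back in with the GCM constant R = 0xE1 followed by 120 zero bits. */
static gf128 gf128_mul_x(gf128 v)
{
    uint64_t mask = 0 - (v.lo & 1);        /* all ones if x^127 was set */
    v.lo = (v.lo >> 1) | (v.hi << 63);
    v.hi = (v.hi >> 1) ^ (mask & UINT64_C(0xE100000000000000));
    return v;
}

/* Bit-by-bit multiply: one data-dependent branch per input bit. */
static gf128 gf128_mul_bitwise(gf128 x, gf128 y)
{
    gf128 z = { 0, 0 };
    for (int i = 0; i < 128; i++) {
        uint64_t word = (i < 64) ? x.hi : x.lo;
        if ((word >> (63 - (i & 63))) & 1) {
            z.hi ^= y.hi;
            z.lo ^= y.lo;
        }
        y = gf128_mul_x(y);
    }
    return z;
}

/* 4-bit table multiply with a local (stack) table rebuilt on each call. */
static gf128 gf128_mul_table4(gf128 x, gf128 h)
{
    gf128 t[16];                           /* 16 * 16 bytes = 256 bytes */
    gf128 z = { 0, 0 };

    t[0].hi = t[0].lo = 0;
    t[8] = h;                              /* nibble 1000b is h * x^0 */
    t[4] = gf128_mul_x(t[8]);              /* nibble 0100b is h * x^1 */
    t[2] = gf128_mul_x(t[4]);
    t[1] = gf128_mul_x(t[2]);
    for (int i = 2; i < 16; i <<= 1)       /* remaining entries by XOR */
        for (int j = 1; j < i; j++) {
            t[i + j].hi = t[i].hi ^ t[j].hi;
            t[i + j].lo = t[i].lo ^ t[j].lo;
        }

    /* Horner's rule over the 32 nibbles, highest degree first. */
    for (int k = 31; k >= 0; k--) {
        uint64_t word = (k < 16) ? x.hi : x.lo;
        unsigned nibble = (unsigned)(word >> (60 - 4 * (k & 15))) & 0xF;
        for (int s = 0; s < 4; s++)
            z = gf128_mul_x(z);            /* z *= x^4 */
        z.hi ^= t[nibble].hi;
        z.lo ^= t[nibble].lo;
    }
    return z;
}
```

Both routines compute the same product, so the bitwise one can serve as a
reference when checking any faster variant. Whether rebuilding the table
per call actually pays off is a benchmarking question that a sketch like
this cannot settle.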

> And with SSE2, the table build is so efficient 
> that it does not consume more cycles than one block. So it will also be 
> suited for more than one-shot applications.

Cool. But once again, I have benchmarking left to do, so we have to
postpone this particular topic...

A.
______________________________________________________________________
OpenSSL Project                                 http://www.openssl.org
Development Mailing List                       openssl-dev@openssl.org
Automated List Manager                           majord...@openssl.org
