Hi,

> Due to the complete missing of an optimization in the IBM proposal 
> I am currently working on a GCM version as well. My current work 
> includes: 
> 
> - EVP support for the CTR128 modes *1) (AES and Camellia),

AES counter mode was recently added, see
http://cvs.openssl.org/chngview?cn=19314. One note for reference: in
crypto/evp/e_aes.c you can find the static aes_counter() function, which
*could* have been implemented as

        CRYPTO_ctr128_encrypt (in,out,len,
                &((EVP_AES_KEY *)ctx->cipher_data)->ks,
                ctx->iv,ctx->buf,&ctx->num,AES_encrypt);

Instead it's implemented as an explicit call to

        AES_ctr128_encrypt (in,out,len,
                &((EVP_AES_KEY *)ctx->cipher_data)->ks,
                ctx->iv,ctx->buf,&ctx->num);

where AES_ctr128_encrypt in turn calls CRYPTO_ctr128_encrypt. The reason
for this is to *reserve* the option of an assembler implementation of
AES_ctr128_encrypt. And the reservation is there solely because AES is a
popular algorithm. For other algorithms, e.g. Camellia, I'd insist on
calling CRYPTO_ctr128_encrypt directly from EVP. The same applies to
CRYPTO_gcm128_*, etc.
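
To make the layering concrete, here is a minimal, self-contained sketch
of the pattern. It is not the OpenSSL code: demo_ctr128_encrypt,
toy_encrypt and toy_ctr128_encrypt are hypothetical names, and the toy
"cipher" is a plain XOR with the key, chosen only so that the example
compiles without a real block cipher. The parameter list simply mirrors
the calls quoted above.

```c
#include <stddef.h>

/* Generic 128-bit block function type (hypothetical, for this sketch). */
typedef void (*block_fn)(const unsigned char in[16],
                         unsigned char out[16], const void *key);

/* Increment a 128-bit big-endian counter by one. */
static void ctr128_inc(unsigned char counter[16])
{
    int i = 16;

    while (i-- > 0) {
        if (++counter[i] != 0)
            break;              /* no carry into the next byte */
    }
}

/* Generic CTR routine: the cipher is passed in as a callback. */
void demo_ctr128_encrypt(const unsigned char *in, unsigned char *out,
                         size_t len, const void *key,
                         unsigned char ivec[16],
                         unsigned char ecount_buf[16],
                         unsigned int *num, block_fn block)
{
    unsigned int n = *num;      /* offset into the current key stream block */

    while (len--) {
        if (n == 0) {           /* key stream exhausted, produce a new block */
            block(ivec, ecount_buf, key);
            ctr128_inc(ivec);
        }
        *out++ = *in++ ^ ecount_buf[n];
        n = (n + 1) % 16;
    }
    *num = n;
}

/* Toy stand-in for a block cipher: XOR with a 16-byte key. NOT secure. */
void toy_encrypt(const unsigned char in[16], unsigned char out[16],
                 const void *key)
{
    const unsigned char *k = key;
    int i;

    for (i = 0; i < 16; i++)
        out[i] = in[i] ^ k[i];
}

/* Algorithm-specific wrapper: today it only forwards to the generic
 * routine, but its body could later be replaced by a dedicated
 * (e.g. assembler) implementation without touching any caller. */
void toy_ctr128_encrypt(const unsigned char *in, unsigned char *out,
                        size_t len, const void *key,
                        unsigned char ivec[16],
                        unsigned char ecount_buf[16], unsigned int *num)
{
    demo_ctr128_encrypt(in, out, len, key, ivec, ecount_buf, num,
                        toy_encrypt);
}
```

An algorithm without such a reservation would skip the wrapper and pass
its block function to the generic routine directly.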

> as these 
> are required in the GCTR [SP800-38D] function of the GCM (instead of 
> a block-wise use of the ECB mode),

???

> - replacement of the byte-wise shift/xor/swap loops by using platform 
> selective enrolled macros for BE/LE 8/16/32/64 bit architectures,

Done in crypto/modes/gcm128.c for 32/64-bit architectures. As for
"narrower" platforms, support is discontinued in 1.0, and there is no
reason to believe that it was actually operational in 0.9.8 or even
earlier. In other words, forget 16 bits.

> - versions with SSE2/SSE3 support (if OPENSSL_ia32cap signals a valid 
> processor), reducing the number of asm instructions within a loop to 
> 16 with a 4k table, and to 26 with a 256byte table (w/o: 34 and 62),

I was considering evaluating MMX/SSE at some point too. I have to
mention that the vision is to keep SSE support at the absolutely required
minimum. I mean, if there is an SSEn+1 instruction that does something, I'd
rather *not* deploy it *unless* doing so gives at least a 30% improvement
[over SSEn]. In other words, absolute performance score is not the sole
goal; *versatility* is taken into consideration as well. In case you
consider submitting assembler code, there are a couple of requirements that
have to be met. Inline assembler (or exotic intrinsics) is not considered
a viable option for MMX/SSE (or any code bigger than a couple of
instructions); perlasm code is. Code has to be position-independent.

> - replacement of allocated tables by local (stack) tables (as the table 
> generation is now faster than the overhead for an alloc),

Good idea if we settle for "one-shot" interface...

> removal of the 
> 64k table mode (as it is inefficient due to cache misses), removal of 
> the 8k table mode (takes more instructions in the loop than an optimized 
> 4k table),

As for table sizes, I'm not prepared to discuss them yet; benchmarking is
still to be done... Though the 64K result is not very surprising... nor
the 8K one... But now we know, thanks! As already implied, you can find
4K, 256B and no-additional-table implementations in the current gcm128.c.
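
For readers unfamiliar with the table-free variant: conceptually it is
the bit-by-bit multiplication in GF(2^128) described in [SP800-38D].
The following is a minimal sketch of that textbook algorithm, not the
code from gcm128.c; the function name gf128_mul_bitwise is made up for
this example, and no attempt is made at speed or constant-time
behaviour.

```c
#include <stdint.h>
#include <string.h>

/* z = x * y in GF(2^128), using GCM's bit ordering (the most significant
 * bit of byte 0 is the coefficient of x^0). z may alias x or y. */
void gf128_mul_bitwise(const uint8_t x[16], const uint8_t y[16],
                       uint8_t z[16])
{
    uint8_t v[16], acc[16] = {0};
    int i, j;

    memcpy(v, y, 16);
    for (i = 0; i < 128; i++) {
        /* If bit i of x is set, accumulate the current multiple of y. */
        if ((x[i / 8] >> (7 - i % 8)) & 1) {
            for (j = 0; j < 16; j++)
                acc[j] ^= v[j];
        }
        /* v = v * x: shift right by one bit and, if a bit fell off the
         * end, reduce by R = 0xE1 followed by 120 zero bits. */
        {
            int lsb = v[15] & 1;

            for (j = 15; j > 0; j--)
                v[j] = (uint8_t)((v[j] >> 1) | (v[j - 1] << 7));
            v[0] >>= 1;
            if (lsb)
                v[0] ^= 0xE1;
        }
    }
    memcpy(z, acc, 16);
}
```

The table-driven variants trade memory for fewer loop iterations by
precomputing multiples of the hash key, which is where the table-size
question comes from.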

> - better execution of multiple blocks within GCTR and GHASH [SP800-38D] 
> to optimize the use of local tables.
> 
> 
> To be done:
> 
> - a SSSE3/PSHUFB version (currently do not have a suited processor),

See above.

> - a PCLMULDQ version (same as above),

PCLMULQDQ support will be added to the AESNI engine (see
crypto/engine/eng_aesni.c and crypto/aes/asm/aesni-x86*).

> - redesign of the EVP interface.

I have no comment on this for the moment; thinking is still to be done...

> It would be nice if we could bring our parts together to build and  
> test a version having all advantages together.

Absolutely.

> *1)Though the CTR128 modes have a full 128-bit counter and [SP800-38D] 
> specifies a 32-bit counter at LSB, there is a defined limit of 64 gigabytes 
> per invocation, which effectively prevents a counter overflow into the 
> 33rd bit. So, using the CTR128 instead of a CTR32 is possible.

Only when the IV length is 96 bits, because only then do the least
significant 32 bits of the initial counter value contain a small value,
1 to be specific. Otherwise, i.e. for IV lengths other than 96 bits, the
initial counter value is the GHASH of the IV, so there is no guarantee
that the value in those 32 bits is low enough to "accommodate" the
packet. This is basically the reason why gcm128.c has its "own" counter
implementation, instead of relying on CRYPTO_ctr128_encrypt. But nothing
is written in stone, and implementing, say, CRYPTO_ctr128_32_encrypt is
not ruled out [yet]... A.
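
The difference between the two counters can be shown with a small
sketch. The function names ctr32_inc and ctr128_inc are chosen for this
example only and are not taken from gcm128.c. The first increments only
the last four bytes of the counter block, wrapping modulo 2^32 as
[SP800-38D] specifies; the second lets the carry propagate through all
16 bytes.

```c
#include <stdint.h>

/* GCM-style increment: only the last 4 bytes (big-endian) change, and
 * the value wraps modulo 2^32 without touching the upper 96 bits. */
void ctr32_inc(uint8_t counter[16])
{
    int i;

    for (i = 15; i >= 12; i--) {
        if (++counter[i] != 0)
            break;
    }
}

/* Full 128-bit big-endian increment: a carry out of the low 32 bits
 * propagates into the upper 96 bits. */
void ctr128_inc(uint8_t counter[16])
{
    int i;

    for (i = 15; i >= 0; i--) {
        if (++counter[i] != 0)
            break;
    }
}
```

Starting from a block whose low 32 bits are 1, the two agree for any
permitted message length. Starting from a block whose low 32 bits are
0xffffffff, one step leaves byte 11 untouched with ctr32_inc but bumps
it with ctr128_inc, so the key streams diverge from there on.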
______________________________________________________________________
OpenSSL Project                                 http://www.openssl.org
Development Mailing List                       openssl-dev@openssl.org
Automated List Manager                           majord...@openssl.org
