I'm afraid I was not clear enough in my reporting.
>And indeed,
> committed code performs
> virtually as fast as ECB [naturally on aligned
> input].
1. At the time of my last post I had not yet noticed
your patch. My confidence in vector intrinsics came
from the fact that I got my best results using the
unsigned long long data type, so I hoped that a 128-bit
data type on an SSE2 processor would do even better
(due to the absence of the extra check in the loop and
whole-block load/store). Now that I have checked this
thoroughly, I'm actually very happy with your patch.
2. About my testing: I was actually testing the
performance of this I/O loop:
{
sz = read(fd_in, buf, buf_size);
AES_cfb128_encrypt(buf, buf, sz,...);
write(fd_out, buf, sz);
}
using different block sizes, as I found that
cfb128_encrypt has a considerable effect on throughput.
I used a large input (~100MB) and many runs. I tested
it on an old 1.6GHz P4 (pre-SSE2 model). With a block
size of 32k I found that stock openssl runs at 36/30
MB/sec (enc/dec), your new implementation at 44/41
MB/sec (my gcc intrinsics attempt being slightly
slower), and my "unsigned long long" version at 51/50
MB/sec (everything else being the same). Later, I
tested the whole batch on the Turion64 and found that
SSE2 gives no advantage over your code, mainly thanks
to size_t being 64 bits wide (I tested an SSE2 inline
assembler implementation too).
I also found that the I/O loop performs better than
enc/dec of a large memory block (size > 1M), for some
reason. Enc/dec of short memory blocks (say 32k) gives
a higher rate, of course, but I prefer the sustained
rate as a measure.
My initially posted estimate was based on the 50MB/30MB
figure I obtained when switching the decryption code to
use "unsigned long long" (my first attempt). When I
switched to vector code, hoping to get some gain from
SSE2, I had to lower my estimate to the observed
41MB/30MB.
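
To make the "unsigned long long" idea concrete, here is a minimal
sketch of what I mean: the CFB-128 decryption XOR done in two 64-bit
words per block instead of sixteen single bytes. This is not the code
from the patch or from my test build. toy_block_encrypt is a
hypothetical placeholder, NOT AES, so that the sketch compiles without
OpenSSL, and all function names here are made up for illustration. It
also handles whole 16-byte blocks only.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK 16

/* Hypothetical stand-in for the block cipher. This is NOT AES; it only
 * produces a deterministic 16-byte keystream block from the IV. */
static void toy_block_encrypt(const unsigned char in[BLOCK],
                              unsigned char out[BLOCK])
{
    for (int i = 0; i < BLOCK; i++)
        out[i] = (unsigned char)(in[(i + 1) % BLOCK] * 31u + 0x5au + (unsigned)i);
}

/* CFB-128 encryption, byte by byte (whole blocks only). */
static void cfb_encrypt_bytes(const unsigned char *in, unsigned char *out,
                              size_t len, unsigned char iv[BLOCK])
{
    unsigned char ks[BLOCK];
    for (size_t off = 0; off + BLOCK <= len; off += BLOCK) {
        toy_block_encrypt(iv, ks);
        for (int i = 0; i < BLOCK; i++) {
            out[off + i] = (unsigned char)(in[off + i] ^ ks[i]);
            iv[i] = out[off + i];       /* ciphertext feeds the next block */
        }
    }
}

/* CFB-128 decryption, byte by byte: the reference version. */
static void cfb_decrypt_bytes(const unsigned char *in, unsigned char *out,
                              size_t len, unsigned char iv[BLOCK])
{
    unsigned char ks[BLOCK];
    for (size_t off = 0; off + BLOCK <= len; off += BLOCK) {
        toy_block_encrypt(iv, ks);
        for (int i = 0; i < BLOCK; i++) {
            unsigned char c = in[off + i];  /* read first: in may equal out */
            out[off + i] = (unsigned char)(c ^ ks[i]);
            iv[i] = c;
        }
    }
}

/* Same decryption, but the XOR is done on two 64-bit words per block.
 * memcpy is used for the loads and stores so that unaligned buffers
 * stay well-defined; compilers typically turn these into plain moves. */
static void cfb_decrypt_words(const unsigned char *in, unsigned char *out,
                              size_t len, unsigned char iv[BLOCK])
{
    unsigned char ks[BLOCK];
    for (size_t off = 0; off + BLOCK <= len; off += BLOCK) {
        uint64_t c[2], k[2];
        toy_block_encrypt(iv, ks);
        memcpy(c, in + off, BLOCK);     /* whole-block load */
        memcpy(k, ks, BLOCK);
        memcpy(iv, c, BLOCK);           /* ciphertext is the next feedback */
        c[0] ^= k[0];
        c[1] ^= k[1];
        memcpy(out + off, c, BLOCK);    /* whole-block store */
    }
}
```

Both decrypt functions take the same arguments and should produce
identical output and leave the same IV behind, so they can be swapped
in the I/O loop above to compare rates. A trailing partial block would
still need the byte-wise path.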
> If absolute performance is of such great concern,
> why not RC4 then?
3. The problem with RC4 (which I, of course,
considered) is its name. I'm not judging its merits;
simply, "RC4? Wasn't it used in ..." (export
IE/WEP/other bad things) is much inferior to an "Oh,
yes! AES!" kind of name.
> ??? PowerPC handles unaligned load/stores in
> hardware[*],
4. I confess, I hadn't noticed that "-mstrict-align"
is defined only for m68k, ppc and i960 (I have only
ever used the former two). Even for such cases as
unaligned load/store of floats (I think Motorola CPUs
always throw exceptions for those) gcc emits correct
code that automatically re-aligns the float for
exception-less load/store. Of course, you can trust the
exception handler to handle this, but "-mstrict-align"
is really handy, especially on embedded platforms.
Unfortunately, I don't have ppc gcc installed right
now, but this:
struct x {
    char a;
    double b;
} __attribute__((packed));
void f(void) {
    struct x x;      /* a variable is needed; "struct x;" declares nothing */
    double c = 5.0;
    x.b = c;         /* store to a misaligned double */
}
will generate quite nifty code when compiled with
"-mstrict-align" (I used this to write a communication
protocol driver for an embedded controller some time
ago).
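
A slightly fuller sketch of the same idea, which should build with gcc
on any target. The helper names (store_member, load_member,
store_memcpy) are made up for illustration. The first two go through
the packed member, so the compiler knows the double may be misaligned;
the third writes the same bytes with memcpy, which is the usual
portable fallback when the packed attribute is not available.

```c
#include <stddef.h>
#include <string.h>

struct x {
    char a;
    double b;   /* lands at offset 1 because the struct is packed */
} __attribute__((packed));

/* Store through the packed member: gcc is told b may be misaligned and
 * picks instructions that are safe for the target. */
static void store_member(struct x *p, double v)
{
    p->b = v;
}

/* Load through the packed member, with the same guarantee. */
static double load_member(const struct x *p)
{
    return p->b;
}

/* Portable alternative: copy the bytes of the double into a raw buffer
 * at the member's offset, with no misaligned pointer dereference. */
static void store_memcpy(unsigned char *raw, double v)
{
    memcpy(raw + offsetof(struct x, b), &v, sizeof v);
}
```

Comparing the assembler output (gcc -S) of store_member with and
without "-mstrict-align" on a target that supports the flag is the
easiest way to see what the compiler does differently.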
Can you please point me to the deprecation
announcement for the gcc vector intrinsics you
mentioned? I think they are rather interesting, and I
found nothing about them being deprecated in the gcc-4
changelog.