Hi,
> --- As you suggested, I ran some benchmarks to compare the mips-mont
> and mips32.S asm functions. It appears that the mips-mont.pl function
> is indeed better for small 512-bit RSA key signing, but generic C
> Montgomery with the optimized bn_xxx_word file is noticeably better
> from 1024-bit verifying onward. From 2048 bits on it is really faster
> (about +33%). (Please see the attached file for results and the
> options used.)
For reference, my conclusion was based on the following data. On my
300MHz R5000 system I observe a ~40% improvement on 4096-bit RSA sign
for mips-mont.pl over compiler-generated code. For mips32.S you report
a ~50% improvement on the same benchmark. The reasoning was therefore
that mips-mont.pl would be only *nominally* slower than mips32.S on the
longest keys and faster otherwise, so it can/should be preferred. The
conclusion was *not* based on an actual mips32.S benchmark, because the
suggested mips32.S miserably fails to compile on my [IRIX] system. Well,
it was apparently wrong... I mean, 33% is hardly "nominal"... I suppose
I'll have to see if I can manage to compile mips32.S...
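To make the arithmetic behind "nominally slower" explicit, here is a
minimal sketch. It assumes both percentages are throughput gains over
the same compiler-generated baseline; the helper name relative_gap is
hypothetical and used for illustration only.

```c
#include <assert.h>

/*
 * Both arguments are fractional gains over the same baseline
 * (0.40 means "40% faster than compiler-generated code"). That the
 * two figures are comparable in this way is an assumption.
 * Returns how much faster B is than A, as a fraction.
 */
static double relative_gap(double gain_a, double gain_b)
{
    assert(gain_a > -1.0 && gain_b > -1.0);
    return (1.0 + gain_b) / (1.0 + gain_a) - 1.0;
}
```

With gain_a = 0.40 (mips-mont.pl) and gain_b = 0.50 (mips32.S) this
evaluates to 1.5/1.4 - 1, i.e. an expected gap of roughly 7%, against
the ~33% reported in the quoted benchmark.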
> --- For the comba inline asm, the idea is to hide the operands'
> loading and the mult latency; for that it is important not to ask gcc
> to read the mac regs by itself just after the mult function.
You underestimate the compiler. Consider the following macro:

#define mul_add_c(a,b,c0,c1,c2) {               \
        t = (BN_ULLONG)a*b;                     \
        asm ("clrt\n"                           \
             " addc %3,%0\n"                    \
             " addc %4,%1\n"                    \
             " addc %5,%2\n"                    \
             : "+r"(c0), "+r"(c1), "+r"(c2)     \
             : "r"((BN_ULONG)t),                \
               "r"((BN_ULONG)(t>>32)),"r"(0)    \
             : "t");                            \
        }

Here is a fragment of the compiler-generated code (gcc 3.4.6 -O3):

        sts     macl,r9
        sts     mach,r11
        dmulu.l r13,r1
        clrt
        addc    r9,r7
        addc    r11,r8
        addc    r3,r0
        mov.l   @(16,r6),r9
        mov.l   @(16,r5),r12
        sts     macl,r11
        sts     mach,r2
        dmulu.l r9,r10

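For those without an SH4 toolchain at hand, what the macro computes can
be sketched in portable C. This is only an illustration, not the code
in the tree: it assumes a 32-bit BN_ULONG and a 64-bit BN_ULLONG, uses a
function with pointer arguments instead of a macro, and the self-check
helper mul_add_c_check is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t BN_ULONG;   /* assumed 32-bit word, as on SH4 */
typedef uint64_t BN_ULLONG;  /* assumed double-width type */

/* (c2,c1,c0) += a*b, i.e. add a 64-bit product to a 96-bit accumulator */
static void mul_add_c(BN_ULONG a, BN_ULONG b,
                      BN_ULONG *c0, BN_ULONG *c1, BN_ULONG *c2)
{
    BN_ULLONG t = (BN_ULLONG)a * b;
    BN_ULONG lo = (BN_ULONG)t;
    BN_ULONG hi = (BN_ULONG)(t >> 32);
    BN_ULONG carry;

    *c0 += lo;
    carry = (*c0 < lo);      /* carry out of the low word */
    hi += carry;             /* cannot wrap: hi <= 0xFFFFFFFE */
    *c1 += hi;
    carry = (*c1 < hi);      /* carry out of the middle word */
    *c2 += carry;
}

/* Worked case: (2^64 - 1) + (2^32 - 1)^2 = 2^65 - 2^33 */
static void mul_add_c_check(void)
{
    BN_ULONG c0 = 0xFFFFFFFFU, c1 = 0xFFFFFFFFU, c2 = 0;

    mul_add_c(0xFFFFFFFFU, 0xFFFFFFFFU, &c0, &c1, &c2);
    assert(c0 == 0 && c1 == 0xFFFFFFFEU && c2 == 1);
}
```

The three addc instructions in the inline asm do the same carry
propagation through the T bit, which is why the C version needs the
explicit comparisons.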
But it's probably of lesser relevance, because I'd rather just omit the
comba routines. Indeed...
> --- The -DSMALL_FOOT_PRINT doesn't make much difference on the mips32
> CPU I use (512K L2). On SH4 without L2, -DSMALL_FOOT_PRINT is a few
> percent slower, but there is no major difference. => I will have to
> check this point, since the speed difference between pure C and
> inlined asm comba funcs is important for SH4, and it used to be really
> faster in 0.9.8o.
The idea behind the suggestion is to minimize the number of things that
can go wrong. I mean, if OPENSSL_SMALL_FOOTPRINT doesn't affect
performance very much but allows dropping *bulk* code that needs
maintenance (inline assembler is slippery business :-), then it should
just be done. The time one saves with those last few percent isn't
worth it.
> --- For AES and the compressed table: on the SH4 architecture the ALU
> pipeline is the most loaded one, so I'm afraid that replacing mov.l
> with mov.l+swap would lower performance quite a bit. The low register
> count and the high L1 latency will also make the optimisation less
> efficient. => I really think that for SH4, the current 4KB uint32
> table implementation is the most efficient.
It's *one* extra swap per "iteration." Nobody said that performance
won't be affected, only that it's likely to be *adequate*, i.e. likely
to be faster than compiler-generated code, just not as fast as the
asymptotic limit, yet more secure[!]. Once again, sheer performance is
not always the primary goal.
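As an illustration of why a smaller table costs only a rotate or swap
per lookup, here is a minimal sketch of one common compression scheme.
It assumes the usual four 1KB AES encryption tables, where Te1..Te3 are
byte rotations of Te0; whether this is exactly the layout discussed for
SH4 is an assumption. Te0_sample holds only the first two Te0 entries,
and the names rotr32 and te_lookup are hypothetical.

```c
#include <stdint.h>

/* Rotate a 32-bit word right by n bits (n taken modulo 32). */
static uint32_t rotr32(uint32_t x, unsigned n)
{
    n &= 31;
    return n ? (x >> n) | (x << (32 - n)) : x;
}

/* First two entries of the standard Te0 table, for illustration only. */
static const uint32_t Te0_sample[2] = { 0xc66363a5U, 0xf87c7c84U };

/*
 * Emulate a lookup in table Te<k> (k = 0..3) using Te0 alone:
 * Te<k>[i] is Te0[i] rotated right by 8*k bits.
 */
static uint32_t te_lookup(unsigned k, unsigned i)
{
    return rotr32(Te0_sample[i], 8 * k);
}
```

Note that the k = 2 case is a rotation by 16 bits, i.e. a plain swap of
the two half-words, so storing Te0 and Te1 and deriving the other two
by half-word swap would be a 2KB variant of the same idea.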
> => Would you have any information on the MIPS pipeline internals to have
> a better view?
No, sorry. One can find some information for the processors used by
SGI, R4000 and R10000, but I haven't seen anything about embedded CPUs.

A.
______________________________________________________________________
OpenSSL Project http://www.openssl.org
Development Mailing List [email protected]