Hi,

> --- As you suggested, I made some benchs to compare the mips-mont and
>  mips32.S asm functions. It appears that indeed mips-mont.pl func is
> better for small 512 rsa key signing, but generic C montgomery with
> optimized bn_xxx_word file is quite better from 1024 verifying. From
> 2048 on it is really faster (about +33%). (please see attached file
> for results and used options).

For reference, my conclusion was based on the following data. On my 300MHz
R5000 system I observe a ~40% improvement on 4096-bit RSA sign for
mips-mont.pl over compiler-generated code. For mips32.S you report a ~50%
improvement on the same benchmark. The reasoning therefore was that
mips-mont.pl would be only *nominally* slower than mips32.S on the longest
keys and faster otherwise, and thus can/should be preferred. The
conclusion was *not* based on an actual mips32.S benchmark, because the
suggested mips32.S miserably fails to compile on my [IRIX] system. Well,
it was apparently wrong... I mean, 33% is hardly "nominal"... I suppose
I'll have to see if I can manage to compile mips32.S...
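
To spell out the arithmetic behind "nominally", here is a minimal sketch.
It assumes both percentages are throughput gains over the same
compiler-generated baseline, which is not strictly true here since the
two figures come from different machines. The function name is made up
for illustration.

```c
#include <assert.h>

/* Relative gain of implementation B over implementation A, when both
 * are quoted as a percentage improvement over a common baseline.
 * Assumption: "+40%" means 1.40x the baseline throughput. */
static double relative_gain_percent(double gain_a_percent,
                                    double gain_b_percent)
{
    double a, b;

    assert(gain_a_percent > -100.0);   /* baseline ratio must stay positive */
    a = 1.0 + gain_a_percent / 100.0;
    b = 1.0 + gain_b_percent / 100.0;
    return (b / a - 1.0) * 100.0;
}
```

With 40 and 50 as inputs this gives 1.50/1.40, i.e. roughly 7%, which is
the gap that was expected, as opposed to the 33% actually measured.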

> --- For the comba inline asm, the idea is to hide the operands'
> loading and the mult latency, it is important not to ask gcc to read
> the mac regs by itself just after the mult function.

You underestimate the compiler. Consider the following macro:

#define mul_add_c(a,b,c0,c1,c2) {       \
        t = (BN_ULLONG)a*b;             \
        asm ("clrt\n"                   \
"       addc    %3,%0\n"                \
"       addc    %4,%1\n"                \
"       addc    %5,%2\n"                \
        : "+r"(c0), "+r"(c1), "+r"(c2)  \
        : "r"((BN_ULONG)t),             \
          "r"((BN_ULONG)(t>>32)),"r"(0) \
        : "t");\
}

Here is a fragment of the compiler-generated code (gcc 3.4.6 -O3):

        sts     macl,r9
        sts     mach,r11
        dmulu.l r13,r1
        clrt
        addc    r9,r7
        addc    r11,r8
        addc    r3,r0

        mov.l   @(16,r6),r9
        mov.l   @(16,r5),r12
        sts     macl,r11
        sts     mach,r2
        dmulu.l r9,r10
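
For readers without an SH4 toolchain at hand, what the macro computes can
be sketched in portable C. This is a stand-in for illustration, not the
OpenSSL source: the name mul_add_c_ref and the pointer-based interface
are assumptions, and the typedefs assume 32-bit limbs.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t BN_ULONG;   /* 32-bit limb (assumed, as on SH4/MIPS32) */
typedef uint64_t BN_ULLONG;  /* double-width type holding the product   */

/* (c2:c1:c0) += a * b, with the carry rippling through all three
 * words. Mirrors the clrt/addc/addc/addc sequence of the macro. */
static void mul_add_c_ref(BN_ULONG a, BN_ULONG b,
                          BN_ULONG *c0, BN_ULONG *c1, BN_ULONG *c2)
{
    BN_ULLONG t;
    BN_ULONG lo, hi, carry;

    assert(c0 && c1 && c2);
    t  = (BN_ULLONG)a * b;
    lo = (BN_ULONG)t;
    hi = (BN_ULONG)(t >> 32);

    *c0 += lo;
    carry = (*c0 < lo);        /* carry out of the low word          */

    *c1 += hi;
    lo = (*c1 < hi);           /* reuse lo: carry from adding hi     */
    *c1 += carry;
    lo += (*c1 < carry);       /* plus carry from adding the carry   */

    *c2 += lo;                 /* at most 1 reaches the top word     */
}
```

In the listing above, the point to look at is where the sts and dmulu.l
instructions land relative to the addc chain.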

But it's probably of lesser relevance, because I'd rather just omit the
comba routines. Indeed...

> -- the -DSMALL_FOOT_PRINT don't make much difference on the mips32
> CPU I use (512K L2). On SH4 without L2, the -DSMALL_FOOT_PRINT is
> some % lower, but there is no major difference.. => I will have to
> check this point since the speed difference between pure C and
> inlined asm comba funcs is important for SH4, and used to be really
> faster in 0.9.8o.

The idea behind the suggestion is to minimize the number of things that
can go wrong. I mean, if OPENSSL_SMALL_FOOTPRINT doesn't affect
performance very much, but allows one to drop *bulk* code that needs
maintenance (inline assembler is a slippery business:-), then it should
simply be done. Those last few percent are not worth the maintenance
effort.

> --- For aes and compressed table: On sh4 arch the ALU pipeline is
> the most loaded one, so I'm afraid that replacing mov.l with
> mov.l+swap would lower the perfs quite a bit. The low register count
> and the high L1 latency will also make the optimisation less
> efficient. => I really think that for SH4, the current 4KB uint32
> table implementation is the more efficient.

It's *one* extra swap per "iteration." Nobody said that performance
wouldn't be affected, only that it is likely to be *adequate*, i.e.
likely faster than compiler-generated code, just not as fast as the
asymptotic limit, yet more secure[!]. Once again, sheer performance is
not always the primary goal.
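
As a rough illustration of why a smaller table costs only a rotation per
lookup, here is a sketch in C. It assumes "compressed table" means
keeping a single 1KB encryption table and deriving the other three by
byte rotation; the layout actually under discussion may differ. All
function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate a 32-bit word right by n bits, 0 < n < 32. */
static uint32_t ror32(uint32_t x, unsigned n)
{
    assert(n > 0 && n < 32);
    return (x >> n) | (x << (32 - n));
}

/* Multiply by 2 in GF(2^8) modulo the AES polynomial 0x11b. */
static uint8_t xtime(uint8_t s)
{
    return (uint8_t)((s << 1) ^ ((s & 0x80) ? 0x1b : 0x00));
}

/* First-table entry for an S-box output byte s: bytes (2s, s, s, 3s). */
static uint32_t te0_entry(uint8_t s)
{
    uint8_t s2 = xtime(s);
    uint8_t s3 = (uint8_t)(s2 ^ s);

    return ((uint32_t)s2 << 24) | ((uint32_t)s << 16) |
           ((uint32_t)s << 8)   | (uint32_t)s3;
}

/* The remaining three tables are byte rotations of the first one,
 * so they need not be stored. */
static uint32_t te1_entry(uint8_t s) { return ror32(te0_entry(s), 8);  }
static uint32_t te2_entry(uint8_t s) { return ror32(te0_entry(s), 16); }
static uint32_t te3_entry(uint8_t s) { return ror32(te0_entry(s), 24); }
```

The rotation by 16 is the same thing as swapping the two halfwords of
the register.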

> => Would you have any information on the MIPS pipeline internals to have
> a better view?

No, sorry. One can find some information on the processors used by SGI,
the R4000 and R10000, but I haven't seen anything about embedded CPUs. A.

______________________________________________________________________
OpenSSL Project                                 http://www.openssl.org
Development Mailing List                       [email protected]
Automated List Manager                           [email protected]
