>>>> The techniques used in this plain v9 implementation are:
>>>>
>>>> 1) Use little-endian 32-bit loads when input data is aligned.
>>>> 2) Avoid having to accumulate into the context hash values every
>>>>    loop iteration.
>>>> 3) In the aligned case try to separate the loads from the first
>>>>    use by as many instructions as possible, without sacrificing
>>>>    the schedule too much.
>>>> 4) Attempt to dual-issue as much as possible on UltraSPARC-I/II/III/IV
>>>>    and SPARC-T4.
>>> I had an old module lying around and dusted it off in
>>> http://cvs.openssl.org/chngview?cn=22842. It's 20% faster than your
>>> version on pre-Tx UltraSPARCs. The improvement factor is likely to be
>>> even higher on T1, because it keeps everything in the register bank
>>> and there are no loads except for input. Not really relevant, but it's
>>> nominally faster even on T4.
>>
>> Could you discuss something like this before checking in such
>> changes instead of just silently dismissing work I've posted?
> 
> All right, will do.

Just for reference: besides the already mentioned minimization of loads,
the alternative code works around architectural asymmetries in legacy
UltraSPARCs by maintaining an odd number of integer instructions between
shifts. This is what makes it perform better on US-I through US-IV. It
also "pushes up" some instructions into the previous round, and as a
result there are a few redundant instructions in the last round.
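
For readers who have not seen either module, here is a rough C sketch of
the "no loads except for input" idea, together with points 1 and 2 of
the quoted list. It is not taken from either implementation. The names
toy_ctx, toy_update and load_le32 are hypothetical, and the mixing step
is a placeholder rather than a real round function. What it is meant to
show is that the chaining state is read from the context once, kept in
local variables across all blocks, and written back once at the end.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical two-word context; a real hash state is larger. */
typedef struct { uint32_t a, b; } toy_ctx;

/* Portable little-endian 32-bit load; works for any alignment.
 * An aligned fast path could replace this with a single load. */
static uint32_t load_le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

static uint32_t rotl32(uint32_t x, unsigned n)
{
    return (x << n) | (x >> (32 - n));   /* n must be in 1..31 */
}

/* Process nblocks 16-byte blocks. The state lives in locals, which a
 * compiler can keep in registers, so the loop touches memory only to
 * read input. */
void toy_update(toy_ctx *ctx, const unsigned char *in, size_t nblocks)
{
    assert(ctx != NULL && (in != NULL || nblocks == 0));

    uint32_t a = ctx->a, b = ctx->b;     /* read the context once */

    while (nblocks--) {
        uint32_t sa = a, sb = b;         /* chaining value, kept in locals */

        for (int i = 0; i < 4; i++) {
            uint32_t w = load_le32(in + 4 * i);
            uint32_t t = rotl32(a + (b ^ w), 7) + b;   /* placeholder mix */
            a = b;
            b = t;
        }

        a += sa;                         /* feed-forward without memory */
        b += sb;
        in += 16;
    }

    ctx->a = a;                          /* write the context once */
    ctx->b = b;
}
```

Calling toy_update() once with two blocks gives the same result as two
calls with one block each; the only difference is how often the context
is read and written.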
______________________________________________________________________
OpenSSL Project                                 http://www.openssl.org
Development Mailing List                       [email protected]
Automated List Manager                           [email protected]
