I have a reasonable algorithm for computing p^2 mod f. I have a sketch for the 
code sufficient to estimate timings but it hasn't yet been fully coded or tested. 
It's *much* more complex than the simple code I posted yesterday.

The approximate timing is 3,250 clocks. It will do 2 * p^2 mod f for only about 15 
clocks extra. It will go to 96 bits for an extra 600 clocks. There is scope for a
version limited to 72 bits which would be about 250 clocks faster than the 80 bit 
version.

The algorithm consists of two parts: effectively, a long multiplication in
base 2^32 and a long division in base 2. The long multiplication gains very little 
or nothing from any restriction in the value of f, therefore a full 96 bit long 
multiplication is done.

Let p0 be the low 32 bits of p, p1 the middle 32 bits and p2 the high 32 bits.
When I say "word" I mean 32 bits.

1. Using the FPU, calculate p0*p0, p0*p1, p0*p2, p1*p1, p1*p2 and p2*p2, storing 
each result in a seperate 64-bit work area. This appears to be best done using the 
floating point unit on the processor. Everything else uses only the ALU.

2. Form the result in a 192-bit "accumulator" as follows.
        1st word = low word of p0*p0
        2nd word = high word of p0*p0 + 2*low word of p0*p1
        3rd word = 2*high word of p0*p1 + low word of p1*p1 + 2*low word of p0*p2
        4th word = high word of p1*p1 + 2*high word of p0*p2 + 2*low word of p1*p2
        5th word = 2*high word of p1*p2 + low word of p2*p2
        6th word = high word of p2*p2
.. propogating all carries from one word to the next.
This accumulator has to be stored in memory, there aren't enough registers :-(
>From now on, when I say "accumulator", I mean this memory buffer, not a CPU 
register.

(There's scope for optimization here. Can start forming the result while the FPU 
is still working on the later partials; this should allow some of the operations
to be "free" since independent ALU and FPU operations are done in parallel.)

3. If f has 80 or less significant bits (high 16 bits of f2 = 0), left shift the
accumulator by 16 bits & set the division loop count to 80, else set the division 
loop count to 96.

(Scope for further optimization. If f2 has 72 or less significant bits, or more 
than 80 but less than or equal to 88 significant bits, left shift the accumulator 
by 8 bits & reduce the division loop count by 8)

Note, f is frequently referenced in the division loop; store it in registers. The 
loop count is infrequently referenced & can be stored in memory.

4. Repeat until division loop count = 0:
        4a. Subtract f from the high 96 bits of the accumulator, if this is
            possible without generating a borrow from the MSB.
        4b. Left shift the whole accumulator by 1 bit.
        4c. Decrement the division loop count.

5. If, but only if, the operation required is 2*p^2 mod f, subtract f from the 
high 96 bits of the accumulator, if this is possible without generating a borrow 
from the MSB.

The required result is now in the high 96 bits of the accumulator.

The bulk of the time is spent in the loop in step 4. Although the exact timing for 
each iteration depends on whether or not the subtraction "goes", the effect is 
small. There are two problems - speculative precalculation is not helpful, since 
there is heavy dependence of each instruction on the previous one, and rotating 
192 bits through memory requires 6 loads, 6 rotates and 6 stores. I estimate the 
loop timing as 1 clock per instruction, not including the rotates, plus 18 clocks 
for the rotates. It comes out to approx. 39 clocks per iteration.

Keeping f in memory and using CPU registers for the 3 most significant words of 
the accumulator may be better.

Reply via email to