Jeremy Blosser wrote:

>So for each squaring, we have x left shifts and adds, where x is no larger
>than p.

Hi, Jeremy:

A couple years ago I discussed the same issue with an old schoolfriend of
mine, who is a specialist in microelectronics. He said, sure, with a giant
binary shifter (perhaps implemented via a programmable logic array) one
could do this, but after crunching the numbers some more we both concluded
that for large Mersennes, this is not an effective alternative to a fast-
transform-based squaring. Here's one way to think about it: imagine you
have two microprocessors, one a general-purpose unit (i.e. a typical
64-bit RISC) doing the Mersenne-mod multiply via FFT, the other a dedicated
shift-and-accumulate unit. Both will be assumed to operate at the same
clock frequency. The general-purpose unit might do an FFT-based squaring
using 64-bit floats, which allows input digits of nearly 20 bits (for
p up to the several millions), and will thus do around 10*(p/20)*log(p/20)
floating-point operations (the multiplier of 10 is a typical figure, based
on the cost of doing both a forward and an inverse FFT per squaring).
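As a quick back-of-envelope check, one can compare the two operation counts directly (just a sketch using the figures above; the function names and the choice of p = 1,000,000 are mine, not anything standard):

```python
from math import log2

def fft_ops(p, bits_per_digit=20):
    """FFT-based squaring: roughly 10*n*log2(n) flops,
    where n = p/bits_per_digit is the number of input digits."""
    n = p / bits_per_digit
    return 10 * n * log2(n)

def shift_ops(p):
    """Scheme (1): schoolbook shift-and-accumulate,
    about p*(p/64) 64-bit shift-add steps per squaring (one shifter)."""
    return p * (p / 64)

p = 1_000_000  # an exponent "up to the several millions"
print(f"FFT flops per squaring:          {fft_ops(p):.2e}")
print(f"64-bit shift-add steps (1 unit): {shift_ops(p):.2e}")
```

For p around a million the schoolbook count exceeds the FFT count by a factor of well over a thousand, which is why scheme (1) needs a very large [nshifter] to compete.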

Now, for the shift-based chip, there are two ways to approach the squaring:

1) Distribute the multi-million-bit operands across many smaller memory
units (e.g. 64-bit memory fields) and have one or a few arithmetic units
shift each of these and accumulate the results. Even if we can do a full
shift-accumulate step (including any carry bit from the previous digit)
per cycle, this will need p*(p/64)/[nshifter] (i.e. O(p^2)) cycles, where
[nshifter] is the number of 64-bit shift-accumulate units and these are
assumed to be executing in parallel. So, unless [nshifter] is very large,
it's going to be really slow. So, you say, let's just make [nshifter] very
large, and in the limit where [nshifter] is so large that all the operand
bits get shifted every cycle, one has

2) A giant shift-and-accumulate register, which needs just p cycles to
do the modular squaring irrespective of the size of p, i.e. it beats even
the FFT-based squaring by a factor of O(log p) in terms of speed. The
thing that kills this approach is power consumption - even if it needs
just O(p) cycles to do a squaring, it's still doing O(p^2) bitwise
operations per squaring, and thus has a power dissipation that is O(p).
The FFT-based scheme, on the other hand, even assuming one is keeping
a 64-bit floating adder and multiplier completely busy, spreads
O(p log p) operations over the same order of cycles, i.e. has a constant
power dissipation that is order of unity.
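The arithmetic behind scheme (2) is pleasantly simple: since 2^p == 1 mod 2^p - 1, a left shift is just a bit rotation, and the squaring is p rotate-and-accumulate passes. Here's a toy software model of it (my own sketch, obviously not hardware; `mersenne_square` is a made-up name):

```python
def mersenne_square(x, p):
    """Square x mod M = 2^p - 1 by shift-and-accumulate.
    For each set bit i of x, add in x rotated left by i bits:
    since 2^p == 1 (mod M), a p-bit shift wraps around, so every
    partial product is just a rotation of x."""
    M = (1 << p) - 1
    acc = 0
    for i in range(p):                  # p accumulate passes, as in scheme (2)
        if (x >> i) & 1:
            # rotate-left by i within a p-bit word
            rot = ((x << i) | (x >> (p - i))) & M
            acc += rot
    # fold the accumulated high bits back down, again using 2^p == 1
    while acc > M:
        acc = (acc & M) + (acc >> p)
    return acc % M
```

Each pass is one rotation plus one add, hence O(p) cycles but O(p^2) bit operations per squaring, which is exactly the power-dissipation problem described above.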

Just imagine: if the O(1) power dissipation of the FFT-based scheme
still translates into a piece of silicon about 1 cm^2 dissipating
(say) 30 watts, how hot is a similar-sized piece of silicon dissipating
O(p)*30 watts going to get?

Now, if (or when) commodity high-temperature-superconductor-based
microprocessors become available, the power draw becomes negligible,
and then scheme (2) perhaps becomes attractive. It certainly looks
easier to code!

-Ernst

_________________________________________________________________
Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm
Mersenne Prime FAQ      -- http://www.tasam.com/~lrwiman/FAQ-mers
