Jeremy Blosser wrote:
>So for each squaring, we have x left shifts and adds, where x is no larger
>that p.

Hi, Jeremy:

A couple of years ago I discussed the same issue with an old schoolfriend of mine, who is a specialist in microelectronics. He said that, sure, with a giant binary shifter (perhaps implemented via a programmable logic array) one could do this, but after crunching the numbers some more we both concluded that for large Mersennes this is not an effective alternative to a fast-transform-based squaring.

Here's one way to think about it: imagine you have two microprocessors, one doing general-purpose math (i.e. a typical 64-bit RISC) and performing the Mersenne-mod multiply via FFT, the other a dedicated shift-and-accumulate unit. Both are assumed to operate at the same clock frequency.

The general-purpose unit might do an FFT-based squaring using 64-bit floats, which allows input digits of nearly 20 bits (for p up to several million), and will thus do around 10*(p/20)*log(p/20) floating-point operations (the multiplier of 10 is a typical figure, based on the cost of doing both a forward and an inverse FFT per squaring).

Now, for the shift-based chip, there are two ways to approach the squaring:

1) Distribute the multi-million-bit operands across many smaller memory units (e.g. 64-bit memory fields) and have one or a few arithmetic units shift each of these and accumulate the results. Even if we can do a full shift-accumulate step (including any carry bit from the previous digit) per cycle, this will need p*(p/64)/[nshifter] (i.e. O(p^2)) cycles, where [nshifter] is the number of 64-bit shift-accumulate units, assumed to execute in parallel. So, unless [nshifter] is very large, it's going to be really slow.
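To get a feel for the numbers, here is a minimal sketch that plugs the estimates above into code. The function names and the example exponent are my own; the formulas are the ones quoted in the text (10*(p/20)*log(p/20) float ops for the FFT scheme, p*(p/64)/[nshifter] cycles for the shift scheme):

```python
# Rough per-squaring cost comparison, using the estimates from the post.
# (Illustrative only -- real implementations have many more constants.)
import math

def fft_ops(p):
    """Estimated floating-point ops for one FFT-based squaring,
    assuming ~20-bit input digits packed into 64-bit floats."""
    n = p / 20.0                       # number of ~20-bit digits
    return 10.0 * n * math.log2(n)     # forward + inverse FFT, factor ~10

def shift_cycles(p, nshifter):
    """Estimated cycles for one shift-and-accumulate squaring using
    nshifter parallel 64-bit shift/accumulate units (scheme 1)."""
    return p * (p / 64.0) / nshifter

p = 6_972_593  # a multi-million-bit Mersenne exponent, just for scale
print(f"FFT ops per squaring:           {fft_ops(p):.3e}")
print(f"shift cycles, nshifter = 1:     {shift_cycles(p, 1):.3e}")
print(f"shift cycles, giant register:   {float(p):.3e}")  # scheme (2): p cycles
```

With one shifter the shift scheme needs tens of thousands of times more cycles than the FFT scheme needs operations; only in the giant-register limit (scheme 2 below, p cycles) does it pull ahead.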
So, you say, let's just make [nshifter] very large. In the limit where [nshifter] is so large that all the operand bits get shifted every cycle, one has:

2) A giant shift-and-accumulate register, which needs just p cycles to do the modular squaring, however large p is, i.e. it beats even the FFT-based squaring by a factor of O(log p) in terms of speed.

The thing that kills this approach is power consumption: even though it needs just O(p) cycles to do a squaring, it's still doing O(p^2) bitwise operations per squaring, and thus has a power dissipation that is O(p). The FFT-based scheme, on the other hand, even assuming one keeps a 64-bit floating-point adder and multiplier completely busy, spreads O(p log p) operations over the same order of cycles, i.e. has a constant power dissipation of order unity. Just imagine: if the O(1) power dissipation of the FFT-based scheme still translates into a piece of silicon about 1 cm^2 dissipating (say) 30 watts, how hot is a similar-sized piece of silicon dissipating O(p)*30 watts going to get?

Now, if (or when) commodity high-temperature-superconductor-based microprocessors become available, the power draw becomes negligible, and then scheme (2) perhaps becomes attractive. It certainly looks easier to code!

-Ernst
_________________________________________________________________
Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm
Mersenne Prime FAQ -- http://www.tasam.com/~lrwiman/FAQ-mers
