Jason Papadopoulos wrote:

>> The alpha was already at least 5x faster than a PIII for multiprecision
>> arithmetic at the same clock speed; with the P4 it will only get worse.

Brian Beesley replied:

> Are you sure about this? I think, with Alpha, you have to execute the 
> instruction twice to get a double-precision multiplication - you can 
> store either the low half or the high half of a product in one 
> instruction, but not both.

True, you need two instructions to get both halves of a 64x64 ==> 128-bit
product on the Alpha. It always seemed wasteful to me to have an instruction
(e.g. Alpha umulh) to get the upper half of a product, which simply discards
the lower half - by way of contrast, MIPS dmult calculates the full 128-bit
product in one shot, returning the result in *two* 64-bit integer registers -
but apparently the Alpha designers wanted all instructions to be in 2-input,
one-output-register format, no exceptions. You might say they felt it would
be less than RISCy to do otherwise.

Now, on the Alpha 21064, integer muls are unpipelined and slow, so getting
a 128-bit product takes around 30 cycles. On the 21164 imuls are partially
pipelined - one can start at most every 8 cycles - so a double-width product
needs around 16 cycles, very close to what the MIPS needs using its unpipelined
dmultu instruction. On the Alpha 21264, imuls still have fairly long
latency (I believe around 6-8 cycles) but are *fully* pipelined, so even
with the need to use 2 instructions to do it, one can effectively get a
128-bit product every 2 cycles, which really screams.

The IA-64 will (from what I've read) actually be very much like the 21264
in this respect: 2 instructions for the separate halves of a double-wide
integer product (the IA-64 analog of Alpha umulh is called xma.hu), but
here the muls will use the x86-style floating multiplier, whose 64-bit
mantissa is of convenient length for 64-bit integer multiply. Thus, imuls
will benefit from the fact that the fmul unit is already designed to have
fairly low latency and, more importantly, is highly pipelined.

How is any of this relevant to Mersenne testing? Well, fast 64-bit integer
ops should make a well-designed all-integer convolution algorithm competitive
with a floating-point-FFT-based one. However, neither of these two is truly
satisfying since each basically uses just half the functionality (integer
or floating-point) of the processor. We really need an algorithm that does
floating and integer convolutions in parallel and uses some method (e.g.
Chinese remainder theorem or some other form of error correction) to
recorrelate the two result vectors at the end of each convolution step.
Such methods are feasible, but highly nontrivial to implement, and even
more challenging to implement efficiently. Stay tuned...

-ernst

_________________________________________________________________________
Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm
Mersenne Prime FAQ      -- http://www.tasam.com/~lrwiman/FAQ-mers
