http://www.cnn.com/2002/TECH/industry/05/21/supercomputing.future.idg/index.html
The theme of reducing transistor count without sacrificing much performance
is an interesting one. Some obvious possibilities I can think of, related to the
way typical CPUs do hardware arithmetic:
1) For floating-point add, have a dedicated unit do the exponent extraction
and mantissa shift-count generation, then do the actual add in the integer
adder (there are often multiple integer adders on the chip, so this need
not stall genuine integer adds occurring at the same time). A rough C
sketch of this decomposition appears after the list.
2) Similarly, use the integer multiplier to multiply the two floating-point
mantissas in an FMUL. For IEEE doubles this would need to generate the upper
53 bits of a 106-bit product, but many CPUs can already generate a full
128-bit integer product (this is useful for cryptographic applications,
among other things), so one could use the same hardware for both; see the
second sketch below.
3) Have the compiler look for operations that can be streamlined at the
hardware level. For example, a very common operation sequence in doing
Fourier (and other) transforms is the pair
a = x + y
b = x - y .
If these are floating-point operands, one would need to do the exponent
extract and mantissa shift of x and y just once, and then do an integer
add and subtract on the aligned mantissa pair. It might even be possible
to do a pairwise integer add/sub more cheaply than 2 independent operations
at the hardware level: for instance, the 2 operands need only be loaded
once, if the hardware permits multiple operations without intervening loads
and stores (and yes, this does run counter to the typical load/store RISC
paradigm). The pattern is sketched below.
4) Emulate complex functions, rather than adding hardware to support them.
For instance, square root and divide can both be done using just multiply
and add, by way of a Newton-style iteration (sketched below). The downside is
that this generally requires one to sacrifice full compliance with the IEEE
standard, but hardware manufacturers have long played this kind of game
anyway - offer full compliance in some form (possibly even via software
emulation), but relax it for various fast arithmetic operations as needed.
5) Use a smallish register set (perhaps even a single general-purpose
register set for both integer and floating-point data) together with memory
caches, but support a variety of memory-prefetch operations to hide the
resulting main-memory latency insofar as possible; a prefetch sketch
appears below.
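
To make (1) concrete, here is a rough C sketch - not the hardware datapath
itself, just the decomposition it would exploit - of a floating-point add
done with integer operations only. It assumes positive, normal IEEE doubles
and truncates instead of rounding; real hardware also handles signs,
rounding modes, and special values:

    #include <stdint.h>
    #include <string.h>

    /* Sketch only: add two positive, normal doubles via pure integer
       arithmetic, showing the split into exponent/shift-count logic and
       an integer mantissa add. Signs, rounding, exponent overflow, and
       special values are all ignored. */
    double fadd_via_int(double x, double y)
    {
        uint64_t xb, yb;
        memcpy(&xb, &x, 8);
        memcpy(&yb, &y, 8);

        int      ex = (int)(xb >> 52);   /* biased exponent (sign bit is 0) */
        int      ey = (int)(yb >> 52);
        uint64_t mx = (xb & 0xFFFFFFFFFFFFFull) | (1ull << 52); /* implicit 1 */
        uint64_t my = (yb & 0xFFFFFFFFFFFFFull) | (1ull << 52);

        if (ex < ey) {                   /* ensure x has the larger exponent */
            int t = ex;      ex = ey; ey = t;
            uint64_t u = mx; mx = my; my = u;
        }
        int d = ex - ey;                 /* mantissa alignment shift count   */
        my = (d > 63) ? 0 : (my >> d);

        uint64_t m = mx + my;            /* the actual add: pure integer     */
        if (m >> 53) { m >>= 1; ex++; }  /* renormalize on carry-out         */

        uint64_t rb = ((uint64_t)ex << 52) | (m & 0xFFFFFFFFFFFFFull);
        double r;
        memcpy(&r, &rb, 8);
        return r;
    }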
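
For (2), the mantissa product itself might look like the following, using
the GCC/Clang unsigned __int128 extension as a stand-in for a 64x64 ->
128-bit multiply instruction; rounding of the discarded low bits is again
ignored:

    #include <stdint.h>

    /* Sketch only: the 53x53-bit mantissa multiply at the heart of a
       double-precision FMUL, done on the integer multiplier. Inputs are
       full 53-bit mantissas (implicit bit included); the product is 106
       bits, and we return its upper 53 bits, truncated. */
    uint64_t fmul_mantissa_hi(uint64_t mx, uint64_t my)
    {
        unsigned __int128 p = (unsigned __int128)mx * my;
        /* Both inputs lie in [2^52, 2^53), so the product lies in
           [2^104, 2^106) and its leading bit is at position 104 or 105;
           the latter case is the one that bumps the result exponent. */
        return (p >> 105) ? (uint64_t)(p >> 53) : (uint64_t)(p >> 52);
    }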
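
The add/sub pair in (3) is just the radix-2 butterfly; the sketch below
shows the pattern a compiler would look for. In hardware, both results
could share a single exponent-extract/align step for x and y:

    /* Sketch only: the paired add/sub ("butterfly") that dominates FFT
       inner loops. x and y are read once and produce both results, which
       is what makes a fused pairwise hardware operation attractive. */
    static inline void butterfly(double x, double y, double *a, double *b)
    {
        *a = x + y;
        *b = x - y;
    }

    /* Typical in-place use in one radix-2 pass (indexing illustrative): */
    void fft_pass(double *v, int half)
    {
        for (int i = 0; i < half; i++)
            butterfly(v[i], v[i + half], &v[i], &v[i + half]);
    }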
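
For (4), here is a reciprocal via Newton's iteration x' = x*(2 - a*x),
built from multiply and add only. The bit-trick seed is a stand-in for the
small lookup table real hardware would use, and the result can differ from
the IEEE correctly-rounded quotient in the last bit - which is precisely
the compliance trade-off mentioned above:

    #include <stdint.h>
    #include <string.h>

    /* Sketch only: 1/a for positive, normal a, using just multiply and
       add. Each Newton step roughly doubles the number of correct bits,
       so a seed good to a few bits plus 5 steps reaches ~53 bits. */
    double recip_newton(double a)
    {
        /* Crude seed: subtracting the bit pattern from a constant whose
           exponent field is 2046 approximately negates the exponent,
           giving an estimate within roughly 12% of 1/a. */
        uint64_t bits;
        memcpy(&bits, &a, 8);
        bits = 0x7FE0000000000000ull - bits;
        double x;
        memcpy(&x, &bits, 8);

        for (int i = 0; i < 5; i++)
            x = x * (2.0 - a * x);    /* error is squared each step      */
        return x;                     /* then b/a is just b * (1/a)      */
    }

Square root works the same way via the reciprocal-square-root iteration
x' = x*(1.5 - 0.5*a*x*x), with sqrt(a) recovered as a * rsqrt(a).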
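
And for (5), software prefetch in a streaming loop might look like this;
__builtin_prefetch is the GCC/Clang spelling, and the prefetch distance
(two 64-byte cache lines here) is a tuning parameter, not a universal
constant:

    /* Sketch only: hide main-memory latency behind a streaming loop by
       requesting data a fixed distance ahead. Prefetches are hints, so
       the overrun past the end of the arrays is harmless. */
    void scale(double *a, const double *b, long n, double s)
    {
        for (long i = 0; i < n; i++) {
            __builtin_prefetch(&b[i + 16], 0, 0); /* 0 = for read  */
            __builtin_prefetch(&a[i + 16], 1, 0); /* 1 = for write */
            a[i] = s * b[i];
        }
    }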
Others can surely add to this list, and extend it to non-arithmetic functionality.
-Ernst