Mersenne: Re: This supercomputer is cool

2002-05-22 Thread Steinar H. Gunderson

On Wed, May 22, 2002 at 11:36:11AM +, Brian J. Beesley wrote:
>A few weeks ago there was an argument based on making graphics chips 
>programmable. This would enable massive parallelization of SSE type code. Of 
>course we really need double-precision, but it's an interesting idea.

How massive is "massive"? Most graphics hardware today can "only" texture
one, two or four pixels per clock -- although they can do a lot of work
quickly, they definitely don't do hundreds of operations at a time :-)

/* Steinar */
-- 
Homepage: http://www.sesse.net/
_________________________________________________________________
Unsubscribe & list info -- http://www.ndatech.com/mersenne/signup.htm
Mersenne Prime FAQ  -- http://www.tasam.com/~lrwiman/FAQ-mers



Re: Mersenne: This supercomputer is cool

2002-05-22 Thread Brian J. Beesley

On Tuesday 21 May 2002 16:21, [EMAIL PROTECTED] wrote:
> http://www.cnn.com/2002/TECH/industry/05/21/supercomputing.future.idg/index.html
>
> The theme of reducing transistor count without sacrificing much performance
> is an interesting one. 

This is indeed interesting. The problem is that the sustained floating-point 
performance of Transmeta chips seems to be at best only similar to that of 
PIII or Athlon chips when scaled by the power consumption of the chip itself. 
For our purposes, the transistor-heavy SSE2 unit on the P4 gives a much 
larger performance improvement than the resulting increase in power 
consumption. Add on the power consumed by support devices (chipset, memory, 
etc.) and the Transmeta design doesn't look too effective.

In a situation where _on average_ only around 10% of the actual peak 
performance is required, the Transmeta design has a considerable advantage, 
due to its ability to idle at very low current.

From the ecological point of view, one could easily make gains by using the 
"waste" heat from processors - the ability to "pipe" it into building heating 
systems (at least into the "hot" side of a heat pump) would be more useful 
than dispersing it locally as hot air.

> Some obvious possibilities I can think of, related to the way typical
> CPUs do hardware arithmetic:
>
> 1) For floating-point add, have a dedicated unit to do exponent extraction
> and mantissa shift count generation, then do the actual add in the integer
> adder (there are often multiple integer adders on the chip, so this need
> not stall genuine integer adds that are occurring at the same time).
>
> 2) Similarly, use the integer multiplier to multiply the 2 floating
> mantissas in an FMUL together. For IEEE doubles this would need to generate
> the upper 53 bits of a 106-bit product, but many CPUs already can generate
> a full 128-bit integer product (this is useful for cryptographic
> applications, among other things), so one could use the same hardware for
> both.

Most of the transistors in a typical modern CPU die are associated with cache 
& multiple parallel execution units. Cutting out one execution unit - one of 
the simpler ones at that - probably wouldn't save much power.
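
To make items 1 & 2 concrete, here is a rough C sketch (not from the original 
post) of a floating-point add done the way item 1 describes - exponent 
extraction and shift-count generation first, then a plain integer add on the 
mantissas. It handles only positive, normal doubles, ignores rounding and 
special values, and truncates rather than rounds; it just shows where the 
work goes, not production code.

    /* Rough sketch: FP add on positive, normal IEEE-754 doubles via
       exponent extraction + an integer mantissa add. No signs, no
       rounding, no NaN/Inf/denormal handling. */
    #include <stdint.h>
    #include <string.h>

    double soft_fadd(double x, double y)
    {
        uint64_t xb, yb;
        memcpy(&xb, &x, 8);
        memcpy(&yb, &y, 8);

        /* "Dedicated unit": exponent extraction, hidden bits restored. */
        int ex = (int)((xb >> 52) & 0x7FF);
        int ey = (int)((yb >> 52) & 0x7FF);
        uint64_t mx = (xb & 0xFFFFFFFFFFFFFULL) | (1ULL << 52);
        uint64_t my = (yb & 0xFFFFFFFFFFFFFULL) | (1ULL << 52);

        /* Shift-count generation: align the smaller operand. */
        if (ex < ey) {
            uint64_t tm = mx; mx = my; my = tm;
            int te = ex; ex = ey; ey = te;
        }
        int shift = ex - ey;
        my = (shift > 63) ? 0 : (my >> shift);

        /* The "integer adder" does the actual addition. */
        uint64_t m = mx + my;

        /* Renormalize: same-sign addition needs at most one bit shift. */
        if (m >> 53) { m >>= 1; ex++; }

        uint64_t rb = ((uint64_t)ex << 52) | (m & 0xFFFFFFFFFFFFFULL);
        double r;
        memcpy(&r, &rb, 8);
        return r;   /* truncated, not correctly rounded */
    }

The multiply case (item 2) is the same idea: multiply the two 53-bit 
mantissas in the integer multiplier, keep the upper half of the 106-bit 
product, and add the exponents.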
>
> 3) Have the compiler look for operations that can be streamlined at the
> hardware level. For example, a very common operation sequence in doing
> Fourier (and other) transforms is the pair
>
> a = x + y
> b = x - y .
>
> If these are floating-point operands, one would need to do the exponent
> extract and mantissa shift of x and y just once, and then do an integer
> add and subtract on the aligned mantissa pair. It might even be possible
> to do a pairwise integer add/sub cheaper than 2 independent operations
> at the hardware level (for instance, the 2 operands need only be loaded
> once, if the hardware permits multiple operations without intervening loads
> and stores - and yes, this does run counter to the typical load/store RISC
> paradigm).

This is a Good Idea and would cost very little extra silicon. There is 
however a synchronization problem resulting from always doing add & subtract 
in parallel (& discarding the extra result when only one is needed). This is 
because (when operands have the same sign) renormalizing after addition 
requires at most one bit shift in the mantissa, whereas after subtraction one 
may require a large number of bit shifts; indeed we may even end up with a 
zero result. Fixing this synchronization problem requires either extra silicon 
in the execution unit, or a more complex pipeline.
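
For what it's worth, a software sketch of the paired butterfly (again my own 
illustration, positive normal doubles only, x >= y, no rounding) makes both 
the saving and the synchronization problem visible: the exponent extraction 
and alignment are shared, the sum renormalizes by at most one bit, but the 
difference may need a long normalizing shift or come out exactly zero.

    /* Rough sketch: combined a = x + y, b = x - y sharing one
       exponent-align step. Assumes positive, normal doubles with
       x >= y; no rounding or special-value handling. */
    #include <stdint.h>
    #include <string.h>

    static double pack(int e, uint64_t m)   /* m has hidden bit set, or is 0 */
    {
        uint64_t b = m ? (((uint64_t)e << 52) | (m & 0xFFFFFFFFFFFFFULL)) : 0;
        double r;
        memcpy(&r, &b, 8);
        return r;
    }

    void soft_faddsub(double x, double y, double *sum, double *diff)
    {
        uint64_t xb, yb;
        memcpy(&xb, &x, 8);
        memcpy(&yb, &y, 8);

        /* Shared exponent extraction and alignment for both results. */
        int ex = (int)((xb >> 52) & 0x7FF);
        int ey = (int)((yb >> 52) & 0x7FF);
        int d = ex - ey; if (d > 63) d = 63;
        uint64_t mx = (xb & 0xFFFFFFFFFFFFFULL) | (1ULL << 52);
        uint64_t my = ((yb & 0xFFFFFFFFFFFFFULL) | (1ULL << 52)) >> d;

        /* One integer add and one integer subtract on aligned mantissas. */
        uint64_t ms = mx + my;
        uint64_t md = mx - my;

        /* Sum path: at most one renormalizing shift. */
        int es = ex;
        if (ms >> 53) { ms >>= 1; es++; }

        /* Difference path: possibly many shifts, possibly exactly zero -
           this is the latency mismatch between the two halves. */
        int ed = ex;
        while (md && !(md >> 52)) { md <<= 1; ed--; }

        *sum  = pack(es, ms);
        *diff = pack(ed, md);
    }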
>
> 4) Emulate complex functions, rather than adding hardware to support them.
> For instance, square root and divide can both be done using just multiply
> and add by way of a Newtonian-style iterative procedure. The downside is
> that this generally requires one to sacrifice full compliance with the IEEE
> standard, but hardware manufacturers have long played this kind of game,
> anyway - offer full compliance in some form (possibly even via software
> emulation), but relax it for various fast arithmetic operations, as needed.

Yes. That's why FP divide is so slow on Intel chips. 
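
For reference, the standard Newton-Raphson recurrences item 4 has in mind are 
y <- y*(2 - d*y) for the reciprocal and y <- y*(1.5 - 0.5*a*y*y) for the 
reciprocal square root; each iteration roughly doubles the number of correct 
bits. A C sketch of my own (the low-precision seeds below merely stand in for 
whatever estimate instruction or lookup table the hardware would provide):

    /* Rough sketch: divide and square root from multiply/add via
       Newton's method. Not IEEE-correctly-rounded; the seeds are
       placeholders for a hardware estimate or small table. */
    #include <math.h>

    double newton_div(double n, double d)
    {
        double y = (double)(1.0f / (float)d);        /* ~24-bit seed */
        for (int i = 0; i < 3; i++)
            y = y * (2.0 - d * y);                   /* y -> 1/d */
        return n * y;
    }

    double newton_sqrt(double a)
    {
        double y = (double)(1.0f / sqrtf((float)a)); /* ~24-bit seed */
        for (int i = 0; i < 3; i++)
            y = y * (1.5 - 0.5 * a * y * y);         /* y -> 1/sqrt(a) */
        return a * y;                                /* sqrt(a) = a * 1/sqrt(a) */
    }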
>
> 5) Use a smallish register set (perhaps even use a single general-purpose
> register set for both integer and floating data) and memory caches, but
> support a variety of memory prefetch operations to hide the resulting main
> memory latency insofar as possible.

I get suspicious here. Extra "working" registers enable the compiler to 
generate efficient code easily, and (given that memory busses are so much 
slower than internal pipelines) bigger caches always repay handsomely in 
terms of system performance.
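
On the prefetch point, the sort of thing item 5 means looks roughly like this 
in SSE-era C - issue a prefetch some fixed distance ahead of the element 
being worked on, so the cache miss overlaps the arithmetic. The look-ahead 
distance below is just an example value; the right one depends on memory 
latency and the cost of the loop body.

    /* Rough sketch: software prefetch to hide main-memory latency on a
       simple pass over a large double array. PREFETCH_AHEAD is an
       arbitrary example value, not a recommendation. */
    #include <xmmintrin.h>
    #include <stddef.h>

    #define PREFETCH_AHEAD 64   /* elements = 512 bytes ahead */

    void scale_pass(double *x, size_t n, double w)
    {
        for (size_t i = 0; i < n; i++) {
            if (i + PREFETCH_AHEAD < n)
                _mm_prefetch((const char *)&x[i + PREFETCH_AHEAD],
                             _MM_HINT_T0);
            x[i] *= w;  /* real work proceeds while the line is fetched */
        }
    }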

(Incidentally there may be an impact on Prime95/mprime here. The new Intel 
P4-based Celerons have 128KB L2 cache; I believe the SSE2 code is optimized 
for 256KB L2 cache, so running SSE2 on P4 Celerons may be considerably less 
efficient than it might be.)
>
> Others can surely add to this list,