On Tuesday 21 May 2002 16:21, [EMAIL PROTECTED] wrote:
> http://www.cnn.com/2002/TECH/industry/05/21/supercomputing.future.idg/index.html
>
> The theme of reducing transistor count without sacrificing much performance
> is an interesting one.
This is indeed interesting. The problem is that the sustained floating-point 
performance of Transmeta chips is at best only similar to that of PIII or 
Athlon chips when scaled by the power consumption of the chip itself. For our 
purposes, the transistor-heavy SSE2 unit on the P4 gives a much larger 
performance improvement than the resulting increase in power consumption. Add 
on the power consumed by support devices (chipset, memory etc.) and the 
Transmeta design doesn't look too effective. In a situation where _on 
average_ only around 10% of the actual peak performance is required, the 
Transmeta design has a considerable advantage, due to its capability to idle 
on very low current.

From the ecological point of view, one could easily make gains by using the 
"waste" heat from processors - the ability to "pipe" it into building heating 
systems (at least into the "hot" side of a heat pump) would be more useful 
than dispersing it locally as hot air.

> Some obvious possibilities I can think of, related to the way typical CPUs
> do hardware arithmetic:
>
> 1) For floating-point add, have a dedicated unit to do exponent extraction
> and mantissa shift count generation, then do the actual add in the integer
> adder (there are often multiple integer adders on the chip, so this need
> not stall genuine integer adds that are occurring at the same time).
>
> 2) Similarly, use the integer multiplier to multiply the 2 floating-point
> mantissas in an FMUL together. For IEEE doubles this would need to generate
> the upper 53 bits of a 106-bit product, but many CPUs can already generate
> a full 128-bit integer product (this is useful for cryptographic
> applications, among other things), so one could use the same hardware for
> both.

Most of the transistors in a typical modern CPU die are associated with cache 
and multiple parallel execution units. Cutting out one execution unit - one 
of the simpler ones at that - probably wouldn't save much power.
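For concreteness, here's a rough Python sketch of the data flow in point 1 - 
extract the exponents, generate a shift count to align the mantissas, then do 
the actual add in integer arithmetic. This is only an illustration: it handles 
positive normal numbers only, truncates instead of rounding, and ignores 
guard/sticky bits, so it is nothing like an IEEE-compliant adder:

```python
import struct

def decompose(x):
    """Split an IEEE-754 double into (sign, biased exponent, 53-bit
    mantissa with the implicit leading 1 restored)."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    sign = bits >> 63
    exp = (bits >> 52) & 0x7FF
    mant = (bits & ((1 << 52) - 1)) | (1 << 52)  # normals only
    return sign, exp, mant

def fp_add_positive(x, y):
    """Add two positive normal doubles using only integer shifts and
    an integer add after exponent extraction. No rounding, no sign or
    subnormal handling - a sketch of the data path, not a real adder."""
    _, ex, mx = decompose(x)
    _, ey, my = decompose(y)
    if ex < ey:                        # make x the larger-exponent operand
        ex, ey, mx, my = ey, ex, my, mx
    my >>= (ex - ey)                   # the shifter's job (bits truncated)
    m = mx + my                        # the integer adder's job
    if m >> 53:                        # renormalize: at most one bit shift
        m >>= 1
        ex += 1
    bits = (ex << 52) | (m & ((1 << 52) - 1))
    return struct.unpack("<d", struct.pack("<Q", bits))[0]
```

On operands whose mantissa bits survive the alignment shift the result is 
exact, e.g. `fp_add_positive(1.5, 2.25)` gives `3.75`.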
> 3) Have the compiler look for operations that can be streamlined at the
> hardware level. For example, a very common operation sequence in doing
> Fourier (and other) transforms is the pair
>
> a = x + y
> b = x - y .
>
> If these are floating-point operands, one would need to do the exponent
> extract and mantissa shift of x and y just once, and then do an integer
> add and subtract on the aligned mantissa pair. It might even be possible
> to do a pairwise integer add/sub cheaper than 2 independent operations
> at the hardware level (for instance, the 2 operands need only be loaded
> once, if the hardware permits multiple operations without intervening loads
> and stores - and yes, this does run counter to the typical load/store RISC
> paradigm).

This is a Good Idea and would cost very little extra silicon. There is, 
however, a synchronization problem resulting from always doing add & subtract 
in parallel (and discarding the extra result when only one is needed). This 
is because (when the operands have the same sign) renormalizing after 
addition requires at most one bit shift in the mantissa, whereas after 
subtraction a large number of bit shifts may be required; indeed we may even 
end up with a zero result. Fixing this synchronization problem requires 
either extra silicon in the execution unit or a more complex pipeline.

> 4) Emulate complex functions, rather than adding hardware to support them.
> For instance, square root and divide can both be done using just multiply
> and add by way of a Newtonian-style iterative procedure. The downside is
> that this generally requires one to sacrifice full compliance with the IEEE
> standard, but hardware manufacturers have long played this kind of game,
> anyway - offer full compliance in some form (possibly even via software
> emulation), but relax it for various fast arithmetic operations, as needed.

Yes. That's why FP divide is so slow on Intel chips.
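The Newtonian iteration in point 4 fits in a couple of lines. A minimal 
Python sketch for the reciprocal (divide is then just a multiply by it): the 
recurrence x_{n+1} = x_n * (2 - a * x_n) uses only multiply and add/subtract, 
and roughly doubles the number of correct bits per step given a reasonable 
seed. The seed values and iteration count below are illustrative assumptions, 
not what any particular chip uses:

```python
def recip_newton(a, x0, iters=6):
    """Approximate 1/a using only multiply and add/subtract:
    x_{n+1} = x_n * (2 - a * x_n).
    The error e_n = 1 - a*x_n satisfies e_{n+1} = e_n**2, so each
    iteration doubles the number of correct bits, provided the seed
    x0 is close enough that |e_0| < 1."""
    x = x0
    for _ in range(iters):
        x = x * (2.0 - a * x)
    return x
```

With a seed accurate to about one decimal digit, six iterations are already 
far more than needed to converge to full double precision; real hardware/ 
microcode schemes get the seed from a small lookup table so that only a few 
iterations are required. The last iteration is also where IEEE compliance is 
usually lost, since correct rounding of the final result needs extra care.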
> 5) Use a smallish register set (perhaps even use a single general-purpose
> register set for both integer and floating data) and memory caches, but
> support a variety of memory prefetch operations to hide the resulting
> main memory latency insofar as possible.

I get suspicious here. Extra "working" registers enable the compiler to 
generate efficient code easily, and (given that memory busses are so much 
slower than internal pipelines) bigger caches always repay handsomely in 
terms of system performance.

(Incidentally, there may be an impact on Prime95/mprime here. The new Intel 
P4-based Celerons have 128KB of L2 cache; I believe the SSE2 code is 
optimized for 256KB of L2 cache, so running SSE2 on P4 Celerons may be 
considerably less efficient than it might be.)

> Others can surely add to this list, and extend it to non-arithmetic
> functionality.

A few weeks ago there was an argument based on making graphics chips 
programmable. This would enable massive parallelization of SSE-type code. Of 
course we really need double precision, but it's an interesting idea.

Hardware extended-precision float (real*16) would also be of considerable 
assistance to us; unfortunately it's unlikely to appear in "consumer volume" 
(read: affordable) chipsets, as it's more or less irrelevant to many tasks.

Bit-sliced integer operations would enable savings in memory traffic and 
also allow some low-precision integer operations to be vectorized much more 
efficiently. However, most compilers would need to become considerably more 
complex to allow full use of this facility.

Regards
Brian Beesley
_________________________________________________________________________
Unsubscribe & list info -- http://www.ndatech.com/mersenne/signup.htm
Mersenne Prime FAQ -- http://www.tasam.com/~lrwiman/FAQ-mers
