http://www.cs.berkeley.edu/~samw/projects/cell/CF06.pdf
Snippet below:
9.3 Double Precision FFT Performance
When DP is employed, the balance between memory and
computation is changed by a factor of 7. This pushes a
slightly memory bound application strongly into the computationally
bound domain. The SP simultaneous FFT is
10 times faster than the DP version. On the upside, the
transposes required in the 2D FFT are now less than 20% of
the total time, compared with 50% for the SP case. Cellpm+
finds a middle ground between the 4x reduction in computational
throughput and the 2x increase in memory traffic,
increasing performance by almost 2.5x compared with the
Cell for all problem sizes.

[Figure: Performance of 1D and 2D FFT in DP (top)
and SP (bottom), in Gflop/s. For large FFTs, Cell is more than
10 times faster in SP than either the Opteron or
Itanium2. The Gflop/s number is calculated based
on a naive radix-2 FFT algorithm. For 2D FFTs the
naive algorithm computes 2N N-point FFTs.]
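
The figure caption says the Gflop/s numbers come from a naive radix-2 operation count, with a 2D FFT counted as 2N N-point FFTs. A minimal sketch of that bookkeeping follows. It assumes the common 5*N*log2(N) flop convention for a complex radix-2 FFT; the excerpt does not state the constant the authors used, and the function names and the 10 ms timing are made up for illustration.

```python
import math

def fft_flops_1d(n: int) -> float:
    """Nominal flops for an n-point complex radix-2 FFT.

    Assumes the common 5 * n * log2(n) convention; the quoted
    paper excerpt does not spell out its constant.
    """
    return 5.0 * n * math.log2(n)

def fft_flops_2d(n: int) -> float:
    """Nominal flops for an n x n 2D FFT counted as 2n n-point 1D FFTs."""
    return 2 * n * fft_flops_1d(n)

def gflops(flops: float, seconds: float) -> float:
    """Convert a nominal flop count and a measured time into Gflop/s."""
    return flops / seconds / 1e9

if __name__ == "__main__":
    n = 1024
    print(fft_flops_1d(n))   # 51200.0
    print(fft_flops_2d(n))   # 104857600.0
    # Hypothetical timing: a 1024 x 1024 2D FFT finishing in 10 ms.
    print(gflops(fft_flops_2d(n), 0.010))
```

Because the count is nominal, a tuned implementation that does fewer real operations still gets credited with the same flops, which is why rates computed this way are comparable across machines.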
9.4 Performance Comparison
The peak Cell FFT performance is compared to a number
of other processors in Table 6. These results are conservative
given the naive 1D FFT implementation we used
on Cell whereas the other systems in the comparison used
highly tuned FFTW [7] or vendor-tuned FFT implementations
[25]. Nonetheless, in DP, Cellpm is at least 12x faster
than the Itanium2 for a 1D FFT, and Cellpm+ could be as
much as 30x faster for a large 2D FFT. Cell+ more than
doubles the DP FFT performance of Cell for all problem
sizes. Cell performance is nearly at parity with the X1E in
double precision; however, we believe considerable headroom
remains for more sophisticated Cell FFT implementations.
In single precision, Cell is unparalleled.
Note that FFT performance on Cell improves as the number
of points increases, so long as the points fit within the
local store. In comparison, the performance on cache-based
machines typically reaches its peak at a problem size that is far
smaller than the on-chip cache size, and then drops precipitously
once the associativity of the cache is exhausted and
cache lines start getting evicted due to aliasing. Elimination
of cache evictions requires extensive algorithmic changes for
the power-of-two problem sizes required by the FFT algorithm,
but such evictions will not occur on Cell's software-managed
local store. Furthermore, we believe that even for
problems that are larger than local store, 1D FFTs will continue
to scale much better on Cell than typical cache-based
superscalar processors with set-associative caches since local
store provides all of the benefits of a fully associative cache.
The FFT performance clearly underscores the advantages
of software-controlled three-level memory architecture over
conventional cache-based architectures.

(Footnote: X1E FFT numbers provided by Cray's Bracy Elton and
Adrian Tate.)
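
The aliasing the paper describes can be shown with a toy model of how a set-associative cache picks a set from an address. This is only a sketch: the 64-byte line and 512 sets are hypothetical numbers chosen for illustration, not the geometry of the Opteron, Itanium2, or any other machine in the comparison.

```python
def cache_set(addr: int, line_bytes: int = 64, num_sets: int = 512) -> int:
    """Set index for a byte address in a toy set-associative cache.

    line_bytes and num_sets are hypothetical illustration values.
    """
    return (addr // line_bytes) % num_sets

def distinct_sets(stride_bytes: int, count: int,
                  line_bytes: int = 64, num_sets: int = 512) -> int:
    """How many different sets `count` accesses at a fixed stride touch."""
    return len({cache_set(i * stride_bytes, line_bytes, num_sets)
                for i in range(count)})

if __name__ == "__main__":
    # A power-of-two stride equal to line_bytes * num_sets (32 KB here)
    # sends every access to the same set, so only `associativity`
    # lines can stay resident at once.
    print(distinct_sets(32 * 1024, 16))       # 1
    # Padding the stride by one cache line spreads the accesses out.
    print(distinct_sets(32 * 1024 + 64, 16))  # 16
```

The power-of-two strides of an FFT are the bad case in this model, while a local store addressed explicitly by software has no set mapping to collide in.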
From: John R Pierce <[EMAIL PROTECTED]>
Reply-To: The Great Internet Mersenne Prime Search list
<[email protected]>
To: The Great Internet Mersenne Prime Search list <[email protected]>
Subject: Re: [Prime] Cell processor
Date: Sun, 24 Sep 2006 23:47:50 -0700
Mikus Grinbergs wrote:
>I have not even looked at the specifications for the cell processor,
>but from general reading I have the impression that what it is really
>fast at is computing with 32-bit floating_point numbers. Doesn't
>GIMPS need to use larger numbers than that? [I believe computation with
>longer floating_point numbers is much slower on today's cell processor.]
>
>
Cell is a fast, full-featured 64-bit PowerPC core with 8 satellite 'DSP'
type processors on the same chip. Programming it is apparently
quite a trip: since the two have considerably different instruction set
architectures, IBM and Sony have had to come up with some novel
compilation tools to let you combine modules for the two different
processors into one executable.
http://www-306.ibm.com/chips/techlib/techlib.nsf/products/Cell_Broadband_Engine
The DSP things (they call them SPEs) operate on 128-bit chunks, which
are two double floats, or two 64-bit ints, or 4 single floats or 4 32-bit
ints or 8 16-bit... or 16 8-bit... Each SPE has 128 128-bit registers
and 256 KB of local memory, which is used for code and data. The ONLY
access the SPEs have to main memory is via a DMA engine, which provides
a globally coherent view of memory and utilizes the host processor's
MMU. The SPEs were totally designed for doing stuff like FFTs.
Each SPE can dispatch up to 2 instructions per cycle to 7 execution
units, with some tricky even/odd pipeline stuff.
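
To put rough numbers on the above, here is a back-of-the-envelope sketch of how many elements fit in one 128-bit register, and how large a power-of-two complex FFT could sit entirely in a 256 KB local store. The 64 KB set aside for code and buffers is a pure guess for illustration, and the estimate assumes an in-place transform with no separate twiddle table; real SPE code may budget memory quite differently.

```python
REGISTER_BITS = 128              # SPE register width, from the text above
LOCAL_STORE_BYTES = 256 * 1024   # SPE local memory, from the text above

def lanes(element_bits: int) -> int:
    """Number of elements of the given width in one 128-bit register."""
    return REGISTER_BITS // element_bits

def max_fft_points(bytes_per_real: int,
                   reserved_bytes: int = 64 * 1024) -> int:
    """Largest power-of-two complex point count that fits in local store.

    reserved_bytes (code, stack, DMA buffers) is a hypothetical figure.
    Assumes an in-place FFT: one real and one imaginary value per point.
    """
    available = LOCAL_STORE_BYTES - reserved_bytes
    points = available // (2 * bytes_per_real)
    # Round down to a power of two.
    return 1 << (points.bit_length() - 1)

if __name__ == "__main__":
    print(lanes(64), lanes(32), lanes(16), lanes(8))  # 2 4 8 16
    print(max_fft_points(4))  # single precision: 16384
    print(max_fft_points(8))  # double precision: 8192
```

Under these assumptions, going from single to double precision halves both the lanes per register and the largest FFT that stays resident in local memory.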
You can now get a Cell in a blade module that can be installed in
an IBM BladeCenter... looks like they put two 3.2 GHz Cells into a
double-wide blade, so you can put 7 of these dual Cells into a 7U chassis (and
probably 4 kW of power :D). They have Fedora Core 5 based Linux running
on the Power core, which supports SPE programming.
_______________________________________________
Prime mailing list
[email protected]
http://hogranch.com/mailman/listinfo/prime