-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On May 4, 2007, at 2:58 PM, Dave Eckhardt wrote:
> One possible goal might be a language in which you could
> describe high-level algorithms of a certain class which
> could then be compiled to run well on a Cell (and, to be
> a cool result, on some other thing).  This would probably
> handle not just computation but also the necessary DMA
> to get the data ready.

FWIW, the C++ template goop that I use in my SPU code is all about masking the data movements - you don't want virtual-function-call overhead in cache-lookup functions, nor do you want a different version of the code for each data type you want to transfer. There is a relatively limited number of buffer usage patterns. In approximate best-to-worst performance order these are: double-buffered input and output, block-random access, struct-sized random access, and general pointer-chasing. I could easily wrap a small language around these operations (and have in the past - it's just more convenient right now to let GCC maintain it for me).
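To make the double-buffering pattern concrete, here's a minimal sketch in plain C. The `dma_get` function is a hypothetical stand-in for an asynchronous transfer (mfc_get on a real SPU, where you'd also tag and fence the transfers); here it's a memcpy so the shape of the pattern runs anywhere:

```c
#include <assert.h>
#include <string.h>

#define CHUNK 64
#define NCHUNKS 8

/* Stand-in for an async DMA get (mfc_get on a real SPU);
 * a plain memcpy here so the sketch runs anywhere. */
static void dma_get(float *dst, const float *src, int n) {
    memcpy(dst, src, n * sizeof(float));
}

/* Double-buffered input: fetch chunk i+1 while processing chunk i,
 * so the transfer latency hides behind the compute. */
static float process_stream(const float *src, int nchunks) {
    float buf[2][CHUNK];
    float sum = 0.0f;
    int cur = 0;

    dma_get(buf[cur], src, CHUNK);           /* prime the pipeline */
    for (int i = 0; i < nchunks; i++) {
        int next = cur ^ 1;
        if (i + 1 < nchunks)                 /* kick off the next transfer */
            dma_get(buf[next], src + (i + 1) * CHUNK, CHUNK);
        for (int j = 0; j < CHUNK; j++)      /* work on the current buffer */
            sum += buf[cur][j];
        cur = next;                          /* swap buffers */
    }
    return sum;
}
```

On real hardware the processing loop would wait on the DMA tag for `buf[cur]` before touching it; the template machinery's job is to generate exactly this scaffolding per element type without a function-call in the inner loop.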

> Failing that, it seems like what people will be doing for
> a while is writing code carefully tuned to run well on
> exactly one or two particular models of Cell, which seems
> to me likely to look like carefully optimized "inner loop"
> stuff wrapped by glue code which matters less.

Only partly true; the SPU architecture defines the latencies and stalls of the various instructions fairly well. Given my experience optimizing SPU code, the 40:1 to 100:1 improvements from data restructuring and selective SIMD conversions are worth doing, while the per-cycle stall management isn't - there might be another factor of 2, or there might not - it's a difficult space for a small reward.
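The "data restructuring" that buys those big factors is mostly layout work, not instruction tuning. A sketch of the classic case, with hypothetical names - converting an array-of-structs to a struct-of-arrays so that a quadword load picks up four like fields at once:

```c
#include <assert.h>

#define N 1024

/* Array-of-structs: each particle's fields interleaved, so a SIMD
 * load of four consecutive x values needs shuffles. */
struct particle { float x, y, z, pad; };

/* Struct-of-arrays: each field contiguous, so one 128-bit load on
 * the SPU (or any SIMD unit) grabs four x's at once. */
struct particles_soa { float x[N], y[N], z[N]; };

static void aos_to_soa(const struct particle *in,
                       struct particles_soa *out, int n) {
    for (int i = 0; i < n; i++) {
        out->x[i] = in[i].x;
        out->y[i] = in[i].y;
        out->z[i] = in[i].z;
    }
}
```

The restructuring is cheap to write and pays off on every kernel that touches the data; per-cycle stall scheduling has to be redone per loop, which is why the reward-to-effort ratio is so different.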

> I have to
> wonder whether it would be less painful to learn the hardware
> and write the optimized code in assembly language or to learn
> the hardware *and* learn how to cajole a complicated compiler
> into emitting the assembly language you know it should be
> emitting.

Doing the streaming/caching/DMA code in assembly is a non-starter - it's just that one increment too complicated. Fortunately, IBM defined the C language extensions as part of the SPU architecture, which means it's not too hard to learn to use. The restrict keyword does gall me, though.
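For readers who haven't met it: restrict (C99) is the programmer's promise that pointers don't alias, which is what lets the compiler hoist loads and software-pipeline a loop instead of reloading after every store. A minimal example:

```c
#include <assert.h>

/* Without restrict the compiler must assume dst and src may overlap,
 * forcing a reload of src[i] after every store to dst[i].  With it,
 * loads can be hoisted and the loop pipelined - which is exactly
 * what an in-order machine like the SPU needs. */
static void saxpy(float * restrict dst, const float * restrict src,
                  float a, int n) {
    for (int i = 0; i < n; i++)
        dst[i] += a * src[i];
}
```

The galling part is that the burden of proof is inverted: the compiler trusts the annotation completely, and an aliasing call site is silent undefined behavior rather than a diagnosable error.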

> With respect to kencc, I wonder how far you could get if
> each Cell vector instruction were a C-callable .s function
> of a few instructions and the SPU linker routinely inlined
> all small-instruction-count functions and had an optimizer
> explicitly designed for the SPU.

I think this could work quite well; I'm not sure how this interferes with register allocation though. I'll give it some thought. The harder part will be the data movement operations from my first paragraph.
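The shape of that scheme, sketched portably - on a real SPU each wrapper would be a one-instruction .s file (or an spu_add/spu_madd intrinsic); here a four-lane struct stands in for a 128-bit register, and all names are illustrative:

```c
#include <assert.h>

/* Portable stand-in for a 128-bit SPU register: four float lanes. */
typedef struct { float f[4]; } vec4;

/* Each "instruction" is a tiny C-callable function - the kind of
 * few-instruction body a linker could inline wholesale. */
static vec4 vadd(vec4 a, vec4 b) {
    vec4 r;
    for (int i = 0; i < 4; i++) r.f[i] = a.f[i] + b.f[i];
    return r;
}

static vec4 vmul(vec4 a, vec4 b) {
    vec4 r;
    for (int i = 0; i < 4; i++) r.f[i] = a.f[i] * b.f[i];
    return r;
}

/* Composed kernel: after linker inlining this is just a straight
 * run of vector ops with no call overhead - but the register
 * allocator now has to do well across the merged bodies, which is
 * the open question raised above. */
static vec4 madd(vec4 a, vec4 b, vec4 c) {
    return vadd(vmul(a, b), c);
}
```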

Paul


Dave Eckhardt

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.3 (Darwin)

iD8DBQFGO7M1pJeHo/Fbu1wRAmFvAKDUlDdofVlXv30Lcf3xYPHN6ubX4QCfclYB
te5F+PL5KW2BiF+CvXzyuDQ=
=HTyI
-----END PGP SIGNATURE-----
