-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On May 4, 2007, at 2:58 PM, Dave Eckhardt wrote:
> One possible goal might be a language in which you could
> describe high-level algorithms of a certain class which
> could then be compiled to run well on a Cell (and, to be
> a cool result, on some other thing).  This would probably
> handle not just computation but also the necessary DMA
> to get the data ready.

FWIW, the C++ template goop that I use in my SPU code is all about masking the data movements - you don't want virtual-function-call overhead in cache-lookup functions, nor do you want a different version of the code for each data type you want to transfer. There is a relatively limited number of buffer usage patterns. In approximate best-to-worst performance order these are: double-buffered input and output, block-random access, struct-sized random access, and general pointer-chasing. I could easily wrap a small language around these operations (and have in the past - it's just more convenient right now to let GCC maintain it for me).
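To make the double-buffering pattern concrete, here's a minimal sketch in plain C. The `dma_get` function is a hypothetical stand-in for an asynchronous transfer (mfc_get on a real SPU, where you'd also tag and fence the transfers); here it's a memcpy so the shape of the pattern runs anywhere:

```c
#include <assert.h>
#include <string.h>

#define CHUNK 64
#define NCHUNKS 8

/* Stand-in for an async DMA get (mfc_get on a real SPU);
 * a plain memcpy here so the sketch runs anywhere. */
static void dma_get(float *dst, const float *src, int n) {
    memcpy(dst, src, n * sizeof(float));
}

/* Double-buffered input: fetch chunk i+1 while processing chunk i,
 * so the transfer latency hides behind the compute. */
static float process_stream(const float *src, int nchunks) {
    float buf[2][CHUNK];
    float sum = 0.0f;
    int cur = 0;

    dma_get(buf[cur], src, CHUNK);           /* prime the pipeline */
    for (int i = 0; i < nchunks; i++) {
        int next = cur ^ 1;
        if (i + 1 < nchunks)                 /* kick off the next transfer */
            dma_get(buf[next], src + (i + 1) * CHUNK, CHUNK);
        for (int j = 0; j < CHUNK; j++)      /* work on the current buffer */
            sum += buf[cur][j];
        cur = next;                          /* swap buffers */
    }
    return sum;
}
```

On real hardware the processing loop would wait on the DMA tag for `buf[cur]` before touching it; the template machinery's job is to generate exactly this scaffolding per element type without a function-call in the inner loop.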

> Failing that, it seems like what people will be doing for
> a while is writing code carefully tuned to run well on
> exactly one or two particular models of Cell, which seems
> to me likely to look like carefully optimized "inner loop"
> stuff wrapped by glue code which matters less.

Only partly true; the SPU architecture defines the latencies and stalls of the various instructions fairly well. Given my experience optimizing SPU code, the 40:1 to 100:1 improvements from data restructuring and selective SIMD conversions are worth doing, while the per-cycle stall management isn't - there might be another factor of 2, or there might not - it's a difficult space for a small reward.
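The "data restructuring" that buys those big factors is mostly layout work, not instruction tuning. A sketch of the classic case, with hypothetical names - converting an array-of-structs to a struct-of-arrays so that a quadword load picks up four like fields at once:

```c
#include <assert.h>

#define N 1024

/* Array-of-structs: each particle's fields interleaved, so a SIMD
 * load of four consecutive x values needs shuffles. */
struct particle { float x, y, z, pad; };

/* Struct-of-arrays: each field contiguous, so one 128-bit load on
 * the SPU (or any SIMD unit) grabs four x's at once. */
struct particles_soa { float x[N], y[N], z[N]; };

static void aos_to_soa(const struct particle *in,
                       struct particles_soa *out, int n) {
    for (int i = 0; i < n; i++) {
        out->x[i] = in[i].x;
        out->y[i] = in[i].y;
        out->z[i] = in[i].z;
    }
}
```

The restructuring is cheap to write and pays off on every kernel that touches the data; per-cycle stall scheduling has to be redone per loop, which is why the reward-to-effort ratio is so different.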

> I have to
> wonder whether it would be less painful to learn the hardware
> and write the optimized code in assembly language or to learn
> the hardware *and* learn how to cajole a complicated compiler
> into emitting the assembly language you know it should be
> emitting.

Doing the streaming/caching/DMA code in assembly is a non-starter - it's just that one increment too complicated. Fortunately, IBM defined the C language extensions as part of the SPU architecture, which means it's not too hard to learn to use. The restrict keyword does gall me, though.
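For readers who haven't met it: restrict (C99) is the programmer's promise that pointers don't alias, which is what lets the compiler hoist loads and software-pipeline a loop instead of reloading after every store. A minimal example:

```c
#include <assert.h>

/* Without restrict the compiler must assume dst and src may overlap,
 * forcing a reload of src[i] after every store to dst[i].  With it,
 * loads can be hoisted and the loop pipelined - which is exactly
 * what an in-order machine like the SPU needs. */
static void saxpy(float * restrict dst, const float * restrict src,
                  float a, int n) {
    for (int i = 0; i < n; i++)
        dst[i] += a * src[i];
}
```

The galling part is that the burden of proof is inverted: the compiler trusts the annotation completely, and an aliasing call site is silent undefined behavior rather than a diagnosable error.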

> With respect to kencc, I wonder how far you could get if
> each Cell vector instruction were a C-callable .s function
> of a few instructions and the SPU linker routinely inlined
> all small-instruction-count functions and had an optimizer
> explicitly designed for the SPU.

I think this could work quite well; I'm not sure how this interferes with register allocation though. I'll give it some thought. The harder part will be the data movement operations from my first paragraph.
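The shape of that scheme, sketched portably - on a real SPU each wrapper would be a one-instruction .s file (or an spu_add/spu_madd intrinsic); here a four-lane struct stands in for a 128-bit register, and all names are illustrative:

```c
#include <assert.h>

/* Portable stand-in for a 128-bit SPU register: four float lanes. */
typedef struct { float f[4]; } vec4;

/* Each "instruction" is a tiny C-callable function - the kind of
 * few-instruction body a linker could inline wholesale. */
static vec4 vadd(vec4 a, vec4 b) {
    vec4 r;
    for (int i = 0; i < 4; i++) r.f[i] = a.f[i] + b.f[i];
    return r;
}

static vec4 vmul(vec4 a, vec4 b) {
    vec4 r;
    for (int i = 0; i < 4; i++) r.f[i] = a.f[i] * b.f[i];
    return r;
}

/* Composed kernel: after linker inlining this is just a straight
 * run of vector ops with no call overhead - but the register
 * allocator now has to do well across the merged bodies, which is
 * the open question raised above. */
static vec4 madd(vec4 a, vec4 b, vec4 c) {
    return vadd(vmul(a, b), c);
}
```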

Paul


Dave Eckhardt

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.3 (Darwin)

iD8DBQFGO7M1pJeHo/Fbu1wRAmFvAKDUlDdofVlXv30Lcf3xYPHN6ubX4QCfclYB
te5F+PL5KW2BiF+CvXzyuDQ=
=HTyI
-----END PGP SIGNATURE-----
