MR K P SCHUPKE wrote:
That was me.  I think you're underestimating the cost of starting
threads even in this very lightweight world.


> Maybe... Perhaps Haskell could be made to resemble dataflow instructions
> more... If when a computation completes we insert the result directly
> into the data structure which represents the function, we can in fact
> pick any function for execution with _no_ overhead. (In a true dataflow
> system any instruction from any thread can be executed with no overhead.)

With architectural support for dataflow execution, this is true (though hardware support for scheduling necessarily imposes resource limitations). You might be rather interested in a couple of books about the Monsoon dataflow machine, and the compiler for the functional language (Id) used to program it---Id is similar in many ways to Haskell. The Haskell dialect pH was basically Id with Haskell syntax, and early versions used the Id back end to generate Monsoon code.


@book{102865,
  author    = {Gregory M. Papadopoulos},
  title     = {Implementation of a general-purpose dataflow multiprocessor},
  year      = {1991},
  isbn      = {0-262-66069-5},
  publisher = {MIT Press},
}

@book{103033,
  author    = {Kenneth R. Traub},
  title     = {Implementation of non-strict functional programming languages},
  year      = {1991},
  isbn      = {0-262-70042-5},
  publisher = {MIT Press},
}
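
In software, the quoted scheme (deliver each result directly into the structure representing the consuming function) amounts to something like the following. This is a minimal sketch of my own, with invented names (Node, deliver); it bears no resemblance to Monsoon's actual tag-matching pipeline. Note that even here every delivery pays for a read, a write, and a test:

import Data.IORef

-- A two-input dataflow node. The operand slots double as presence
-- bits: whichever delivery fills the second slot fires the node.
-- Single-threaded caricature only; a real implementation would
-- need the match to be atomic across CPUs.
data Node = Node
  { slots :: IORef (Maybe Int, Maybe Int)
  , fire  :: Int -> Int -> IO ()  -- compute and forward the result
  }

-- Deliver a result "directly into the data structure which
-- represents the function": fill a slot, fire if both are full.
deliver :: Node -> Either Int Int -> IO ()
deliver node operand = do
  (l, r) <- readIORef (slots node)
  let filled = case operand of
        Left  x -> (Just x, r)
        Right y -> (l, Just y)
  writeIORef (slots node) filled
  case filled of
    (Just x, Just y) -> fire node x y
    _                -> return ()

main :: IO ()
main = do
  ref <- newIORef (Nothing, Nothing)
  let node = Node ref (\x y -> print (x + y))
  deliver node (Left 3)   -- one operand present: nothing fires
  deliver node (Right 4)  -- second operand arrives: node fires, prints 7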

Interestingly, even in a general-purpose dataflow system it became obvious that pushing data to and from memory was rather inefficient. As a result, Monsoon had registers to handle loop code, and ran inner loops sequentially. A discussion of the tradeoffs between sequential and parallel loops can be found here:

@inproceedings{165203,
  author    = {Boon Seong Ang},
  title     = {Efficient implementation of sequential loops in dataflow computation},
  booktitle = {Proceedings of the conference on Functional programming languages and computer architecture},
  year      = {1993},
  isbn      = {0-89791-595-X},
  pages     = {169--178},
  location  = {Copenhagen, Denmark},
  doi       = {http://doi.acm.org/10.1145/165180.165203},
  publisher = {ACM Press},
}
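
To restate that tradeoff in present-day Haskell terms (an analogy of my own using par and pseq from the parallel package, not Monsoon code): a strict fold keeps its accumulator in a register for the entire loop, while sparking every iteration as an independent unit of work pays a scheduling cost per element that dwarfs the addition being scheduled.

import Control.Parallel (par, pseq)
import Data.List (foldl')

-- Sequential inner loop: the accumulator stays in a register,
-- roughly the way Monsoon ran inner loops sequentially.
sumSeq :: [Int] -> Int
sumSeq = foldl' (+) 0

-- Dataflow-style loop: each iteration is sparked as its own unit
-- of work. At this granularity the per-spark scheduling overhead
-- swamps the addition it schedules.
sumPar :: [Int] -> Int
sumPar []     = 0
sumPar (x:xs) = let rest = sumPar xs
                in rest `par` (x `pseq` (x + rest))

main :: IO ()
main = print (sumSeq [1 .. 1000000], sumPar [1 .. 100000])

Compile with ghc -threaded and run with +RTS -N to give the sparks somewhere to go; the sequential version still tends to win at this granularity.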


> The suggestion I guess is to only use instructions that get their
> arguments from main memory.

In the absence of architectural support for dataflow, this is the simplest option by far.


> This way any instruction can be sequenced with no overhead on any CPU.
> With modern on-die core-speed caches this can be almost as fast as
> registers (with good cache access patterns).
>
> ... Note that I am only suggesting
> interleaving instructions at the function level, so registers can be used within
> functions... of course as things get more and more parallel we may see
> hardware with no registers, just pipelined high speed cache access. (The
> hardware may well use registers to pre-fetch cache values, but that can
> be made transparent to the software)...


Alas, I think you're making a few naive assumptions here:
* All your functions must be strict in their arguments, yet your language as a whole is somehow non-strict. How?
* Tracking which computations are partially satisfied has no overhead (we'll need to match arguments to computations somehow; a sketch of this bookkeeping follows the list).
* Choosing a function which is ready to run has no overhead (we'll need a queue, a stack, or some similar data structure to manage ready instructions).
* The extra instructions do not interfere in any material way with the instruction-level parallelism that may have existed in our program (doubtful: all those scheduling instructions, operand loads, and result stores consume issue bandwidth at the very least).
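
To make the middle two bullets concrete, here is roughly the minimum bookkeeping a software-only implementation needs (names and representation invented for illustration): a matching store counting the operands each computation still lacks, plus a queue of ready work.

import qualified Data.Map.Strict as Map

type NodeId  = Int
type Pending = Map.Map NodeId Int  -- node -> operands still missing
type ReadyQ  = [NodeId]            -- queue of fireable nodes

-- Delivering one operand costs a map lookup, a decrement or delete,
-- and possibly a queue push: all overhead, no useful work done yet.
deliver :: NodeId -> (Pending, ReadyQ) -> (Pending, ReadyQ)
deliver n (pending, ready) =
  case Map.lookup n pending of
    Just 1  -> (Map.delete n pending, ready ++ [n])   -- last operand arrived
    Just k  -> (Map.insert n (k - 1) pending, ready)  -- still waiting
    Nothing -> (pending, ready)                       -- already fired

main :: IO ()
main = do
  -- Node 1 needs two operands; node 2 needs one.
  let pending0 = Map.fromList [(1, 2), (2, 1)]
      (p1, r1) = deliver 1 (pending0, [])
      (p2, r2) = deliver 1 (p1, r1)  -- second delivery makes node 1 ready
  print (Map.toList p2, r2)          -- ([(2,1)],[1])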


There's been quite a bit of work on splitting non-strict code up into strict threads which can be run in much the way I think you're imagining. The following paper describes some of the basics:
@inproceedings{141568,
  author    = {Kenneth R. Traub and David E. Culler and Klaus E. Schauser},
  title     = {Global analysis for partitioning non-strict programs into sequential threads},
  booktitle = {Proceedings of the 1992 ACM conference on LISP and functional programming},
  year      = {1992},
  isbn      = {0-89791-481-3},
  pages     = {324--334},
  location  = {San Francisco, California, United States},
  doi       = {http://doi.acm.org/10.1145/141471.141568},
  publisher = {ACM Press},
}
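
A hand-rolled caricature of the idea (mine, not the algorithm from the paper): take f x y = (x + 1, y * 2). A caller may demand the first component before y even exists, so the body cannot simply wait for both arguments; partitioning splits it into two sequential threads, each strict in exactly the one input it consumes.

import Control.Concurrent (forkIO)
import Control.Concurrent.MVar

-- Two strict "threads" carved out of the non-strict body of
--   f x y = (x + 1, y * 2)
-- Each thread blocks until the single operand it is strict in
-- arrives, then runs sequentially to completion.
main :: IO ()
main = do
  xIn  <- newEmptyMVar; yIn  <- newEmptyMVar
  out1 <- newEmptyMVar; out2 <- newEmptyMVar
  _ <- forkIO $ takeMVar xIn >>= \x -> putMVar out1 (x + 1)
  _ <- forkIO $ takeMVar yIn >>= \y -> putMVar out2 (y * 2)
  putMVar xIn (41 :: Int)
  takeMVar out1 >>= print  -- 42 prints before y is ever supplied:
                           -- non-strictness survives at the boundary
  putMVar yIn (10 :: Int)
  takeMVar out2 >>= print  -- 20

The hard part, which the paper attacks with global analysis, is doing this partitioning automatically, with as few and as long threads as possible.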


The real killer, of course, is memory latency. The cache resources required to hold pending work can't be used to hold heap data and the like instead. You're going to increase your cache miss rate as well. The result? A drop in performance.

A multi-threaded processor can help mask cache-miss latency (while increasing the miss rate of any individual thread). But it is still constrained by the fundamental bandwidth limits of the load/store units. You will, on the whole, still be doing substantially worse than an ordinary program.

Why bother at all? The same reason we bother with laziness (which has essentially the same set of problems in different clothing). It's easy to write clean, beautiful programs. And I do still hold out hope that we can find enough parallelism to make the idea of parallel execution realistic.
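
(For comparison, here is laziness in miniature, a sketch of my own rather than GHC's actual closure representation: every use of a lazy value pays an "is it computed yet?" test and a possible update, which is the presence-bit dance all over again.)

import Data.IORef

-- A thunk is either a cached value or a suspended computation.
data Thunk a = Done a | Todo (IO a)

-- Forcing tests the tag, runs the suspension if needed, and
-- overwrites it with the result so later uses are cheap.
force :: IORef (Thunk a) -> IO a
force ref = do
  t <- readIORef ref
  case t of
    Done v   -> return v
    Todo act -> do v <- act
                   writeIORef ref (Done v)
                   return v

main :: IO ()
main = do
  r <- newIORef (Todo (putStrLn "computing..." >> return (6 * 7)))
  force r >>= print  -- runs the suspension: prints "computing..." then 42
  force r >>= print  -- cached: prints only 42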

> Hardware manufacturers have hit the limit for pure sequential execution
> speed, so more parallelism is the only way forward (see Intel's revised
> roadmap: they abandoned the Pentium 4 and 5 and have focused on an updated
> _low_power_ Pentium 3M, and are planning multi-core versions for more speed).
>
> C and other imperative languages focus too much on the how and not the
> what to be easy to use in such multi-CPU environments... A language which
> abstracts and hides the parallelism could well take off in a big way.

It's a laudable goal that's remained just out of reach for a very long time. As you can probably tell, I've thought a lot about it. I think there's still a great deal to be learned. But I'd encourage you---and anyone thinking about the problem---to understand the problems that have stymied some very smart people.


> Keean.

-Jan-Willem Maessen
