On Oct 27, 2006, at 11:36 AM, Michael L Torrie wrote:

Besides all this, computing is evolving to be distributed nowadays, with
a non-unified memory architecture.  Nodes do not share memory; they
communicate with a protocol.  There's a reason why in super-computing
MPI and other message-passing protocol schemes are king.  Threads
obviously don't make sense in any kind of distributed architecture. Now
I believe that OSes and computing systems will be designed to hide this
fact from the programs, allowing normal programs to be spread
dynamically across nodes.  Maybe through some system that emulates
shared memory and local devices (mapping remote ones). Even in a system
that emulates shared memory (say by swapping pages of memory across
nodes), your threads may think they are accessing memory directly rather
than copying it, but they are not.  Besides that, I think it's probably a bad
idea to code with any particular assumptions about the underlying
machine architecture (vm or not).

There have been efforts to build distributed shared memory systems, but I think they are fundamentally misguided. Even with today's high-speed, low-latency interconnect fabrics, remote memory access is still so much slower than local access that hiding it behind an abstraction layer is counterproductive. To predict the performance of your system, you still need to know exactly when an access is local and when it is remote. Considering that the whole point of these systems is high performance, abstracting away an important factor in that performance is not particularly wise.
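To make that concrete, here's a sketch of the failure mode. This is plain, purely local C; the comments describe what a hypothetical page-based DSM layer would be doing underneath, and the array size is arbitrary:

/* dsm_pitfall.c: why hiding remote access is a problem.
 * Plain local C; the comments describe what a page-based
 * distributed shared memory (DSM) layer would do underneath. */
#include <stdio.h>
#include <stdlib.h>

#define N (1 << 20)

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double sum = 0.0;

    if (a == NULL)
        return 1;
    for (long i = 0; i < N; i++)
        a[i] = (double)i;

    /* Under a DSM system this loop reads as N identical loads.
     * In reality, every time i crosses onto a page currently owned
     * by another node, the load stalls for a network round trip that
     * is orders of magnitude slower than a local access, and nothing
     * in the source tells you where that happens. */
    for (long i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %f\n", sum);
    free(a);
    return 0;
}

The source gives you no way to reason about which iterations are cheap and which are catastrophically expensive; that information lives entirely in the runtime's page placement.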

This is especially true the less tightly connected your compute nodes get. A multi-processor machine with a HyperTransport bus can probably get away with hiding the distinction between local and remote memory access. In a multi-node cluster connected by an InfiniBand fabric, the latency difference between local and remote access becomes significant, but one can typically assume fairly low latency and fairly high reliability and bandwidth. A cluster built on gigabit Ethernet moves to higher latency and lower bandwidth, and a grid system whose nodes span multiple networks makes treating remote operations like local ones downright insane.
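For a sense of scale, here are the rough, order-of-magnitude latencies I have in mind (these figures are my own ballpark assumptions for illustration, not measurements):

/* latency_ballpark.c: rough order-of-magnitude latencies.
 * The numbers are ballpark assumptions for illustration only. */
#include <stdio.h>

int main(void)
{
    struct { const char *where; double ns; } hop[] = {
        { "local DRAM access",                    100.0 },  /* ~100 ns     */
        { "remote NUMA node over HyperTransport", 300.0 },  /* few 100 ns  */
        { "InfiniBand round trip",                5e3   },  /* a few us    */
        { "gigabit Ethernet round trip",          100e3 },  /* ~100 us     */
        { "wide-area grid hop",                   50e6  },  /* tens of ms  */
    };
    double local = hop[0].ns;

    for (unsigned i = 0; i < sizeof hop / sizeof hop[0]; i++)
        printf("%-38s ~%10.0f ns  (about %.0fx local)\n",
               hop[i].where, hop[i].ns, hop[i].ns / local);
    return 0;
}

Even if the exact numbers are off by a factor of two or three, the spread covers five or six orders of magnitude, which is exactly the kind of difference an abstraction layer should not be papering over.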

Add these details to the increased difficulty of programming in a shared-state concurrency system, and it starts to look like a pretty bad idea. There are plenty of established mechanisms for concurrency and distribution that work well and provide a model simple enough to reason about effectively. Letting people who are used to writing threaded C/C++/Java code on platforms with limited parallelism carry those paradigms over to highly parallel systems is NOT a good idea. Retraining them to use MPI, tuple spaces, or some other reasonable mechanism for distributed programming is definitely worth the effort.
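For what it's worth, the explicit style isn't exotic or hard to pick up. Here's a minimal sketch of an MPI program (assuming an MPI implementation such as MPICH or Open MPI; compile with mpicc, run with mpirun -np 2). The single network transfer in it is impossible to miss, which is precisely the point:

/* explicit_transfer.c: remote data moves only when you ask for it. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, remote = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* The remote access is an explicit receive; the cost of the
         * network round trip is visible right here at the call site. */
        MPI_Recv(&remote, 1, MPI_INT, 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 0 received %d from rank 1\n", remote);
    } else if (rank == 1) {
        int value = 42;
        MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}

A tuple-space system like Linda makes the same move with put/take operations on a shared coordination space; either way, communication is a first-class operation you can see and count, not a side effect of a pointer dereference.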

                --Levi
