On 11-nov-10, at 20:41, Russel Winder wrote:
On Thu, 2010-11-11 at 15:16 +0100, Fawzi Mohamed wrote:
[ . . . ]
On this I am not so sure: heterogeneous clusters are more difficult to program, and GPUs & co. are slowly becoming more and more general purpose.
Being able to take advantage of them is useful, but I am not convinced they are necessarily the future.
The Intel roadmap is for processor chips that have a number of cores with different architectures. Heterogeneity is not going to be a choice, it is going to be an imposition. And this is at bus level, not at cluster level.
Vector co-processors, yes, I see that, and in the short term the effect of things like AMD Fusion (CPU/GPU merging).
Is this necessarily the future? I don't know, and neither does Intel, I think, as they are still evaluating Larrabee.
But CPU/GPU combinations will stay around for some time more, for sure.
[ . . . ]
Yes, many-core is the future, I agree on this, and also that a distributed approach is the only way to scale to a really large number of processors.
But distributed systems *are* more complex, so I think that for the foreseeable future one will have a hybrid approach.
Hybrid is what I am saying is the future, whether we like it or not. SMP as the whole system is the past.
I disagree that distributed systems are more complex per se. I suspect comments are getting so general here that anything anyone writes can be seen as both true and false simultaneously. My perception is that shared memory multithreading is less and less a tool that applications programmers should be thinking in terms of. Multiple processes with a hierarchy of communications costs is the overarching architecture, with each process potentially being SMP or CSP or . . .
I agree that on not-too-large shared memory machines a hierarchy of tasks is the correct approach.
This is what I did in blip.parallel.smp. Using it one can have fairly efficient automatic scheduling, and so forget most of the complexities and the actual hardware configuration.
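Something like the sketch below is the kind of hierarchy I mean (this is not blip's API, just the same idea written in C with OpenMP tasks; the cutoff and the workload are placeholders): work is split recursively into tasks and the runtime scheduler decides where each task runs, so the programmer never touches threads or the machine layout directly.

/* Hierarchical task parallelism: recursive splitting into tasks,
   scheduling left to the runtime. Illustrative sketch only. */
#include <stdio.h>

/* Recursively sum arr[lo..hi), spawning subtasks above a cutoff. */
static long sum_range(const long *arr, int lo, int hi)
{
    if (hi - lo < 1024) {                 /* small enough: do it serially */
        long s = 0;
        for (int i = lo; i < hi; ++i) s += arr[i];
        return s;
    }
    int mid = lo + (hi - lo) / 2;
    long left, right;
    #pragma omp task shared(left)         /* left half becomes a child task */
    left = sum_range(arr, lo, mid);
    right = sum_range(arr, mid, hi);      /* right half stays in this task */
    #pragma omp taskwait                  /* wait for the child task */
    return left + right;
}

int main(void)
{
    enum { N = 1 << 20 };
    static long arr[N];
    for (int i = 0; i < N; ++i) arr[i] = i;

    long total = 0;
    #pragma omp parallel                  /* team of worker threads */
    #pragma omp single                    /* one thread seeds the task tree */
    total = sum_range(arr, 0, N);

    printf("sum = %ld\n", total);
    return 0;
}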
Again, I am not sure the situation is as dire as you paint it; Linux does quite well in the HPC field... but I agree that to be the ideal OS for these architectures it will need more changes.
The Linux driver architecture is already creaking at the seams; it implies a central, monolithic approach to the operating system. This falls down in a multiprocessor shared memory context. The fact that the Top 500 generally use Linux is because it is the least worst option. M$, despite throwing large amounts of money at the problem, and indeed buying some very high profile names to try and do something about the lack of traction, have failed to make any headway in the HPC operating system stakes. Do you want to have to run a virus checker on your HPC system?
My gut reaction is that we are going to see a rise of hypervisors as per Tilera chips, at least in the short to medium term, simply as a bridge from today's OSes to the future. My guess is that L4 microkernels and/or nanokernels, exokernels, etc. will find a central place in future systems. The problem to be solved is ensuring that the appropriate ABI is available on the appropriate core at the appropriate time. Mobility of ABI is the critical factor here.
Yes, microkernels & co. will be more and more important (but I wonder how much this will be the case for the desktop).
ABI mobility? Not so sure; for HPC I can imagine having to compile to different ABIs (but maybe that is what you mean by ABI mobility).
[ . . . ]
Whole array operations are useful, and when possible one gains much by using them; unfortunately not all problems can be reduced to a few large array operations, and data parallel languages are not the main type of language for this reason.
Agreed. My point was that in 1960s code people explicitly handled array operations using do loops because they had to. Nowadays such code is anathema to efficient execution. My complaint here is that people have put effort into compiler technology instead of rewriting the codes in a better language and/or idiom. Clearly whole array operations only apply to algorithms that involve arrays!
[ . . . ]
Well, whole array operations are a generalization of the SPMD approach, so in this sense you said that that kind of approach will have a future (but with more difficult optimization, as the hardware is more complex).
I guess this is where the PGAS people are challenging things. Applications can be couched in terms of array algorithms which can be scattered across distributed memory systems. Inappropriate operations lead to huge inefficiencies, but handled correctly, the code runs very fast.
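In plain SPMD terms (a PGAS language would express the distribution implicitly, but the shape of the computation is the same), a whole-array operation scattered across distributed memory looks roughly like the sketch below; the operation and the sizes are arbitrary.

/* Each rank owns a slice of the array; collectives move the data,
   the "array operation" itself is purely local. Illustrative only. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    enum { N_LOCAL = 4 };                     /* elements per rank */
    double *x = NULL, *y = NULL;
    if (rank == 0) {                          /* root holds the full arrays */
        x = malloc(N_LOCAL * size * sizeof *x);
        y = malloc(N_LOCAL * size * sizeof *y);
        for (int i = 0; i < N_LOCAL * size; ++i) x[i] = i;
    }

    double x_local[N_LOCAL], y_local[N_LOCAL];
    MPI_Scatter(x, N_LOCAL, MPI_DOUBLE,       /* distribute the slices */
                x_local, N_LOCAL, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    for (int i = 0; i < N_LOCAL; ++i)         /* whole-array op: y = 2*x */
        y_local[i] = 2.0 * x_local[i];

    MPI_Gather(y_local, N_LOCAL, MPI_DOUBLE,  /* collect the result */
               y, N_LOCAL, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("y[0]=%g y[last]=%g\n", y[0], y[N_LOCAL * size - 1]);
    free(x); free(y);
    MPI_Finalize();
    return 0;
}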
About MPI, I think that many don't see what MPI really does: MPI offers a simplified parallel model.
The main weakness of this model is that it assumes some kind of reliability, but in return it offers a clear computational model, with processors ordered in a linear or higher-dimensional structure, and efficient collective communication primitives.
Yes, MPI is not the right choice for all problems, but when usable it is very powerful, often superior to the alternatives, and programming with it is *simpler* than thinking about a generic distributed system.
So I think that for problems that are not trivially parallel or easily parallelizable, MPI will remain the best choice.
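For concreteness, here is a small sketch of what I mean by that model: processes ordered on a (here 2-D, periodic) Cartesian grid, point-to-point exchange with grid neighbours, and an efficient collective. The grid shape and the payload are arbitrary.

/* Processes on a 2-D periodic grid, neighbour exchange plus a
   collective sum. Illustrative sketch only. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int dims[2] = {0, 0}, periods[2] = {1, 1};
    MPI_Dims_create(size, 2, dims);              /* factor size into a 2-D grid */
    MPI_Comm grid;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &grid);

    int up, down;
    MPI_Cart_shift(grid, 0, 1, &up, &down);      /* neighbours along dimension 0 */

    double mine = rank, from_up = -1.0;
    MPI_Sendrecv(&mine, 1, MPI_DOUBLE, down, 0,  /* point-to-point along the grid */
                 &from_up, 1, MPI_DOUBLE, up, 0,
                 grid, MPI_STATUS_IGNORE);

    double total = 0.0;
    MPI_Allreduce(&mine, &total, 1, MPI_DOUBLE,  /* collective primitive */
                  MPI_SUM, grid);

    if (rank == 0)
        printf("grid %dx%d, sum of ranks = %g\n", dims[0], dims[1], total);

    MPI_Comm_free(&grid);
    MPI_Finalize();
    return 0;
}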
I guess my main irritant with MPI is that I have to run the same executable on every node and, perhaps more importantly, the message passing structure is founded on Fortran primitive data types. OK, so you can hack up some element of abstraction so as to send complex messages, but it would be far better if the MPI standard provided better abstractions.
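(The kind of hack meant here is, for example, an MPI derived datatype, so that a C struct travels as one message rather than as a flat buffer of primitive elements; the struct and the two-rank setup below are invented for the illustration.)

/* Describe a struct layout to MPI once, then send it as one message.
   Run with at least two ranks. Illustrative sketch only. */
#include <mpi.h>
#include <stddef.h>
#include <stdio.h>

typedef struct {
    int    id;
    double pos[3];
} Particle;

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int          blocklens[2] = {1, 3};
    MPI_Aint     displs[2]    = {offsetof(Particle, id), offsetof(Particle, pos)};
    MPI_Datatype types[2]     = {MPI_INT, MPI_DOUBLE};
    MPI_Datatype particle_t;
    MPI_Type_create_struct(2, blocklens, displs, types, &particle_t);
    MPI_Type_commit(&particle_t);

    if (size >= 2 && rank == 0) {
        Particle p = {42, {1.0, 2.0, 3.0}};
        MPI_Send(&p, 1, particle_t, 1, 0, MPI_COMM_WORLD);
    } else if (size >= 2 && rank == 1) {
        Particle p;
        MPI_Recv(&p, 1, particle_t, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("got particle %d at (%g, %g, %g)\n",
               p.id, p.pos[0], p.pos[1], p.pos[2]);
    }

    MPI_Type_free(&particle_t);
    MPI_Finalize();
    return 0;
}

The other route is MPI_Pack/MPI_Unpack, which is the "more generic packing/unpacking" mentioned in the reply below; either way the abstraction has to be built by hand.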
PGAS and MPI both have the same executable everywhere, but MPI is more flexible with respect to making different parts execute different things, and MPI does provide more generic packing/unpacking, but I guess I see your problems with it.
Having the same executable is a big constraint, but it is also a simplification.
[ . . . ]
It might be a personal thing, but I am kind of "suspicious" toward PGAS; I find a generalized MPI model better than PGAS when you want to have separate address spaces.
Using MPI one can define a PGAS-like object wrapping local storage with an object that sends remote requests to access remote memory pieces.
This means having a local server where these wrapped objects can be "published" and that can respond at any moment to external requests. I call this rpc (remote procedure call), and it can be realized easily on top of MPI.
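A minimal sketch of such a serving loop is below (this is not blip's actual code; the tags, the message layout and the single-client structure are invented for the illustration, and a real version would run the server in its own thread so that each rank can compute and serve at the same time).

/* Rank 0 is the client; every other rank "publishes" a local array
   and answers remote get requests in a serving loop. */
#include <mpi.h>
#include <stdio.h>

enum { TAG_GET = 1, TAG_REPLY = 2, TAG_STOP = 3, N_LOCAL = 8 };

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local[N_LOCAL];                     /* this rank's published piece */
    for (int i = 0; i < N_LOCAL; ++i) local[i] = rank * 100.0 + i;

    if (rank == 0) {                           /* client side */
        for (int owner = 1; owner < size; ++owner) {
            int idx = 3;                       /* ask for element 3 of each owner */
            double value;
            MPI_Send(&idx, 1, MPI_INT, owner, TAG_GET, MPI_COMM_WORLD);
            MPI_Recv(&value, 1, MPI_DOUBLE, owner, TAG_REPLY,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank %d, element %d = %g\n", owner, idx, value);
        }
        for (int owner = 1; owner < size; ++owner)   /* shut the servers down */
            MPI_Send(&rank, 1, MPI_INT, owner, TAG_STOP, MPI_COMM_WORLD);
    } else {                                   /* serving loop */
        for (;;) {
            MPI_Status st;
            int idx;
            MPI_Recv(&idx, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP) break;
            MPI_Send(&local[idx], 1, MPI_DOUBLE, 0, TAG_REPLY, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}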
As not all objects are distributed, and in a complex program it does not always make sense to distribute these objects on all processors or none, I find the robust partitioning and collective communication primitives of MPI superior to PGAS.
With enough effort you can probably get everything from PGAS too, but then you lose all its simplicity.
I think we are going to have to take this one off the list. My summary is that MPI and PGAS solve different problems differently. There are some problems that one can code up neatly in MPI and that are ugly in PGAS, but the converse is also true.
Yes, I guess that is true.
[ . . . ]
The situation is not so dire: some problems are trivially parallel or can be solved with simple parallel patterns, and others don't need to be solved in parallel, as the sequential solution is fast enough; but I do agree that being able to develop parallel systems is increasingly important.
In fact it is something that I like to do, and that I have thought about a lot.
I have programmed parallel systems, and out of that experience I tried to build something to write parallel programs "the way it should be", or at least the way I would like it to be ;)
The real question is whether future computers will run Word,
OpenOffice.org, Excel, Powerpoint fast enough so that people don't
complain. Everything else is an HPC ghetto :-)
The result is what I did with blip, http://dsource.org/projects/blip .
I don't think that (excluding some simple examples) fully automatic (transparent) parallelization is really feasible.
At some point being parallel is more complex, and it puts an extra burden on the programmer.
Still, it is possible to have several levels of parallelization, and if you write a fully parallel program it should still be possible to use it relatively efficiently locally, but a local program will not automatically become fully parallel.
At the heart of all this is that programmers are taught that an algorithm is a sequence of actions to achieve a goal. Programmers are trained to think sequentially, and this affects their coding. This means that parallelism has to be expressed at a sufficiently high level that programmers can still reason about algorithms as sequential things.
When you have a network of things communicating (and I think that once you have a distributed system you are at that level), then it is no longer sufficient to think about each piece in isolation; you have to think about the interactions too.
There are some patterns that might help reduce the complexity (client/server, map/reduce, ...), but in general it is more complex.
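For instance, the map/reduce pattern in its simplest MPI form: every rank "maps" over the data it owns and a single collective "reduces" the partial results; the workload here is a placeholder.

/* map: each rank sums the squares of the numbers it owns;
   reduce: combine the partial sums on rank 0. Illustrative only. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double partial = 0.0;                      /* map over this rank's share */
    for (int i = rank; i < 1000; i += size)
        partial += (double)i * i;

    double total = 0.0;
    MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum of squares below 1000 = %g\n", total);

    MPI_Finalize();
    return 0;
}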