It's not that there aren't solutions available for specific problems. The
challenge is that some of those solutions don't scale well, or that they
are not general enough to handle the gamut of non-embarrassingly-parallel
(non-EP) problems.

I don't think there will be a silver bullet that fixes everything, but I
do think we'll evolve toward classes of solutions that address certain
classes of problems.  After all, we don't use the same error-correction
codes for memory as for hard disks.

But the basic underlying comment is right:  a lot of HPC software design
assumes perfect hardware, or that the hardware failure rate is
sufficiently low that checkpoint/restart (or "do it all over from the
beginning") is an acceptable strategy.

This is fine.  It's hard enough to figure out how to
parallelize/clusterize a solution (it took some decades to do it).
I'm confident that over the next few decades we'll figure out how to deal
with unreliable hardware and software (because, after all, software bugs
are a problem too).

On 11/23/12 2:29 AM, "Luc Vereecken" <[email protected]> wrote:

>At the same time, there are APIs (e.g., HTCondor) that do not assume
>successful communication or computation; they are used in large
>distributed computing projects such as SETI@HOME, FOLDING@HOME, and
>distributed.net (though I don't think distributed.net has a toolbox
>available).  For embarrassingly parallel workloads they can be a good
>match; for tightly coupled workloads, not always.
>
>Luc
>
>
>
>On 11/23/2012 5:19 AM, Justin YUAN SHI wrote:
>> The fundamental problem rests in our programming API. If you look at
>> MPI and OpenMP carefully, you will find that these and all others have
>> one common assumption: the application-level communication is always
>> successful.
>>
>> We knew full well that this cannot be true.

_______________________________________________
Beowulf mailing list, [email protected] sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
