That's a little bit strong - OMPI still supports checkpoint/restart as a fault 
tolerance mechanism. There really isn't anything the sys admin has to do, 
though - what is required is that users periodically order their programs to 
checkpoint so they can be restarted after a failure.

Checkpointing is typically done either by the app itself (say, when it reaches 
a point it considers a good one to save), or by a script that just orders 
a checkpoint every so many seconds.
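
As a rough illustration of the script approach, here is a minimal sketch in 
Python. It assumes an Open MPI build with checkpoint/restart support, a job 
launched with that support enabled, and the ompi-checkpoint command that takes 
the PID of the running mpirun. The function names, the interval, and the 
checkpoint limit are all hypothetical; treat this as a sketch, not a 
recommended tool.

```python
import subprocess
import time


def build_checkpoint_command(mpirun_pid, tool="ompi-checkpoint"):
    """Return the argv for one checkpoint request against a running mpirun.

    Assumption: the checkpoint tool is invoked as `<tool> <PID of mpirun>`.
    """
    return [tool, str(mpirun_pid)]


def checkpoint_periodically(mpirun_pid, interval_s, max_checkpoints,
                            run=subprocess.call, sleep=time.sleep):
    """Request a checkpoint every interval_s seconds.

    Makes at most max_checkpoints requests and stops early as soon as one
    request exits non-zero (for example, because the job already finished).
    Returns the number of requests that exited with status 0.
    `run` and `sleep` are injectable so the loop can be tried without MPI.
    """
    succeeded = 0
    for _ in range(max_checkpoints):
        sleep(interval_s)
        if run(build_checkpoint_command(mpirun_pid)) != 0:
            break
        succeeded += 1
    return succeeded
```

With a real job this would be called as something like 
checkpoint_periodically(pid_of_mpirun, 600, 100). How often to checkpoint is a 
trade-off between the cost of writing each snapshot and the amount of work 
lost when a node fails.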

What we have said is that we don't believe the FT "run thru failure" position 
pushed by UTK is particularly required at this time. That is partly a question 
of impact vs benefit, but mostly because competing approaches offer equivalent 
fault recovery capability with less impact. But that's a separate discussion.


On Jun 19, 2012, at 11:16 AM, George Bosilca wrote:

> It has been clearly stated that the official position pushed forward by a 
> majority of the Open MPI developer community is that fault tolerance is not 
> needed so we (read this as the official version of Open MPI) do not support 
> it.
> 
> However, a group of researchers have been working toward a version of Open 
> MPI that supports the last fault tolerance proposal submitted for 
> consideration to the MPI Forum. You can access it at 
> https://bitbucket.org/jjhursey/ompi-ulfm-rts.
> 
>   george. 
> 
> On Jun 19, 2012, at 09:58 , 陈松 wrote:
> 
>> Hi all,
>> 
>> Can anyone explain the fault tolerance features in Open MPI to me? I've read 
>> the FAQs and some of the papers on this topic listed on open-mpi.org, but I 
>> still can't figure out what the fault tolerance mechanism in Open MPI does 
>> when one node of my supercomputer system fails during a computation, and 
>> what we system administrators should do after the failure (or before it). 
>> 
>> Can anyone help me? My boss wants me to deploy Open MPI on our system 
>> because he wants the fault tolerance feature.
>> 
>> Thanks very much.
>> 
>> 
>> 
>> ---------------
>> CHEN Song
>> R&D Department
>> National Supercomputer Center in Tianjin
>> Binhai New Area, Tianjin, China
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
