That's a little bit strong - OMPI still supports checkpoint/restart as a fault tolerance mechanism. There really isn't anything the sys admin has to do, though - what is required is that users periodically order their programs to checkpoint so they can be restarted after a failure.
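One way users can order those periodic checkpoints is a small driver script. The following is only a rough sketch, not something shipped with OMPI. It assumes the job was launched with checkpoint/restart enabled (for example `mpirun -am ft-enable-cr ...`) and that your build provides the `ompi-checkpoint` tool, which takes the PID of `mpirun`. The function name, its parameters, and the stop-on-error behavior are all made up for illustration.

```python
import subprocess
import time


def periodic_checkpoint(mpirun_pid, interval_s, max_checkpoints=None,
                        run=subprocess.run, sleep=time.sleep):
    """Ask for a checkpoint of the job every interval_s seconds.

    Returns the number of checkpoint requests that succeeded.
    `run` and `sleep` are injectable so the loop can be exercised
    without a real MPI job (hypothetical design choice).
    """
    done = 0
    while max_checkpoints is None or done < max_checkpoints:
        sleep(interval_s)
        # Assumption: ompi-checkpoint is on PATH and accepts the PID
        # of mpirun as its only required argument.
        result = run(["ompi-checkpoint", str(mpirun_pid)])
        if result.returncode != 0:
            # The job has probably finished or died; stop asking.
            break
        done += 1
    return done
```

Calling `periodic_checkpoint(12345, 600)` would request a checkpoint every ten minutes until the request fails, where 12345 stands in for the PID of your `mpirun`. After a node failure the user would then restart from the most recent global snapshot, which in the releases that ship this feature is done with `ompi-restart`.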
Checkpointing is typically done either by the app itself (say, when it reaches some point it feels is a good one to save), or by a script that just orders a checkpoint every so many seconds.

What we have said is that we don't believe the FT "run thru failure" position pushed by UTK is particularly required at this time. That is partly a question of impact vs. benefit, but mostly because competing approaches offer equivalent fault recovery capability with less impact. But that's a separate discussion.

On Jun 19, 2012, at 11:16 AM, George Bosilca wrote:

> It has been clearly stated that the official position pushed forward by a
> majority of the Open MPI developer community is that fault tolerance is not
> needed so we (read this as the official version of Open MPI) do not support
> it.
>
> However, a group of researchers have been working toward a version of Open
> MPI that supports the last fault tolerance proposal submitted for
> consideration to the MPI Forum. You can access it at
> https://bitbucket.org/jjhursey/ompi-ulfm-rts.
>
> george.
>
> On Jun 19, 2012, at 09:58 , 陈松 wrote:
>
>> Hi all,
>>
>> Can anyone explain me the fault tolerant features in OpenMPI? I've read the
>> FAQs and some papers about this topic listed in open-mpi.org, but still
>> can't figure out when one node of my supercomputer system fails down during
>> computing, what would happen with the fault tolerant mechanism in OpenMPI,
>> and what should we system administrator do after the failure (or before).
>>
>> Can anyone help me? My boss want me to deploy OpenMPI in our system cuz he
>> want the fault tolerant feature.
>>
>> Thanks very much.
>>
>> ---------------
>> CHEN Song
>> R&D Department
>> National Supercomputer Center in Tianjin
>> Binhai New Area, Tianjin, China
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users