Open MPI's fault tolerance is still somewhat rudimentary; fault tolerance is 
a complex topic across the entire scope of MPI.  There has been much research 
into MPI and fault tolerance over the years, and the MPI Forum itself is still 
grappling with terms and definitions that make sense.  It's by no means a 
"solved" problem.

It's unfortunately unsurprising that Open MPI may hang in the case of a node 
crash.  I wish that I had a better answer for you, but I don't.  :-\


On Sep 24, 2010, at 3:36 AM, Olivier Riff wrote:

> Hello,
> 
> My question concerns the display of the error message generated by a throw 
> std::runtime_error("Explicit error message").
> I am launching an Open MPI program from a terminal on several machines using:
> mpirun -v -machinefile MyMachineFile.txt MyProgram
> I am wondering why I cannot see an error message displayed on the terminal 
> when one of my remote nodes (meaning not the node where the terminal is used) 
> crashes. I was expecting that the following try/catch would also generate 
> output in the terminal:
> try { ...My code where a crash happens... }
> catch (...)
> {
>   throw std::runtime_error( "Explicit error message" );
> }
> 
> Generally, my problem is that one of the nodes crashes and the global 
> application waits forever for data from this node. Nothing is displayed on 
> the terminal to indicate that the node has crashed or to give any useful 
> information about the nature of the crash.
> 
> (I don't think this information is relevant here, but just in case: I am 
> using Open MPI 1.4.2 on a Mandriva 2008 system.)
> 
> Thanks in advance for any help/info/advice.
> 
> Olivier


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

