Open MPI's fault tolerance is still somewhat rudimentary; it's a complex topic within the entire scope of MPI. There has been much research into MPI and fault tolerance over the years; the MPI Forum itself is grappling with terms and definitions that make sense. It's by no means a "solved" problem.
It's unfortunately unsurprising that Open MPI may hang in the case of a node crash. I wish that I had a better answer for you, but I don't. :-\

On Sep 24, 2010, at 3:36 AM, Olivier Riff wrote:

> Hello,
>
> My question concerns the display of error message generated by a throw
> std::runtime_error("Explicit error message").
> I am launching on a terminal an openMPI program on several machines using:
> mpirun -v -machinefile MyMachineFile.txt MyProgram.
> I am wondering why I cannot see an error message displayed on the terminal
> when one of my distant node (meaning not the node where the terminal is used)
> is crashing. I was expecting that following try catch could also generates a
> display in the terminal:
> try {...My code where a crash happens... }
> {
> throw std::runtime_error( "Explicit error message" );
> }
>
> Generally, my problem is that one of the node crashes and the global
> application waits forever data from this node. On the terminal, nothing is
> displayed indicating that the node has crashed and generated a useful
> information of the crash nature.
>
> ( I don't think these information are relevant here, but just in case: I am
> using openMPI 1.4.2, on a Mandriva 2008 system )
>
> Thanks in advance for any help/info/advice.
>
> Olivier

--
Jeff Squyres
jsquy...@cisco.com
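
For the narrower case described in the quoted message, where a rank throws a C++ exception rather than the whole node going down, one common mitigation is to catch the exception at the top level of each rank, print it to stderr, and then abort the whole job so the other ranks do not keep waiting. The following is only a minimal sketch of that pattern, not something Open MPI provides: `run_guarded` and its `abort_job` callback are hypothetical names introduced here, and in a real program the callback would typically call `MPI_Abort(MPI_COMM_WORLD, code)`. It does not help if the node itself dies before the handler runs.

```cpp
#include <cstdio>
#include <exception>
#include <functional>
#include <stdexcept>

// Hypothetical helper (not part of MPI or Open MPI).
// Runs `work`; if it throws a std::exception, reports the message on stderr
// tagged with the rank, then calls `abort_job` with a non-zero code.
// Assumption: in a real MPI program, abort_job would wrap
//   MPI_Abort(MPI_COMM_WORLD, code);
// so that the remaining ranks are torn down instead of waiting forever.
int run_guarded(int rank,
                const std::function<void()>& work,
                const std::function<void(int)>& abort_job) {
    try {
        work();
        return 0;  // normal completion, nothing to abort
    } catch (const std::exception& e) {
        // Flush explicitly so the message is not lost if the process
        // is terminated right after the abort request.
        std::fprintf(stderr, "[rank %d] fatal: %s\n", rank, e.what());
        std::fflush(stderr);
        abort_job(1);
        return 1;
    }
}
```

In `main`, this would wrap the body between `MPI_Init` and `MPI_Finalize`, with the rank obtained from `MPI_Comm_rank`. Passing the abort action as a callback keeps the sketch independent of any particular MPI installation.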