On Apr 22, 2009, at 11:43 PM, Tsung Han Shie wrote:

Unfortunately, after I thoroughly examined the entire cluster, I found a bad node with a busted hard drive. That is why the job hung. Also, when the job was submitted with one bad node in the machinefile, neither Open MPI nor my program gave me any error messages, which is why I couldn't find the reason for the hang.

Interesting.  Sorry OMPI didn't provide more diagnostics.  :-\

Did you get the information that you needed about the OpenFabrics optimization stuff?

Note that we released OMPI 1.3.2 yesterday, which fixed the mpi_leave_pinned stuff, but also note that the treatment of the mpi_leave_pinned MCA parameter changed slightly. Please see this FAQ entry for details:

    http://www.open-mpi.org/faq/?category=openfabrics#setting-mpi-leave-pinned-1.3.2

Also, since you did apparently mean v1.1.3, note that that version is ancient. Much has happened in Open MPI to improve scalability and performance (and diagnostics!) since the 1.1 series. If it's possible for you to upgrade, I encourage you to do so.

--
Jeff Squyres
Cisco Systems
